Unified Benchmark for Evaluating Performance, Bias, and Consistency in LLM Binary Question Answering

Authors

  • Olesia Khrapunova

Keywords:

Benchmarking, Bias, Binary Question Answering, Consistency, Large Language Models, Performance Evaluation

Abstract

Binary question answering is central to many real-world applications of large language models (LLMs), such as fact-checking or decision-making support. Yet, despite its prevalence and the high stakes of getting a binary judgment wrong (where an error yields the exact opposite outcome), there is no recent comprehensive benchmark dedicated to evaluating LLM behavior on this task. To address this gap, we introduce a unified benchmark for assessing binary QA along three dimensions: performance, bias, and consistency. The benchmark is supported by a five-domain dataset augmented with newly created controlled reformulations of each question, including paraphrases, negations, and answer-option variations. Across fifteen state-of-the-art LLMs, we find strong overall performance on the task, with larger and reasoning-optimized models outperforming their smaller variants. At the same time, we observe a pervasive No-leaning bias, universally weak consistency on semantically opposite questions, and substantial cross-domain variation. Reading comprehension and multi-hop reasoning questions are handled reliably, whereas numerical reasoning, ethical judgment, and, especially, translation evaluation remain challenging. These findings reveal both the strengths and shortcomings of current LLMs on binary QA, giving researchers a basis for targeted future improvements and helping practitioners make informed choices when deploying these models in binary decision contexts.
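
To make the three evaluation dimensions concrete, the sketch below shows one way such scores could be computed for a binary-QA model. The item structure, field names, and scoring formulas are illustrative assumptions for exposition, not the benchmark's actual implementation.

```python
# Illustrative sketch (not the paper's actual code): scoring performance,
# No-leaning bias, and consistency for binary QA over question variants.
# Field names and formulas are assumptions made for this example.
from dataclasses import dataclass
from typing import List


@dataclass
class Item:
    gold: str              # "yes" or "no" for the original question
    pred_original: str     # model answer to the original phrasing
    pred_paraphrase: str   # answer to a meaning-preserving paraphrase
    pred_negation: str     # answer to the negated question (answer should flip)


def flip(ans: str) -> str:
    return "no" if ans == "yes" else "yes"


def evaluate(items: List[Item]) -> dict:
    n = len(items)
    # Performance: plain accuracy on the original questions.
    accuracy = sum(it.pred_original == it.gold for it in items) / n
    # Bias: share of "no" predictions minus share of "no" gold labels;
    # a positive value indicates a No-leaning tendency.
    no_bias = (sum(it.pred_original == "no" for it in items)
               - sum(it.gold == "no" for it in items)) / n
    # Consistency: paraphrases should preserve the answer,
    # negations should flip it.
    paraphrase_consistency = sum(
        it.pred_paraphrase == it.pred_original for it in items) / n
    negation_consistency = sum(
        it.pred_negation == flip(it.pred_original) for it in items) / n
    return {
        "accuracy": accuracy,
        "no_bias": no_bias,
        "paraphrase_consistency": paraphrase_consistency,
        "negation_consistency": negation_consistency,
    }


if __name__ == "__main__":
    demo = [
        Item("yes", "yes", "yes", "no"),
        Item("no", "no", "no", "no"),
    ]
    print(evaluate(demo))
```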

Author Biography

  • Olesia Khrapunova

    Senior AI/ML Engineer, Paris, France

Published

2025-12-27

Section

Articles

How to Cite

Olesia Khrapunova. (2025). Unified Benchmark for Evaluating Performance, Bias, and Consistency in LLM Binary Question Answering. International Journal of Computer (IJC), 56(1), 319-338. https://ijcjournal.org/InternationalJournalOfComputer/article/view/2470