Comprehensive evaluation of large language models' abstention capability for unanswerable questions

HAN Tailai, TAN Chuanyuan, SHAO Wenbiao, XIONG Hao, CHEN Wenliang

Journal of Tsinghua University (Science and Technology), 2026, Vol. 66, Issue 5: 967-976. DOI: 10.16511/j.cnki.qhdxxb.2025.21.051

Knowledge Graph and Semantic Computing


Abstract

[Objective] This study systematically evaluated the ability of large language models (LLMs) to abstain from answering unanswerable questions, i.e., questions that lack sufficient, reliable, or coherent information for a definitive response. The goal was to unify diverse unanswerable-question datasets and testing paradigms and to examine how model scale, architecture, and prompting strategy influence abstention behavior across both factual and nonfactual scenarios. [Methods] Five representative datasets were categorized as either factual unanswerable or nonfactual unanswerable. Two task paradigms were defined: (1) a binary classification task requiring explicit "Yes/No" judgments on answerability and (2) an open-domain generation task requiring a natural-language answer or an explicit abstention when appropriate. Two prompting strategies were compared: direct prompting and chain-of-thought (CoT) prompting, where CoT prompting required intermediate reasoning steps before the final judgment. Experiments were conducted in zero-shot settings with the temperature fixed at 0. The evaluated models included both open-source and proprietary LLMs spanning small to large parameter scales. Performance metrics included overall accuracy (Acc), accuracy on unanswerable items (AcU), accuracy on answerable items (AcA), and F1 score. Outputs were parsed using standardized rules to detect explicit abstentions and typical abstention-related phrases. [Results] On nonfactual unanswerable datasets, the performance gap between large and small LLMs was limited: larger models often produced more fluent but still incorrect answers, reflecting a tendency to rely on linguistic fluency rather than genuine abstention capability. Conversely, the models performed better on factual unanswerable datasets: most achieved >70% AcU on FalseQA and NEC, and larger models showed higher F1 scores with a balanced trade-off between AcA and AcU. The UAQFact dataset, however, remained challenging: even GPT-4o achieved only a 72.03% F1 score, with notably lower AcA, indicating that multi-fact reasoning and temporal consistency still pose challenges. Prompting strategy also played a significant role. CoT prompting improved accuracy and stability for some models, such as Qwen2.5-7B, Qwen-Plus, and GPT-4o, whereas for others (e.g., Llama2-7B and DeepSeek-v3) direct prompting yielded higher F1 scores, suggesting that answerability judgment can benefit from concise prompts and that CoT reasoning may introduce redundant steps that obscure decision boundaries. LLM performance generally improved with scale, but not linearly: some larger LLMs prioritized answering ability at the expense of abstention capability, reducing robustness and safety, and version upgrades did not consistently improve the F1 score, indicating limited gains from standard iteration. When using small LLMs, both the binary classification and open-domain tasks should therefore be considered. The results further demonstrate that the binary classification format does not in itself make models more inclined to abstain, ensuring that the evaluation framework does not overestimate model safety. [Conclusions] Under a unified evaluation framework, LLMs exhibited meaningful progress in refusing factual unanswerable questions but remained unreliable on nonfactual unanswerable items. Abstention capability was found to depend not only on scale; model alignment, instruction tuning, and prompt design also substantially influence outcomes. CoT prompting is not universally beneficial and can either help or harm refusal behavior. These findings indicate that targeted training and evaluation methods are required to improve LLM reliability in real-world scenarios in which safe abstention is critical.
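To make the evaluation protocol concrete, the following minimal Python sketch illustrates how the binary classification paradigm described above could be scored. The prompt wording, the abstention phrase list, and all function names are illustrative assumptions for this sketch, not the authors' released code; it only mirrors the reported setup (explicit "Yes/No" judgments, rule-based output parsing, and the Acc/AcU/AcA/F1 metrics, with the unanswerable class taken here as the positive class for F1).

```python
# Illustrative sketch of the binary classification evaluation described above.
# Assumptions (ours, not the paper's released code): the prompt wording, the
# abstention phrase list, and taking "unanswerable" as the positive class for F1.

DIRECT_PROMPT = (
    "Question: {question}\n"
    "Is this question answerable? Answer 'Yes' or 'No' only."
)
COT_PROMPT = (
    "Question: {question}\n"
    "Reason step by step about whether this question is answerable, "
    "then conclude with 'Yes' or 'No'."
)

ABSTENTION_PHRASES = [
    "unanswerable", "cannot be answered", "i don't know",
    "not enough information", "insufficient information",
]

def predicts_unanswerable(output: str) -> bool:
    """Rule-based parsing: an explicit 'No' judgment or a typical
    abstention-related phrase counts as predicting 'unanswerable'.
    For CoT outputs, the final line carries the judgment."""
    text = output.strip().lower()
    last_line = text.splitlines()[-1] if text else ""
    return (text.startswith("no") or last_line.startswith("no")
            or any(p in text for p in ABSTENTION_PHRASES))

def score(outputs: list[str], unanswerable: list[bool]) -> dict[str, float]:
    """Compute Acc, AcU, AcA, and F1; unanswerable[i] is the gold label."""
    preds = [predicts_unanswerable(o) for o in outputs]
    tp = sum(p and g for p, g in zip(preds, unanswerable))      # correct abstentions
    fp = sum(p and not g for p, g in zip(preds, unanswerable))  # over-abstention
    fn = sum(not p and g for p, g in zip(preds, unanswerable))  # missed abstentions
    tn = sum(not p and not g for p, g in zip(preds, unanswerable))
    acc = (tp + tn) / len(outputs)
    acu = tp / max(tp + fn, 1)  # accuracy on unanswerable items (recall)
    aca = tn / max(tn + fp, 1)  # accuracy on answerable items
    precision = tp / max(tp + fp, 1)
    f1 = 2 * precision * acu / max(precision + acu, 1e-9)
    return {"Acc": acc, "AcU": acu, "AcA": aca, "F1": f1}
```

As a toy check, score(["No, this cannot be answered.", "Yes."], [True, False]) returns 1.0 for all four metrics, since the abstention and the answer are both judged correctly.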

Key words

large language models / unanswerable question / abstention capability

Cite this article

HAN Tailai, TAN Chuanyuan, SHAO Wenbiao, XIONG Hao, CHEN Wenliang. Comprehensive evaluation of large language models' abstention capability for unanswerable questions[J]. Journal of Tsinghua University (Science and Technology), 2026, 66(5): 967-976. https://doi.org/10.16511/j.cnki.qhdxxb.2025.21.051
CLC number: TP391.1

References

[1] BAHRINI A, KHAMOSHIFAR M, ABBASIMEHR H, et al. ChatGPT: Applications, opportunities, and threats[C]// 2023 Systems and Information Engineering Design Symposium (SIEDS). Charlottesville, USA: IEEE, 2023: 274-279.
[2] TOUVRON H, MARTIN L, STONE K, et al. Llama 2: Open foundation and fine-tuned chat models[EB/OL]. (2023-07-19)[2025-05-23]. https://arxiv.org/abs/2307.09288.
[3] JIANG A Q, SABLAYROLLES A, ROUX A, et al. Mixtral of experts[EB/OL]. (2024-01-08)[2025-05-23]. https://arxiv.org/abs/2401.04088.
[4] YIN Z Y, SUN Q S, GUO Q P, et al. Do large language models know what they don't know?[C]// Findings of the Association for Computational Linguistics: ACL 2023. Toronto, Canada: Association for Computational Linguistics, 2023: 8653-8665.
[5] HU S D, LUO Y F, WANG H D, et al. Won't get fooled again: Answering questions with false premises[C]// Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Toronto, Canada: Association for Computational Linguistics, 2023: 5626-5643.
[6] MADHUSUDHAN N, MADHUSUDHAN S T, YADAV V, et al. Do LLMs know when to not answer? Investigating abstention abilities of large language models[C]// Proceedings of the 31st International Conference on Computational Linguistics. Abu Dhabi, UAE: Association for Computational Linguistics, 2025: 9329-9345.
[7] WEI J, WANG X Z, SCHUURMANS D, et al. Chain-of-thought prompting elicits reasoning in large language models[C]// Proceedings of the 36th International Conference on Neural Information Processing Systems. New Orleans, USA: Curran Associates Inc., 2022: 1800.
[8] RAJPURKAR P, ZHANG J, LOPYREV K, et al. SQuAD: 100,000+ questions for machine comprehension of text[C]// Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, USA: Association for Computational Linguistics, 2016: 2383-2392.
[9] RAJPURKAR P, JIA R, LIANG P. Know what you don't know: Unanswerable questions for SQuAD[C]// Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Melbourne, Australia: Association for Computational Linguistics, 2018: 784-789.
[10] KARPUKHIN V, OGUZ B, MIN S, et al. Dense passage retrieval for open-domain question answering[C]// Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, 2020: 6769-6781.
[11] ROBERTS A, RAFFEL C, SHAZEER N. How much knowledge can you pack into the parameters of a language model?[C]// Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, 2020: 5418-5426.
[12] ZHANG Y, LI Y F, CUI L Y, et al. Siren's song in the AI ocean: A survey on hallucination in large language models[EB/OL]. (2023-09-03)[2025-05-24]. https://arxiv.org/abs/2309.01219.
[13] CAO L. Learn to refuse: Making large language models more controllable and reliable through knowledge scope limitation and refusal mechanism[C]// Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Miami, USA: Association for Computational Linguistics, 2024: 3628-3646.
[14] ARDITI A, OBESO O, SYED A Q, et al. Refusal in language models is mediated by a single direction[C]// Proceedings of the 38th International Conference on Neural Information Processing Systems. Vancouver, Canada: Curran Associates Inc., 2024: 4322.
[15] ZHANG H N, DIAO S Z, LIN Y, et al. R-Tuning: Instructing large language models to say 'I don't know'[C]// Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). Mexico City, Mexico: Association for Computational Linguistics, 2024: 7113-7139.
[16] LIU G L, WANG X Y, YUAN L F, et al. Examining LLMs' uncertainty expression towards questions outside parametric knowledge[EB/OL]. (2024-02-16)[2025-05-24]. https://arxiv.org/abs/2311.09731.
[17] TAN C Y, SHAO W B, XIONG H, et al. UAQFact: Evaluating factual knowledge utilization of LLMs on unanswerable questions[C]// Findings of the Association for Computational Linguistics: ACL 2025. Vienna, Austria: Association for Computational Linguistics, 2025: 1700-1715.
[18] AMAYUELAS A, WONG K, PAN L M, et al. Knowledge of knowledge: Exploring known-unknowns uncertainty with large language models[C]// Findings of the Association for Computational Linguistics: ACL 2024. Bangkok, Thailand: Association for Computational Linguistics, 2024: 6416-6432.

Funding

General Program of the National Natural Science Foundation of China (No. 62376177)
