Applying large language models (LLMs) to knowledge-based visual question answering (VQA) has achieved remarkable results. However, in specialized domains, and in the power industry in particular, the lack of domain input corpora leaves traditional methods with many shortcomings when constructing knowledge prompts for LLMs, making it difficult to fully exploit the potential of LLMs in vertical domains. Moreover, most existing work relies on models such as GPT-4 for answer reasoning, which is costly and time-consuming. This paper proposes a contextual-knowledge-prompted LLM visual question answering method for the power domain. First, a base visual question answering model is used to generate contextual knowledge examples, and an answer selection layer is introduced to produce candidate answers. Next, the contextual knowledge examples and candidate answers are incorporated into the prompt to unlock the potential of the LLM on power-domain VQA tasks. Finally, the designed contextual knowledge prompt is used by the LLM to generate the predicted answer. In addition, the open-source, freely available LLaMA is adopted in place of GPT-4 for the VQA task, and a small dataset is constructed for fine-tuning the LLM. Compared with several state-of-the-art methods, the proposed approach improves accuracy on the EVQA and A-OKVQA datasets by more than 8.8% and 14.5%, respectively.
Abstract
[Objective] Significant progress has been made in applying large language models (LLMs) to knowledge-based visual question answering (VQA), where systems jointly reason over visual content and external knowledge to produce accurate answers. However, existing approaches are limited in specialized vertical domains, particularly the power industry. A major challenge lies in framing effective prompts for LLMs. Given the scarcity of domain-specific textual corpora and the highly technical nature of power industry system operations, traditional prompt engineering methods often fail to provide sufficient contextual grounding. Consequently, even powerful general-purpose LLMs are unable to fully exploit their reasoning capabilities, resulting in suboptimal performance and limited practical utility. Moreover, most existing studies rely heavily on proprietary, closed-source models, such as GPT-4, for inference in VQA tasks. Despite these models' impressive zero-shot capabilities, their use incurs substantial computational costs, application programming interface latency, and reliance on third-party services, hindering scalability, reproducibility, and real-world deployment, particularly in industrial settings that require data privacy, low-latency responses, and cost efficiency. These constraints underscore the need for an open, efficient, and domain-adapted alternative that can deliver high accuracy without sacrificing autonomy or affordability. [Methods] This paper proposes a novel LLM-based visual question answering framework tailored to the power industry and centered on contextual knowledge prompting. The method leverages a foundational vision-language model that generates initial contextual knowledge examples from input image-question pairs. These examples encapsulate relevant visual semantics and preliminary reasoning traces. Subsequently, we introduce a lightweight answer selection layer that produces a set of plausible candidate answers from multimodal features. Crucially, the generated contextual knowledge examples and candidate answers are dynamically integrated into a structured prompt template, which is then fed to an LLM for final reasoning and answer refinement. This design effectively bridges the gap between generic visual understanding and domain-specific knowledge, enabling the LLM to “reason with context” rather than relying on its internal (and often incomplete) pre-trained knowledge. In alignment with our goals of accessibility and sustainability, we deliberately use LLaMA, an open-source, freely available LLM, as the backbone of our system, replacing expensive alternatives such as GPT-4. To further enhance domain adaptation, we curate a small but high-quality dataset comprising annotated image-question-answer triples from real-world power infrastructure scenarios (e.g., substation equipment identification, fault diagnosis from thermal images, and safety compliance checks). This dataset is used to fine-tune the LLaMA-based VQA pipeline with parameter-efficient techniques, such as low-rank adaptation (LoRA), achieving rapid domain adaptation with minimal computational overhead. [Results] We evaluate our proposed method on two established knowledge-intensive VQA benchmarks: EVQA and A-OKVQA.
The experimental results demonstrate that our contextual knowledge-prompting strategy significantly outperforms state-of-the-art baselines, achieving absolute accuracy gains of more than 8.8% on EVQA and 14.5% on A-OKVQA, validating the efficacy of our prompt construction mechanism and the viability of open-source LLMs in specialized industrial applications. [Conclusions] This work advances the technical frontier of domain-specific VQA and provides a practical, cost-effective, and reproducible blueprint for deploying large-model intelligence in critical infrastructure sectors.
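To make the prompting pipeline described in the Methods concrete, the following is a minimal, illustrative sketch of how contextual knowledge examples and candidate answers might be assembled into a single prompt for the LLM. It is a sketch under stated assumptions rather than the authors' implementation: the names ContextExample and build_prompt, the template wording, and the power-domain example content are all hypothetical.

```python
# Minimal sketch of contextual-knowledge prompt construction
# (hypothetical names; not the authors' released code).
from dataclasses import dataclass
from typing import List


@dataclass
class ContextExample:
    """One in-context example produced by the base VQA model."""
    caption: str   # visual semantics extracted from the image
    question: str
    answer: str


def build_prompt(caption: str, question: str,
                 candidates: List[str],
                 examples: List[ContextExample]) -> str:
    """Assemble the structured prompt fed to the (fine-tuned) LLaMA model."""
    parts = ["Answer the question about the power-grid scene, "
             "choosing the most plausible candidate."]
    for ex in examples:                                   # contextual knowledge examples
        parts.append(f"Context: {ex.caption}\nQ: {ex.question}\nA: {ex.answer}")
    parts.append(f"Context: {caption}")                   # description of the query image
    parts.append("Candidates: " + ", ".join(candidates))  # answer-selection-layer output
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


# Example usage with made-up power-domain content.
prompt = build_prompt(
    caption="A thermal image of a substation transformer with a hot spot on one bushing.",
    question="Which component is most likely overheating?",
    candidates=["bushing", "radiator", "tap changer"],
    examples=[ContextExample(
        caption="An insulator string on a transmission tower with visible cracks.",
        question="What defect is shown?",
        answer="a cracked insulator")],
)
print(prompt)
```

In a full pipeline, the returned prompt string would be passed to the LoRA-adapted LLaMA model (for example, via the Hugging Face transformers and peft libraries) to obtain the final predicted answer.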
Key words
power domain /
knowledge prompt /
large language model /
visual question answering