Visual question answering technology of large language models based on contextual knowledge prompt in the field of electric power

QIU Yu, FENG Jun, ZHENG Zhehui, ZHAO Yi, SONG Haomin, CHEN Zuge, WANG Shaolan

Journal of Tsinghua University(Science and Technology) ›› 2026, Vol. 66 ›› Issue (5): 957-966. DOI: 10.16511/j.cnki.qhdxxb.2026.21.002
KNOWLEDGE GRAPH AND SEMANTIC COMPUTING


Abstract

[Objective] Large language models (LLMs) have made significant progress in knowledge-based visual question answering (VQA), where systems jointly reason over visual content and external knowledge to produce accurate answers. However, existing approaches remain limited in specialized vertical domains, particularly the power industry. A major challenge lies in crafting effective prompts for LLMs. Given the scarcity of domain-specific textual corpora and the highly technical nature of power-system operations, traditional prompt engineering methods often fail to provide sufficient contextual grounding. Consequently, even powerful general-purpose LLMs cannot fully exploit their reasoning capabilities, resulting in suboptimal performance and limited practical utility. Moreover, most existing studies rely heavily on proprietary, closed-source models such as GPT-4 for inference in VQA tasks. Despite these models' impressive zero-shot capabilities, their use incurs substantial computational cost, application programming interface latency, and reliance on third-party services, hindering scalability, reproducibility, and real-world deployment, particularly in industrial settings that require data privacy, low-latency responses, and cost efficiency. These constraints underscore the need for an open, efficient, and domain-adapted alternative that delivers high accuracy without sacrificing autonomy or affordability.

[Methods] This paper proposes a novel LLM-based VQA framework tailored to the power industry and centered on contextual knowledge prompting. The method uses a foundation vision-language model to generate initial contextual knowledge examples from input image-question pairs; these examples encapsulate relevant visual semantics and preliminary reasoning traces. A lightweight answer selection layer then produces a set of plausible candidate answers from multimodal features. Crucially, the generated contextual knowledge examples and candidate answers are dynamically integrated into a structured prompt template, which is fed to an LLM for final reasoning and answer refinement. This design bridges the gap between generic visual understanding and domain-specific knowledge, enabling the LLM to “reason with context” rather than relying on its internal (and often incomplete) pre-trained knowledge. In line with our goals of accessibility and sustainability, we deliberately adopt LLaMA, an open-source, freely available LLM, as the backbone of the system, replacing expensive alternatives such as GPT-4. To further enhance domain adaptation, we curate a small but high-quality dataset of annotated image-question-answer triples from real-world power infrastructure scenarios (e.g., substation equipment identification, fault diagnosis from thermal images, and safety compliance checks). This dataset is used to fine-tune the LLaMA-based VQA pipeline with parameter-efficient techniques such as low-rank adaptation, achieving rapid adaptation with minimal computational overhead.

[Results] We evaluate the proposed method on two established knowledge-intensive VQA benchmarks, EVQA and A-OKVQA. The experimental results demonstrate that the contextual knowledge-prompting strategy significantly outperforms state-of-the-art baselines, achieving absolute accuracy gains of 8.8% on EVQA and 14.5% on A-OKVQA, validating the efficacy of the prompt construction mechanism and the viability of open-source LLMs in specialized industrial applications.

[Conclusions] This work advances the technical frontier of domain-specific VQA and provides a practical, cost-effective, and reproducible blueprint for deploying large-model intelligence in critical infrastructure sectors.
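The prompt-assembly step described in [Methods] can be sketched as follows. This is an illustrative reconstruction only: the function name, template wording, and field layout are assumptions, since the paper's actual prompt template is not reproduced in the abstract.

```python
def build_prompt(question, caption, context_examples, candidates):
    """Assemble a structured prompt from generated contextual knowledge
    examples and candidate answers (illustrative template, not the
    paper's actual one)."""
    lines = [
        "Answer the question about the power-equipment image.",
        f"Image description: {caption}",
        "Relevant context:",
    ]
    # Contextual knowledge examples produced by the vision-language model.
    for example in context_examples:
        lines.append(f"- {example}")
    # Candidate answers produced by the lightweight answer selection layer.
    lines.append("Candidate answers: " + ", ".join(candidates))
    lines.append(f"Question: {question}")
    lines.append("Answer:")
    return "\n".join(lines)
```

The resulting string would be passed to the LLaMA backbone, which reasons over the injected context and candidates instead of relying only on its pre-trained knowledge.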
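The parameter-efficiency claim for low-rank adaptation can be made concrete with simple arithmetic: a frozen weight matrix W receives a trainable update B·A of rank r, so only the two thin factors are trained. The dimensions below are illustrative assumptions (4096 matches the hidden size of a 7B-parameter LLaMA model, but the paper does not state its configuration).

```python
def lora_trainable_params(d_out, d_in, rank):
    """Trainable parameters of a LoRA update W + B @ A,
    where B is (d_out x rank) and A is (rank x d_in)."""
    return d_out * rank + rank * d_in

# One full attention projection matrix vs. its rank-8 LoRA update
# (illustrative dimensions, not taken from the paper).
full = 4096 * 4096
lora = lora_trainable_params(4096, 4096, 8)
print(full, lora, lora / full)  # 16777216 65536 0.00390625
```

Training roughly 0.4% of the parameters per adapted matrix is what makes "rapid adaptation with minimal computational overhead" plausible on a small curated dataset.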

Key words

power domain / knowledge prompt / large language model / visual question answering

Cite this article

QIU Yu, FENG Jun, ZHENG Zhehui, ZHAO Yi, SONG Haomin, CHEN Zuge, WANG Shaolan. Visual question answering technology of large language models based on contextual knowledge prompt in the field of electric power[J]. Journal of Tsinghua University(Science and Technology), 2026, 66(5): 957-966. https://doi.org/10.16511/j.cnki.qhdxxb.2026.21.002
