Journal of Tsinghua University (Science and Technology)  2022, Vol. 62 Issue (5): 900-907    DOI: 10.16511/j.cnki.qhdxxb.2022.21.010
Exploiting image captions and external knowledge as representation enhancement for VQA
WANG Yichao, ZHU Muhua, XU Chen, ZHANG Yan, WANG Huizhen, ZHU Jingbo
Natural Language Processing Lab, School of Computer Science and Engineering, Northeastern University, Shenyang 110000, China
Abstract: As a multimodal task, visual question answering (VQA) requires a deep understanding of both the image and the textual question in order to infer the answer. In many cases, however, simple reasoning over the image and the question alone is not enough to reach the correct answer, even though other useful information, such as image captions and external knowledge, is available. This paper proposes a VQA model that uses image captions and external knowledge to enhance its representations. Guided by the question, the model applies a co-attention mechanism to encode the image and its caption separately. It also incorporates external knowledge by using knowledge graph embeddings to initialize the word embeddings. Together, these additions enrich the model's feature representations and strengthen its reasoning ability. Experiments on the OKVQA dataset show that the proposed method improves accuracy by 1.71% over the baseline and by 1.88% over the best previously reported systems, which demonstrates its effectiveness.
Key words: visual question answering; multimodal fusion; knowledge graph; image captioning
Received: 2021-12-16      Published: 2022-04-26
Corresponding author: WANG Huizhen, lecturer,      E-mail:
About the author: WANG Yichao (born 1996), male, master's student.
WANG Yichao, ZHU Muhua, XU Chen, ZHANG Yan, WANG Huizhen, ZHU Jingbo. Exploiting image captions and external knowledge as representation enhancement for VQA [J]. Journal of Tsinghua University (Science and Technology), 2022, 62(5): 900-907.
