Journal of Tsinghua University (Science and Technology), 2022, Vol. 62, Issue 5: 900-907    DOI: 10.16511/j.cnki.qhdxxb.2022.21.010
Special Section: Computational Linguistics
Exploiting image captions and external knowledge as representation enhancement for VQA
WANG Yichao, ZHU Muhua, XU Chen, ZHANG Yan, WANG Huizhen, ZHU Jingbo
Natural Language Processing Lab, School of Computer Science and Engineering, Northeastern University, Shenyang 110000, China
Abstract: As a multimodal task, visual question answering (VQA) requires a deep understanding of both images and questions to infer answers. However, reasoning over the image and the question alone fails in many cases; other useful information, such as image captions and external knowledge bases, is often available. This paper proposes an approach that incorporates image captions and external knowledge into VQA models. The approach adopts a co-attention mechanism and encodes the image caption under the guidance of the question; it further incorporates external knowledge by using knowledge graph embeddings as the initialization of the word embeddings. These methods enrich the model's feature representations and strengthen its reasoning ability. Experimental results on the OKVQA dataset show that the proposed method improves accuracy by 1.71% over the baseline and by 1.88% over the best previously reported system, demonstrating its effectiveness.
Key words: visual question answering; multimodal fusion; knowledge graph; image captioning
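To make the two mechanisms named in the abstract concrete, below is a minimal sketch in Python/PyTorch, not the authors' implementation: all class names, variable names, vocabulary sizes, and dimensions are illustrative assumptions. It shows (1) a word-embedding table initialized from pretrained knowledge graph embeddings and (2) question-guided co-attention, where question tokens attend over caption tokens.

```python
# Minimal sketch (assumed names and dimensions; not the paper's released code).
import torch
import torch.nn as nn

class CaptionCoAttention(nn.Module):
    def __init__(self, kg_embeddings: torch.Tensor, num_heads: int = 4):
        super().__init__()
        dim = kg_embeddings.size(1)
        # (1) Initialize the word-embedding table from pretrained knowledge
        # graph embeddings (e.g., embeddings trained over a resource such as
        # ConceptNet) instead of random vectors; fine-tuned during training.
        self.embed = nn.Embedding.from_pretrained(kg_embeddings, freeze=False)
        # (2) Co-attention: the question acts as the query, the caption as
        # key/value, yielding a question-guided caption representation.
        self.co_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, question_ids: torch.Tensor, caption_ids: torch.Tensor):
        q = self.embed(question_ids)   # (batch, q_len, dim)
        c = self.embed(caption_ids)    # (batch, c_len, dim)
        fused, _ = self.co_attn(q, c, c)
        return fused                   # (batch, q_len, dim)

if __name__ == "__main__":
    kg = torch.randn(1000, 64)         # stand-in for real KG embeddings
    model = CaptionCoAttention(kg)
    out = model(torch.randint(0, 1000, (2, 8)),
                torch.randint(0, 1000, (2, 12)))
    print(out.shape)                   # torch.Size([2, 8, 64])
```

In the full model described by the abstract, the same question-guided attention pattern would also be applied to image region features; the sketch above covers only the caption branch and the KG-based embedding initialization.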
Received: 2021-12-16      Published: 2022-04-26
Supported by the National Natural Science Foundation of China (Key Program No. 61732005; General Program No. 61876035)
Corresponding author: WANG Huizhen, lecturer, E-mail: wanghuizhen@mail.neu.edu.cn
About the first author: WANG Yichao (born 1996), male, master's degree candidate.
Cite this article:
WANG Yichao, ZHU Muhua, XU Chen, ZHANG Yan, WANG Huizhen, ZHU Jingbo. Exploiting image captions and external knowledge as representation enhancement for VQA[J]. Journal of Tsinghua University (Science and Technology), 2022, 62(5): 900-907.
Link to this article:
http://jst.tsinghuajournals.com/CN/10.16511/j.cnki.qhdxxb.2022.21.010  or  http://jst.tsinghuajournals.com/CN/Y2022/V62/I5/900