Journal of Tsinghua University(Science and Technology)    2022, Vol. 62 Issue (5) : 900-907     DOI: 10.16511/j.cnki.qhdxxb.2022.21.010
SPECIAL SECTION: COMPUTATIONAL LINGUISTICS
Exploiting image captions and external knowledge as representation enhancement for VQA
WANG Yichao, ZHU Muhua, XU Chen, ZHANG Yan, WANG Huizhen, ZHU Jingbo
Natural Language Processing Lab, School of Computer Science and Engineering, Northeastern University, Shenyang 110000, China
Abstract  As a multimodal task, visual question answering (VQA) requires a comprehensive understanding of both images and questions. However, reasoning over images and questions alone may fail in some cases, even though other sources of information, such as image captions and external knowledge bases, are available for the task. This paper proposes a novel approach that incorporates image captions and external knowledge into VQA models. To utilize image captions, the approach adopts a co-attention mechanism and encodes the captions under the guidance of the question. To incorporate external knowledge, it uses knowledge graph embeddings as the initialization of word embeddings. These methods enrich the feature representations and the reasoning capability of the model. Experimental results on the OK-VQA dataset show that the proposed method achieves improvements of 1.71% and 1.88% over the baseline and the best previously reported system, respectively, which demonstrates its effectiveness.
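To make the approach described in the abstract more concrete, below is a minimal PyTorch sketch of the two ideas: attending over image-caption tokens under the guidance of the question, and initializing word embeddings from pretrained knowledge-graph embeddings. The class name `CaptionEnhancedVQA`, the LSTM encoders, the single multi-head attention layer standing in for the full co-attention stack, and the pooled `image_feat` input are all illustrative assumptions, not the authors' actual implementation.

```python
# Sketch of (1) question-guided attention over caption tokens and
# (2) word embeddings initialized from knowledge-graph embeddings.
# All names and dimensions are illustrative.
import torch
import torch.nn as nn


class CaptionEnhancedVQA(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim, num_answers,
                 kg_embeddings=None):
        super().__init__()
        # Word embeddings, optionally initialized from pretrained
        # knowledge-graph embeddings (a tensor of shape (vocab_size, emb_dim)).
        self.embed = nn.Embedding(vocab_size, emb_dim)
        if kg_embeddings is not None:
            self.embed.weight.data.copy_(kg_embeddings)

        self.q_enc = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.c_enc = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

        # Question-guided attention over caption tokens: a simple stand-in
        # for the co-attention mechanism described in the abstract.
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=4,
                                          batch_first=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_answers)

    def forward(self, question_ids, caption_ids, image_feat):
        # question_ids: (B, Lq), caption_ids: (B, Lc)
        # image_feat: (B, hidden_dim) pooled image feature (assumed given).
        q_tok, _ = self.q_enc(self.embed(question_ids))   # (B, Lq, H)
        c_tok, _ = self.c_enc(self.embed(caption_ids))    # (B, Lc, H)

        # Question tokens query the caption tokens, so the caption
        # representation is built under the guidance of the question.
        c_attended, _ = self.attn(query=q_tok, key=c_tok, value=c_tok)
        q_vec = q_tok.mean(dim=1)                          # (B, H)
        c_vec = c_attended.mean(dim=1)                     # (B, H)

        # Fuse the question with the image feature (elementwise product here)
        # and concatenate the question-guided caption summary.
        fused = torch.cat([q_vec * image_feat, c_vec], dim=-1)
        return self.classifier(fused)
```

In practice the pooled `image_feat` could come from an object-detection backbone such as a Faster R-CNN-based bottom-up attention model, and the single attention layer would be replaced by stacked co-attention blocks; this sketch only illustrates where the caption and knowledge-graph signals enter the model.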
Keywords: visual question answering; multimodal fusion; knowledge graph; image captioning
Issue Date: 26 April 2022
Cite this article:   
WANG Yichao, ZHU Muhua, XU Chen, et al. Exploiting image captions and external knowledge as representation enhancement for VQA[J]. Journal of Tsinghua University(Science and Technology), 2022, 62(5): 900-907.
URL:  
http://jst.tsinghuajournals.com/EN/10.16511/j.cnki.qhdxxb.2022.21.010     OR     http://jst.tsinghuajournals.com/EN/Y2022/V62/I5/900