Abstract: As a multimodal task, visual question answering (VQA) requires a comprehensive understanding of both images and questions. However, reasoning over images and questions alone can fail in some cases, even though other information useful for the task, such as image captions and external knowledge bases, is available. This paper proposes a novel approach that incorporates image captions and external knowledge into VQA models. The approach adopts a co-attention mechanism and encodes image captions under the guidance of the question; it further incorporates external knowledge by using knowledge graph embeddings as the initialization of word embeddings. Together, these methods enrich feature representation and model reasoning. Experimental results on the OK-VQA dataset show that the proposed method achieves improvements of 1.71% and 1.88% over the baseline and the best previously reported system, respectively, demonstrating its effectiveness.
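To make the two ideas in the abstract concrete, the following is a minimal PyTorch sketch, not the paper's implementation: a question-guided attention module over caption token features, and a word-embedding layer initialized from knowledge-graph vectors. All module names, dimensions, and the `kg_vectors` lookup are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class QuestionGuidedCaptionEncoder(nn.Module):
    """Attend over caption token features with the question as the guide.

    A minimal sketch of question-guided attention; the paper's actual
    co-attention module may differ in structure and dimensions.
    """

    def __init__(self, hidden_dim: int = 512):
        super().__init__()
        self.query_proj = nn.Linear(hidden_dim, hidden_dim)
        self.key_proj = nn.Linear(hidden_dim, hidden_dim)
        self.value_proj = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, question_vec: torch.Tensor, caption_feats: torch.Tensor) -> torch.Tensor:
        # question_vec: (B, H) pooled question representation
        # caption_feats: (B, T, H) caption token features
        q = self.query_proj(question_vec).unsqueeze(1)                  # (B, 1, H)
        k = self.key_proj(caption_feats)                                # (B, T, H)
        v = self.value_proj(caption_feats)                              # (B, T, H)
        scores = torch.bmm(q, k.transpose(1, 2)) / k.size(-1) ** 0.5    # (B, 1, T)
        weights = F.softmax(scores, dim=-1)                             # attention over caption tokens
        return torch.bmm(weights, v).squeeze(1)                         # (B, H) question-guided caption feature


def init_embeddings_from_kg(vocab, kg_vectors, dim: int = 300) -> nn.Embedding:
    """Initialize a word-embedding layer from knowledge-graph embeddings.

    `kg_vectors` is a hypothetical {word: tensor} lookup (e.g. pre-trained
    entity embeddings from a commonsense knowledge graph); words missing
    from it fall back to small random vectors.
    """
    weight = torch.randn(len(vocab), dim) * 0.02
    for idx, word in enumerate(vocab):
        if word in kg_vectors:
            weight[idx] = kg_vectors[word]
    return nn.Embedding.from_pretrained(weight, freeze=False)
```

Under this reading, the attended caption feature would be fused with the usual image and question features before answer prediction, and the KG-initialized embedding layer replaces a purely random or GloVe-style initialization.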