Self-supervised deep semantics-preserving Hashing for cross-modal retrieval
LU Bo1, DUAN Xiaodong1, YUAN Ye2
1. SEAC Key Laboratory of Big Data Applied Technology, Dalian Minzu University, Dalian 116600, China; 2. School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
Abstract: The key issue in cross-modal hashing for cross-modal retrieval is how to maximize the consistency of the semantic relationships among heterogeneous media data. This paper presents a self-supervised deep semantics-preserving hashing network (UDSPH) that generates compact hash codes with an end-to-end architecture. Two modality-specific hashing networks are first trained to generate the hash codes and high-level features. The semantic relationships between the modalities are then measured with a cross-modal attention mechanism that maximally preserves the local semantic correlations. Multi-label semantic information in the training data simultaneously guides the training of the two modality-specific hashing networks through self-supervised adversarial learning. The resulting deep semantic hashing network preserves the semantic associations in the global view and improves the discriminative capability of the generated hash codes. Tests on three widely used benchmark datasets verify the effectiveness of this method.
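The pipeline the abstract outlines — modality-specific features aligned by cross-modal attention, then binarized into compact hash codes — can be sketched roughly as follows. This is a minimal numpy illustration, not the paper's implementation: the softmax-style attention, the linear hash projection, and all dimensions are illustrative assumptions.

```python
import numpy as np

def cross_modal_attention(img_feats, txt_feats):
    """Attend each text feature over the image regions (a simplified
    cross-modal attention sketch, not the paper's exact formulation)."""
    sim = img_feats @ txt_feats.T                 # (n_img, n_txt) similarities
    sim -= sim.max(axis=0, keepdims=True)         # stabilize the softmax
    attn = np.exp(sim) / np.exp(sim).sum(axis=0, keepdims=True)
    return attn.T @ img_feats                     # text-attended image context

def hash_codes(features, proj):
    """Binarize projected features into {-1, +1} hash codes via sign()."""
    return np.sign(features @ proj)

rng = np.random.default_rng(0)
img = rng.standard_normal((5, 8))    # 5 image regions, 8-d features (hypothetical)
txt = rng.standard_normal((3, 8))    # 3 word features (hypothetical)
proj = rng.standard_normal((8, 16))  # projection to 16-bit codes (hypothetical)

ctx = cross_modal_attention(img, txt)
codes = hash_codes(ctx, proj)
print(codes.shape)  # one 16-bit code per text feature
```

In the actual method the binarization would be trained end-to-end (typically with a tanh relaxation of sign) and guided by the multi-label supervision described above; this sketch only shows the data flow.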
LU Bo, DUAN Xiaodong, YUAN Ye. Self-supervised deep semantics-preserving Hashing for cross-modal retrieval. Journal of Tsinghua University(Science and Technology), 2022, 62(9): 1442-1449.