A Chinese keyphrase extraction method for multimodal information enhancement representation

ZHOU Xuanyu, LIU Lin, LU Xiao, LI Xuan, ZHANG Siming

Journal of Tsinghua University (Science and Technology), 2024, Vol. 64, Issue 10: 1785-1796.

DOI: 10.16511/j.cnki.qhdxxb.2024.27.015
Special topic: Big data analysis


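Abstract (Chinese version)

Keyphrase extraction automatically identifies the words or phrases that reflect the topic of a text and is widely used in text retrieval, text summarization, and related fields. Current keyphrase extraction methods rely mainly on pretrained language models to obtain text representations. These models are trained mostly on general-purpose, single-modality text corpora, so they cannot be adapted to the domain of a downstream task and have limited semantic representation ability. This paper proposes MIEnhance-KPE, a Chinese keyphrase extraction method with multimodal information-enhanced representation. First, an Adapter layer integrates radical information into the layers of the pretrained language model to obtain a domain-adaptive text representation. Second, a convolutional neural network extracts image features of Chinese characters, and a cross-attention mechanism fuses these image features with the text features to enhance the semantic representation. Finally, a conditional random field (CRF) model performs sequence labeling, and a position-word-frequency weight is computed for each word to rank the candidates and obtain the keyphrases. Compared with KIEMP, a state-of-the-art keyphrase extraction method, MIEnhance-KPE improves the F value by 15.71% on a public Chinese scientific literature dataset and by 3.40% on a self-constructed Chinese education keyphrase extraction dataset. Ablation experiments show that both the proposed domain-adaptive module and the visual semantic enhancement representation module effectively improve extraction accuracy. MIEnhance-KPE can help education researchers track trends in educational development accurately and promote innovation in educational theory and practice.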
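
The last step of the pipeline (sequence labeling followed by position and word-frequency ranking) can be illustrated with a minimal Python sketch. The abstract does not state the weighting formula, so the score used here (frequency multiplied by an early-position bonus) is purely an assumption, and the names `decode_bio` and `rank_candidates` are hypothetical. The tags are written by hand, standing in for CRF output.

```python
from collections import Counter


def decode_bio(tokens, tags):
    """Collect (phrase, start_index) candidates from per-character B/I/O tags."""
    candidates = []
    current, start = [], None
    for i, (tok, tag) in enumerate(zip(tokens, tags)):
        if tag == "B":
            # A new phrase begins; flush any phrase still open.
            if current:
                candidates.append(("".join(current), start))
            current, start = [tok], i
        elif tag == "I" and current:
            current.append(tok)
        else:
            # "O" (or a stray "I") closes the open phrase, if any.
            if current:
                candidates.append(("".join(current), start))
            current, start = [], None
    if current:
        candidates.append(("".join(current), start))
    return candidates


def rank_candidates(candidates, doc_length, top_k=5):
    """Rank phrases by an ASSUMED position-frequency weight (not the paper's formula)."""
    freq = Counter(phrase for phrase, _ in candidates)
    first_pos = {}
    for phrase, start in candidates:
        first_pos.setdefault(phrase, start)  # keep the earliest occurrence
    # Assumed weight: frequency times a bonus that decays with first position.
    scores = {p: freq[p] * (1.0 - first_pos[p] / doc_length) for p in freq}
    return sorted(scores, key=lambda p: (-scores[p], first_pos[p]))[:top_k]


tokens = list("关键词抽取方法用于文本检索和关键词抽取")
tags = list("BIIIIOOOOBIIIOBIIII")  # hand-written stand-in for CRF output
ranked = rank_candidates(decode_bio(tokens, tags), len(tokens))
print(ranked)
```

In this toy input the phrase that occurs twice and appears first is ranked ahead of the phrase that occurs once later in the text.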

Abstract

[Objective] China is undergoing a critical digital transformation in education. This shift has led to explosive growth in online educational content, and researchers find it increasingly difficult to sift through massive amounts of text. The need to grasp important information quickly has made keyphrase extraction an invaluable tool. Keyphrase extraction automatically identifies the words or phrases that encapsulate the main themes of a text and is critical for text retrieval, text summarization, and other tasks. Current keyphrase extraction methods rely mainly on pretrained language models to obtain text representations. These models are typically trained on generic, single-modality text corpora, so they struggle to adapt to specific domains according to the characteristics of downstream tasks and have limited ability to capture subtle semantics. Accurate and efficient keyphrase extraction from massive texts therefore remains a pressing research challenge.

[Methods] This paper presents a novel approach to Chinese keyphrase extraction, multimodal information enhancement representation for keyphrase extraction (MIEnhance-KPE). The method first decomposes characters into radicals using a character-splitting dictionary and extracts radical features with a convolutional neural network. At the same time, a trainable adapter layer is inserted between the transformer layers of a pretrained language model. Together, these operations integrate the low-level semantic features of the pretrained language model with the radical features to obtain a domain-adaptive text representation. Characters are then rendered as glyph images from different historical periods and writing styles, and group convolution extracts the glyph features of these characters. A cross-attention mechanism fuses the glyph and text features, yielding richer and more comprehensive semantic representations. Finally, a conditional random field model learns the relationship between the fused features and the labels. Sequence labeling identifies candidate keyphrases, which are ranked by a position and word-frequency weight to determine the most relevant keyphrases.

[Results] MIEnhance-KPE was evaluated on two datasets: the public Chinese Scientific Literature (CSL) dataset and the self-constructed Chinese Education Keyphrase Extraction Dataset (CEKED). Compared with KIEMP, a state-of-the-art keyphrase extraction method, the F value increased by 15.71% on CSL and by 3.40% on CEKED. Ablation experiments further confirmed that both the domain-adaptive module and the visual semantic enhancement module improve keyphrase extraction accuracy. In addition, this paper compared several methods for fusing glyph and semantic features and concluded that the cross-attention mechanism is best at adaptively merging different features to improve task accuracy.

[Conclusions] MIEnhance-KPE considerably improves the accuracy of keyphrase extraction. This helps educational researchers quickly locate relevant literature and understand cutting-edge trends in educational development. MIEnhance-KPE also offers a new approach to literature analysis in the education sector and provides a solid data foundation for examining the motivation of educational reform and innovation, thereby accelerating the digital transformation of education.
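
The fusion step can be illustrated with a minimal sketch of cross-attention, in which each text feature vector acts as a query over the glyph feature vectors. This is not the authors' implementation. It assumes a single attention head, no learned projection matrices, and a residual sum as the fused output, and the function name `cross_attention_fuse` is hypothetical. The abstract does not specify any of these details.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_fuse(text_feats, glyph_feats):
    """Fuse text and glyph features with single-head scaled dot-product attention.

    Queries come from text_feats; keys and values come from glyph_feats.
    ASSUMPTION: no learned projections, and the output is text + attended glyph.
    """
    d = len(text_feats[0])
    fused = []
    for q in text_feats:
        # Similarity of this text vector to every glyph vector.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in glyph_feats]
        weights = softmax(scores)
        # Attention-weighted average of the glyph vectors.
        context = [sum(w * v[j] for w, v in zip(weights, glyph_feats))
                   for j in range(d)]
        # Residual connection keeps the original text information.
        fused.append([qi + ci for qi, ci in zip(q, context)])
    return fused


text = [[1.0, 0.0], [0.0, 1.0]]   # toy text features, one row per character
glyph = [[1.0, 0.0], [0.0, 1.0]]  # toy glyph features, one row per character
fused = cross_attention_fuse(text, glyph)
print(fused)
```

Because the attention weights are computed per character, each position can draw more or less on the glyph features depending on how well they match its text representation, which is the adaptive behavior the results paragraph attributes to cross-attention.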

Key words

Chinese keyphrase extraction / multimodal information / multigranularity semantic features / cross-attention mechanism / domain adaptation

Cite this article

ZHOU Xuanyu, LIU Lin, LU Xiao, LI Xuan, ZHANG Siming. A Chinese keyphrase extraction method for multimodal information enhancement representation[J]. Journal of Tsinghua University (Science and Technology), 2024, 64(10): 1785-1796. https://doi.org/10.16511/j.cnki.qhdxxb.2024.27.015

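Funding

Youth Science Fund Project of the National Natural Science Foundation of China (62007007); General Program of the Natural Science Foundation of Hunan Province (2023JJ30415, 2022JJ30395); Hunan Provincial Postgraduate Research Innovation Project (CX20230485)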
