Deep learning multi-language topic alignment model across domains

doi:10.16511/j.cnki.qhdxxb.2020.21.003


Abstract: Building on deep representation learning of topics, this paper proposes a topic alignment model (TAM) that integrates bilingual word embeddings. The model uses bilingual word embeddings to extend the semantic alignment lexicon and, on top of a traditional bilingual topic model, introduces an auxiliary distribution that improves semantic sharing between different word distributions, thereby improving topic alignment in cross-lingual and cross-domain settings. Two new metrics, bilingual topic similarity (BTS) and bilingual alignment similarity (BAS), are proposed to evaluate the alignment achieved by the auxiliary distribution. Compared with the traditional alignment model MCTA, TAM improves bilingual alignment similarity by about 1.5% on the cross-lingual topic alignment task and improves the F1 score by about 10% on the cross-domain topic alignment task. These results are significant for improving cross-lingual and cross-domain information processing.

Keywords: cross-lingual topic alignment, cross-domain topic alignment, deep learning, bilingual word embedding, knowledge alignment
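The abstract states that bilingual word embeddings are used to extend the semantic alignment lexicon, but it does not give the procedure. As a rough illustration only, the sketch below assumes both languages share one embedding space and adds a translation pair whenever a source word's nearest target-language neighbour clears a cosine-similarity threshold. The function name `expand_lexicon`, the threshold of 0.8, and the toy vectors are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def expand_lexicon(seed, src_vecs, tgt_vecs, threshold=0.8):
    """Return a copy of `seed` extended with nearest-neighbour pairs.

    Assumption (not from the paper): a source word is paired with its
    most similar target word only if the similarity is >= threshold.
    Entries already in the seed lexicon are left unchanged.
    """
    lexicon = dict(seed)
    for src_word, src_vec in src_vecs.items():
        if src_word in lexicon:
            continue
        best_word, best_sim = None, -1.0
        for tgt_word, tgt_vec in tgt_vecs.items():
            sim = cosine(src_vec, tgt_vec)
            if sim > best_sim:
                best_word, best_sim = tgt_word, sim
        if best_word is not None and best_sim >= threshold:
            lexicon[src_word] = best_word
    return lexicon


# Toy vectors in a shared bilingual space (made up for illustration).
src_vecs = {
    "主题": [0.9, 0.1, 0.0],
    "模型": [0.1, 0.9, 0.1],
    "苹果": [0.0, 0.1, 0.9],
}
tgt_vecs = {
    "topic": [0.88, 0.12, 0.02],
    "model": [0.12, 0.85, 0.15],
    "car": [0.5, 0.5, 0.5],
}
seed = {"主题": "topic"}

expanded = expand_lexicon(seed, src_vecs, tgt_vecs)
print(expanded)
```

In this toy setup the threshold keeps low-confidence pairs out of the lexicon: a word whose closest neighbour is only weakly similar is simply left unaligned rather than forced into a match.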

Received: 2019-06-15; Published: 2020-04-26

Author contact: AN Lu, Professor, E-mail: anlu97@163.com

Cite this article:

YU Chuanming, YUAN Sai, HU Shasha, AN Lu. Deep learning multi-language topic alignment model across domains[J]. Journal of Tsinghua University (Science and Technology), 2020, 60(5): 430-439.

Link to this article:

http://jst.tsinghuajournals.com/CN/10.16511/j.cnki.qhdxxb.2020.21.003 or http://jst.tsinghuajournals.com/CN/Y2020/V60/I5/430


