Performance comparison of neural machine translation systems in Uyghur-Chinese translation
Halidanmu Abudukelimu, LIU Yang, SUN Maosong
Tsinghua National Laboratory for Information Science and Technology, State Key Laboratory of Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
Abstract: Neural machine translation based on deep learning significantly surpasses traditional statistical machine translation in many languages and has become the mainstream machine translation technology. This paper compares six influential neural machine translation methods, distinguished by their level of word granularity, on the Uyghur-Chinese machine translation task: the attention mechanism (GroundHog), vocabulary expansion (LV-groundhog), subword units for both the source and target languages (subword-nmt), a hybrid of characters and words (nmt.hybrid), subword units with characters (dl4mt-cdec), and fully character-level translation (dl4mt-c2c). The experimental results show that Uyghur-Chinese neural machine translation performs best when the source language is segmented into subword units and the target language is represented as characters (dl4mt-cdec). This paper is the first to apply neural machine translation to Uyghur-Chinese machine translation and the first to compare different neural machine translation methods on the same corpus. This work is an important reference not only for Uyghur-Chinese machine translation but also for neural machine translation tasks in general.
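The subword segmentation used by subword-nmt rests on byte-pair encoding (BPE): the most frequent adjacent symbol pair in the training vocabulary is repeatedly merged into a new symbol, so frequent words stay whole while rare words split into smaller units. The following is a minimal illustrative sketch of that procedure on a toy word-frequency dictionary; it is not the actual subword-nmt implementation, and all names and the sample data are invented for illustration.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge operations from a {word: frequency} dictionary.

    Each word is a tuple of symbols; '</w>' marks the end of a word so
    that merges cannot cross word boundaries.
    """
    vocab = {tuple(word) + ('</w>',): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the vocabulary with the chosen pair merged into one symbol.
        vocab = {tuple(_merge(symbols, best)): freq for symbols, freq in vocab.items()}
    return merges

def _merge(symbols, pair):
    """Replace every occurrence of the adjacent pair with its concatenation."""
    out, i = [], 0
    while i < len(symbols):
        if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def segment(word, merges):
    """Segment a new word by replaying the learned merges in order."""
    symbols = list(word) + ['</w>']
    for pair in merges:
        symbols = _merge(symbols, pair)
    return symbols

if __name__ == '__main__':
    freqs = {'low': 5, 'lower': 2, 'newest': 6, 'widest': 3}
    merges = learn_bpe(freqs, 10)
    print(segment('lowest', merges))  # e.g. ['low', 'est</w>']
```

With enough merge operations the segmenter reconstructs common stems and suffixes ('low' + 'est'), which is exactly what makes subword units attractive for a morphologically rich source language such as Uyghur.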
Halidanmu Abudukelimu, LIU Yang, SUN Maosong. Performance comparison of neural machine translation systems in Uyghur-Chinese translation. Journal of Tsinghua University (Science and Technology), 2017, 57(8): 878-883.
[1] Sutskever I, Vinyals O, Le Q. Sequence to Sequence Learning with Neural Networks[Z/OL]. (2014-09-10)[2015-03-15]. https://arxiv.org/abs/1409.3215.
[2] Bahdanau D, Cho K, Bengio Y. Neural Machine Translation by Jointly Learning to Align and Translate[Z/OL]. (2014-09-01)[2015-03-20]. https://arxiv.org/abs/1409.0473.
[3] Wu Y, Schuster M, Chen Z, et al. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation[Z/OL]. (2016-09-26)[2016-10-02]. https://arxiv.org/abs/1609.08144.
[4] Johnson M, Schuster M, Le Q, et al. Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation[Z/OL]. (2016-11-14)[2016-11-16]. https://arxiv.org/abs/1611.04558.
[5] Jean S, Cho K, Memisevic R, et al. On using very large target vocabulary for neural machine translation[C]//Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics. Beijing, China, 2015: 1-10.
[6] Luong M, Sutskever I, Le Q, et al. Addressing the rare word problem in neural machine translation[C]//Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics. Beijing, China, 2015: 11-19.
[7] Chung J, Cho K, Bengio Y. A character-level decoder without explicit segmentation for neural machine translation[C]//Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Berlin, Germany, 2016: 1693-1703.
[8] Ling W, Trancoso I, Dyer C, et al. Character-Based Neural Machine Translation[Z/OL]. (2015-11-14)[2016-02-03]. https://arxiv.org/abs/1511.04586.
[9] Costa-Jussà M, Fonollosa J. Character-based neural machine translation[C]//Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Berlin, Germany, 2016: 357-361.
[10] Luong M, Manning C. Achieving open vocabulary neural machine translation with hybrid word-character models[C]//Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Berlin, Germany, 2016: 1054-1063.
[11] Lee J, Cho K, Hofmann T. Fully Character-Level Neural Machine Translation without Explicit Segmentation[Z/OL]. (2016-10-10)[2016-10-22]. https://arxiv.org/abs/1610.03017.
[12] Sennrich R, Haddow B, Birch A. Neural machine translation of rare words with subword units[C]//Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Berlin, Germany, 2016: 1715-1725.
[13] Luong M, Le Q, Sutskever I, et al. Multi-Task Sequence to Sequence Learning[Z/OL]. (2015-11-19)[2016-07-17]. https://arxiv.org/abs/1511.06114.
[14] Firat O, Cho K, Bengio Y. Multi-way, multilingual neural machine translation with a shared attention mechanism[C]//Proceedings of NAACL-HLT. San Diego, CA, USA, 2016: 866-875.
[15] Firat O, Sankaran B, Al-Onaizan Y, et al. Zero-resource translation with multi-lingual neural machine translation[C]//Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, TX, USA, 2016: 268-277.
[16] ABUDUKELIMU Halidanmu, CHENG Yong, LIU Yang, et al. Uyghur morphological segmentation with bidirectional GRU neural networks[J]. Journal of Tsinghua University (Science and Technology), 2017, 57(1): 1-6. (in Chinese)
[17] Abudukelimu H, Liu Y, Chen X, et al. Learning distributed representations of Uyghur words and morphemes[C]//Proceedings of CCL/NLP-NABD. Guangzhou, China, 2015: 202-211.
[18] Creutz M, Lagus K. Unsupervised discovery of morphemes[C]//Morphological and Phonological Learning: Proceedings of the 6th Workshop of the ACL Special Interest Group in Computational Phonology (SIGPHON). Stroudsburg, PA, USA, 2002: 21-30.
[19] Koehn P, Hoang H, Birch A, et al. Moses: Open source toolkit for statistical machine translation[C]//Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions. Prague, Czech Republic, 2007: 177-180.
[20] Sennrich R. How Grammatical Is Character-Level Neural Machine Translation? Assessing MT Quality with Contrastive Translation Pairs[Z/OL]. (2016-12-14)[2016-12-20]. https://arxiv.org/abs/1612.04629.
[21] Zoph B, Yuret D, May J, et al. Transfer learning for low-resource neural machine translation[C]//Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, TX, USA, 2016: 1568-1575.
[22] Luong M, Pham H, Manning C. Effective approaches to attention-based neural machine translation[C]//Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal, 2015: 1412-1421.