Journal of Tsinghua University (Science and Technology), 2017, Vol. 57, Issue (1): 1-6. DOI: 10.16511/j.cnki.qhdxxb.2017.21.001
Section: Computer Science and Technology
基于双向门限递归单元神经网络的维吾尔语形态切分
哈里旦木·阿布都克里木, 程勇, 刘洋, 孙茂松
清华大学 计算机科学与技术系, 智能技术与系统国家重点实验室, 清华信息科学与技术国家实验室(筹), 北京 100084
Uyghur morphological segmentation with bidirectional GRU neural networks
ABUDUKELIMU Halidanmu, CHENG Yong, LIU Yang, SUN Maosong
State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
摘要 以维吾尔语为代表的低资源、形态丰富语言的信息处理对于满足“一带一路”语言互通的战略需求具有重要意义。这类语言通过组合语素来表示句法和语义关系,因而给语言处理带来严重的数据稀疏问题。该文提出基于双向门限递归单元神经网络的维吾尔语形态切分方法,将维吾尔词自动切分为语素序列,从而缓解数据稀疏问题。双向门限递归单元神经网络能够充分利用双向上下文信息进行切分消歧,并通过门限递归单元有效处理长距离依赖。实验结果表明,该方法相比主流统计方法和单向门限递归单元神经网络获得了显著的性能提升。该方法具有良好的语言无关性,能够用于处理更多的形态丰富语言。
关键词: 双向门限递归单元; 神经网络; 维吾尔语; 形态切分
Abstract: Information processing for low-resource, morphologically rich languages such as Uyghur is critical for addressing the language barrier faced by China's One Belt and One Road (B&R) program. In such languages, individual words encode rich grammatical and semantic information by concatenating morphemes to a root form, which leads to severe data sparsity in language processing. This paper presents a Uyghur morphological segmentation approach based on bidirectional gated recurrent unit (GRU) neural networks, which automatically divides Uyghur words into sequences of morphemes to alleviate the sparsity problem. The bidirectional GRU exploits context in both directions to resolve segmentation ambiguities and uses its gating mechanism to model long-distance dependencies. Tests show that this approach significantly outperforms conditional random fields and unidirectional GRUs. The approach is language-independent and can be applied to other morphologically rich languages.
Key words: bidirectional gated recurrent unit; neural network; Uyghur; morphological segmentation
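The approach described in the abstract treats each Uyghur word as a character sequence and labels every character with a boundary tag using a bidirectional GRU. The sketch below is a minimal illustration of that formulation, not the authors' implementation: the use of PyTorch's nn.GRU, the B/M/E/S tag scheme, and all hyperparameters and toy dimensions are assumptions made for the example.

# Minimal sketch: character-level boundary tagging with a bidirectional GRU.
import torch
import torch.nn as nn

class BiGRUSegmenter(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        # Bidirectional GRU reads the character sequence left-to-right and right-to-left.
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        # Project the concatenated forward/backward states to per-character tag scores.
        self.out = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, char_ids):                    # char_ids: (batch, seq_len)
        states, _ = self.gru(self.embed(char_ids))  # (batch, seq_len, 2 * hidden_dim)
        return self.out(states)                     # (batch, seq_len, num_tags)

# Toy usage with an assumed 4-tag scheme (B/M/E/S positions within a morpheme).
model = BiGRUSegmenter(vocab_size=100, num_tags=4)
chars = torch.randint(1, 100, (2, 9))               # two words of nine characters each
logits = model(chars)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 4),
                             torch.zeros(2 * 9, dtype=torch.long))

In this formulation, gold tags for training would be derived from the reference morpheme segmentation, and at test time the predicted tag sequence would be converted back into morpheme boundaries; both steps are part of the assumed setup, not details taken from the paper.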
Received: 2016-07-08; Published: 2017-01-15
CLC number: TP391.2
Corresponding author: LIU Yang, associate professor, E-mail: liuyang2011@tsinghua.edu.cn
Cite this article:
哈里旦木·阿布都克里木, 程勇, 刘洋, 孙茂松. 基于双向门限递归单元神经网络的维吾尔语形态切分[J]. 清华大学学报(自然科学版), 2017, 57(1): 1-6.
ABUDUKELIMU Halidanmu, CHENG Yong, LIU Yang, SUN Maosong. Uyghur morphological segmentation with bidirectional GRU neural networks[J]. Journal of Tsinghua University (Science and Technology), 2017, 57(1): 1-6.
Link to this article:
http://jst.tsinghuajournals.com/CN/10.16511/j.cnki.qhdxxb.2017.21.001 or http://jst.tsinghuajournals.com/CN/Y2017/V57/I1/1
Table 1  Examples of Uyghur words
Figure 1  Bidirectional GRU neural network for Uyghur morphological segmentation
Figure 2  Gated recurrent unit
Table 2  Uyghur morphological segmentation corpus
Table 3  Frequencies and proportions of Uyghur words
Table 4  Frequencies and proportions of Uyghur morphemes
Table 5  Effect of vector dimension on BiGRU segmentation performance
Table 6  Comparison of experimental results
Table 7  Case analysis of Uyghur morphological segmentation