Multilayer structure based lexicon optimization for language modeling

Mijit Ablimit, Akbar Pattar, Askar Hamdulla

Journal of Tsinghua University (Science and Technology), 2017, Vol. 57, Issue 3: 257-263.

DOI: 10.16511/j.cnki.qhdxxb.2017.26.006
Computer Science and Technology

Multilayer structure based lexicon optimization for language modeling

  • Mijit Ablimit (1,2), Akbar Pattar (2), Askar Hamdulla (1,2)

Abstract

Choosing an appropriate set of basic lexical units is a critical first step in building a large vocabulary continuous speech recognition (LVCSR) system. Using the word as the basic unit avoids complications such as word boundary detection, but words tend to be long in languages with derivational morphology (e.g., agglutinative languages), and many writing systems, such as Chinese and Japanese, do not mark word boundaries at all. As a result, there is no fixed recipe for selecting the basic unit set in natural language processing applications. Taking a Uyghur LVCSR system as an example, this paper studies automatic speech recognition (ASR) systems built on units of different granularity across linguistic layers. The ASR results obtained with the different unit sets are aligned and compared, the error patterns are analyzed, and the misrecognized unit sequences are collected as training samples for choosing between the alternatives in a two-layer unit sequence structure. After comparing the strengths and weaknesses of the unit sets, the paper proposes a method that balances the advantages of long and short units. Experimental results show that the method both improves ASR accuracy and substantially reduces the lexicon size.
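The abstract does not spell out how the two layers of ASR output are aligned, so the following is only a minimal sketch of the general idea, not the authors' procedure. It assumes that the morpheme-layer hypothesis has already been merged back into word-sized units so that both hypotheses can be compared against the same reference transcript. The function names `correct_positions` and `collect_samples` and the toy tokens are hypothetical, and the alignment uses Python's standard `difflib.SequenceMatcher` as a stand-in for whatever alignment the paper uses.

```python
from difflib import SequenceMatcher


def correct_positions(reference, hypothesis):
    """Return the set of reference indices that the hypothesis reproduces.

    The alignment is a plain longest-matching-block alignment from difflib,
    used here as an assumed stand-in for the paper's alignment step.
    """
    matcher = SequenceMatcher(a=reference, b=hypothesis, autojunk=False)
    hits = set()
    for block in matcher.get_matching_blocks():
        hits.update(range(block.a, block.a + block.size))
    return hits


def collect_samples(reference, word_hyp, morph_hyp):
    """Collect units on which exactly one of the two layers was correct.

    Each sample is (reference unit, name of the layer that got it right).
    Units that both layers got right, or both got wrong, are skipped.
    """
    word_ok = correct_positions(reference, word_hyp)
    morph_ok = correct_positions(reference, morph_hyp)
    samples = []
    for i, unit in enumerate(reference):
        if (i in word_ok) != (i in morph_ok):
            better = "word" if i in word_ok else "morpheme"
            samples.append((unit, better))
    return samples


# Toy placeholder tokens, not real Uyghur data.
reference = ["w1", "w2", "w3", "w4"]
word_hyp = ["w1", "x", "w3", "w4"]    # word-layer system misses w2
morph_hyp = ["w1", "w2", "w3", "y"]   # morpheme-layer system misses w4
samples = collect_samples(reference, word_hyp, morph_hyp)
print(samples)
```

Samples of this kind, where the two layers disagree and one of them is right, are the sort of material that could serve as training data for deciding which layer's unit to prefer in a given context.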

Key words

speech recognition / language model / lexicon optimization / multilayer structure / agglutinative language / Uyghur

Cite this article

Mijit Ablimit, Akbar Pattar, Askar Hamdulla. Multilayer structure based lexicon optimization for language modeling[J]. Journal of Tsinghua University (Science and Technology). 2017, 57(3): 257-263 https://doi.org/10.16511/j.cnki.qhdxxb.2017.26.006
CLC number: TP391.1
