Multilayer structure based lexicon optimization for language modeling
Mijit Ablimit1,2, Akbar Pattar2, Askar Hamdulla1,2
1. School of Science and Technology, Xinjiang University, Urumqi 830046, China;
2. School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
Abstract: Selecting an appropriate lexicon set is an important first step in developing large vocabulary continuous speech recognition (LVCSR) systems. The word is typically chosen as the lexicon unit to avoid word boundary detection problems. However, selecting the lexicon basis is not as simple for languages with derivational morphology (e.g., agglutinative languages). Furthermore, many languages, such as Chinese and Japanese, have no explicit word boundaries. This paper uses a Uyghur LVCSR system to analyze various particle-based automatic speech recognition (ASR) systems, comparing the ASR results for different linguistic layers in order to develop a method that balances the advantages of the lexicons of two layers. The ASR results for the two layers are aligned and compared to analyze error patterns and to extract samples as training data for the alternative selection method. Tests show that this method effectively improves ASR accuracy with a small lexicon size.