Abstract: The Uyghur language has little speech data available for training acoustic models because of various data acquisition and annotation difficulties. This paper describes a cross-lingual acoustic modeling method based on long short-term memory (LSTM) models. A large amount of Chinese training data is first used to train a deep neural network acoustic model. The weights of the network's output layer are then randomly reinitialized to create an output layer for Uyghur. A Uyghur acoustic model is then trained on Uyghur speech data, updating all the weights. Tests show that this method reduces the word error rates of Uyghur transcription and dictation recognition by 20% and 30%, respectively, relative to the baseline system. The method thus improves the Uyghur acoustic model by providing better initial weights for the hidden layers from the Chinese training data, and enhances the network's robustness.
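The transfer procedure in the abstract (reuse the source-language hidden layers as initialization, replace only the output layer, then fine-tune everything on target-language data) can be sketched as follows. This is a minimal illustration with NumPy rather than the paper's LSTM model; all layer sizes and the senone counts (3000 Chinese targets, 2000 Uyghur targets) are hypothetical, chosen only to make the shapes concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_network(layer_sizes, rng):
    """Randomly initialize weight matrices for a simple feed-forward net."""
    return [rng.normal(0.0, 0.1, size=(m, n))
            for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

# Step 1: train an acoustic model on the large Chinese corpus
# (training itself is omitted; only the initialized weights are shown).
# Hypothetical dims: 40-dim features, two 512-unit hidden layers, 3000 senones.
chinese_net = init_network([40, 512, 512, 3000], rng)

def transfer_to_target(source_net, n_target_senones, rng):
    """Keep the source-language hidden layers as initial weights and
    replace the output layer with a freshly randomized one sized for
    the target language's senone set."""
    hidden = [w.copy() for w in source_net[:-1]]   # reused hidden layers
    out_in_dim = source_net[-1].shape[0]           # width of last hidden layer
    new_output = rng.normal(0.0, 0.1, size=(out_in_dim, n_target_senones))
    return hidden + [new_output]

# Step 2: build the Uyghur network; step 3 would fine-tune ALL weights
# (hidden + new output) on the Uyghur speech data.
uyghur_net = transfer_to_target(chinese_net, 2000, rng)
```

The key design point the abstract makes is that only the output layer is language-specific; the hidden layers start from the Chinese-trained weights instead of random values, which is what gives the better initialization when Uyghur data is scarce.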