Transfer learning for acoustic modeling of noise robust speech recognition
YI Jiangyan1,2, TAO Jianhua1,2,3, LIU Bin1, WEN Zhengqi1
1. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China;
2. School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100190, China;
3. CAS Center for Excellence in Brain Science and Intelligence Technology, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
Abstract: Speech recognition in noisy environments was improved by using transfer learning to train acoustic models. The training of an acoustic model on noisy data (the student model) is guided by an acoustic model trained on clean data (the teacher model). This training process forces the posterior probability distribution of the student model to approach that of the teacher model by minimizing the Kullback-Leibler (KL) divergence between the two posterior probability distributions. Tests on the CHiME-2 dataset show that this method gives a 7.29% absolute average word error rate (WER) improvement over the baseline model and a 3.92% absolute average WER improvement over the best CHiME-2 system.
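The teacher–student objective described in the abstract can be sketched as follows. This is a minimal NumPy illustration of the frame-level KL-divergence loss, not the authors' implementation; the array shapes, the senone count, and the function name are assumptions made for the example.

```python
import numpy as np

def kl_divergence(p_teacher, p_student, eps=1e-12):
    """KL(teacher || student), averaged over frames.

    p_teacher, p_student: arrays of shape (num_frames, num_senones)
    holding posterior probability distributions (each row sums to 1).
    eps guards against log(0).
    """
    p_t = np.clip(p_teacher, eps, 1.0)
    p_s = np.clip(p_student, eps, 1.0)
    # Per-frame KL divergence, then mean over frames.
    return float(np.mean(np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=1)))

# Toy example: teacher posteriors over 3 senones for 2 frames (clean-data
# model), and a student (noisy-data model) that has not yet matched them.
teacher = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.8, 0.1]])
student = np.array([[0.5, 0.3, 0.2],
                    [0.2, 0.6, 0.2]])

loss = kl_divergence(teacher, student)
```

Minimizing this quantity with respect to the student's parameters pulls the student's posteriors toward the teacher's; when the two distributions coincide, the loss is zero.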
YI Jiangyan, TAO Jianhua, LIU Bin, WEN Zhengqi. Transfer learning for acoustic modeling of noise robust speech recognition[J]. Journal of Tsinghua University (Science and Technology), 2018, 58(1): 55-60.