Journal of Tsinghua University (Science and Technology), 2018, Vol. 58, Issue 1: 55-60. DOI: 10.16511/j.cnki.qhdxxb.2018.21.001
AUTOMATION
Transfer learning for acoustic modeling of noise robust speech recognition
YI Jiangyan1,2, TAO Jianhua1,2,3, LIU Bin1, WEN Zhengqi1
1. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China;
2. School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100190, China;
3. CAS Center for Excellence in Brain Science and Intelligence Technology, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
Abstract: Speech recognition in noisy environments was improved by using transfer learning to train acoustic models. An acoustic model trained on noisy data (the student model) is guided during training by an acoustic model trained on clean data (the teacher model). The training forces the posterior probability distribution of the student model to be close to that of the teacher model by minimizing the Kullback-Leibler (KL) divergence between the two distributions. Tests on the CHiME-2 dataset show that this method gives a 7.29% absolute average word error rate (WER) improvement over the baseline model and a 3.92% absolute average WER improvement over the best CHiME-2 system.
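The teacher-student objective described in the abstract can be made concrete. With parallel data, the frame-level loss is the KL divergence from the clean-trained teacher's senone posteriors to the student's posteriors computed on the corresponding noisy frames (the notation below is ours, not the paper's):

\mathcal{L}_{\mathrm{KL}} = \sum_{t} \sum_{s} p_{T}(s \mid x_t) \log \frac{p_{T}(s \mid x_t)}{p_{S}(s \mid \tilde{x}_t)},

where x_t is the clean feature vector at frame t, \tilde{x}_t is the parallel noisy feature, and s ranges over the senone classes. Because the teacher is frozen, minimizing this loss with respect to the student is equivalent to cross-entropy training against the teacher's soft targets. A minimal PyTorch sketch of one such training step follows; the toy network sizes, feature shapes, and variable names are illustrative assumptions, not the authors' actual configuration:

import torch
import torch.nn.functional as F

n_frames, feat_dim, n_senones = 32, 40, 512    # assumed toy sizes

# Stand-ins for the clean-trained teacher and the student under training.
teacher = torch.nn.Linear(feat_dim, n_senones)
student = torch.nn.Linear(feat_dim, n_senones)
optimizer = torch.optim.SGD(student.parameters(), lr=0.1)

# Parallel data: the same frames, clean for the teacher, noisy for the student.
clean_feats = torch.randn(n_frames, feat_dim)
noisy_feats = clean_feats + 0.1 * torch.randn(n_frames, feat_dim)

teacher.eval()
with torch.no_grad():                           # the teacher stays frozen
    p_teacher = F.softmax(teacher(clean_feats), dim=-1)

log_p_student = F.log_softmax(student(noisy_feats), dim=-1)

# KL(teacher || student) averaged over frames; F.kl_div expects
# log-probabilities for the student and probabilities for the target.
loss = F.kl_div(log_p_student, p_teacher, reduction="batchmean")

optimizer.zero_grad()
loss.backward()
optimizer.step()

In practice the student would be a deep network producing senone posteriors for an HMM decoder, but the loss and the frozen-teacher pattern are the same.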
Keywords: robust speech recognition; acoustic model; deep neural network; transfer learning
CLC numbers: TP391.42; TP183
Issue Date: 15 January 2018
Cite this article:
YI Jiangyan, TAO Jianhua, LIU Bin, et al. Transfer learning for acoustic modeling of noise robust speech recognition[J]. Journal of Tsinghua University (Science and Technology), 2018, 58(1): 55-60.
URL: http://jst.tsinghuajournals.com/EN/10.16511/j.cnki.qhdxxb.2018.21.001 or http://jst.tsinghuajournals.com/EN/Y2018/V58/I1/55