Journal of Tsinghua University (Science and Technology), 2018, Vol. 58, Issue (1): 55-60    DOI: 10.16511/j.cnki.qhdxxb.2018.21.001
Section: Automation
Transfer learning for acoustic modeling of noise robust speech recognition
YI Jiangyan1,2, TAO Jianhua1,2,3, LIU Bin1, WEN Zhengqi1
1. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China;
2. School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100190, China;
3. CAS Center for Excellence in Brain Science and Intelligence Technology, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
Abstract: Speech recognition in noisy environments is improved by using transfer learning to train the acoustic model. The training of an acoustic model trained on noisy data (the student model) is guided by an acoustic model trained on clean data (the teacher model): the posterior probability distribution of the student model is forced close to that of the teacher model by minimizing the Kullback-Leibler (KL) divergence between the two distributions. Tests on the CHiME-2 dataset show that this method gives a 7.29% absolute average word error rate (WER) reduction over the baseline model and a 3.92% absolute average WER reduction over the best CHiME-2 system.
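The KL objective described in the abstract can be sketched numerically. The following is a minimal illustration assuming NumPy; the frame count, senone count, and random logits are stand-ins for real acoustic-model outputs, not the paper's actual setup, and only the loss is shown (the student would be trained by gradient descent on this quantity):

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis (senone posteriors per frame).
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_loss(teacher_post, student_post, eps=1e-12):
    """Frame-level KL(teacher || student), averaged over the batch.

    eps guards the logarithms against zero probabilities."""
    return float(np.mean(np.sum(
        teacher_post * (np.log(teacher_post + eps) - np.log(student_post + eps)),
        axis=-1)))

rng = np.random.default_rng(0)
num_frames, num_senones = 4, 10                       # illustrative sizes
teacher_logits = rng.normal(size=(num_frames, num_senones))   # clean-speech model
student_logits = teacher_logits + 0.5 * rng.normal(size=(num_frames, num_senones))

loss = kl_loss(softmax(teacher_logits), softmax(student_logits))
# loss shrinks toward zero as the student's posteriors approach the teacher's
```

The loss is zero exactly when the two posterior distributions coincide, which is what drives the student toward the teacher's behavior during training.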
Key words: robust speech recognition; acoustic model; deep neural network; transfer learning
Received: 2017-09-29      Published: 2018-01-15
CLC number: TP391.42; TP183
Corresponding author: TAO Jianhua, professor, E-mail: jhtao@nlpr.ia.ac.cn
Cite this article:
YI Jiangyan, TAO Jianhua, LIU Bin, WEN Zhengqi. Transfer learning for acoustic modeling of noise robust speech recognition. Journal of Tsinghua University (Science and Technology), 2018, 58(1): 55-60.
Link this article:
http://jst.tsinghuajournals.com/CN/10.16511/j.cnki.qhdxxb.2018.21.001  or  http://jst.tsinghuajournals.com/CN/Y2018/V58/I1/55
Fig. 1 Training procedure in which the teacher model guides the student model
Table 1 WER of different acoustic models on the noisy speech test set
Table 2 WER of the GMM and teacher models on the clean speech dataset
Table 3 WER of the student model on the noisy test set (eval92_5k)
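The WER reported in the tables above is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. A minimal Python sketch of that computation (the sentences below are illustrative, not CHiME-2 transcripts):

```python
def wer(ref, hyp):
    """Word error rate: Levenshtein distance over words / reference length."""
    r, h = ref.split(), hyp.split()
    # DP table: d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub,           # substitution or match
                          d[i - 1][j] + 1,   # deletion
                          d[i][j - 1] + 1)   # insertion
    return d[len(r)][len(h)] / len(r)
```

An "absolute WER reduction", as reported in the abstract, is the plain difference between two such values (e.g. 0.30 versus 0.2271 is a 7.29% absolute reduction).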
[1] HINTON G, DENG L, YU D, et al. Deep neural networks for acoustic modeling in speech recognition:The shared views of four research groups[J]. IEEE Signal Processing Magazine, 2012, 29(6):82-97.
[2] GRAVES A, MOHAMED A R, HINTON G. Speech recognition with deep recurrent neural networks[C]//IEEE International Conference on Acoustics, Speech and Signal Processing. Vancouver, Canada, 2013:6645-6649.
[3] SAK H, SENIOR A, BEAUFAYS F. Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition[J]. Computer Science, 2014(3):338-342.
[4] XIONG W, DROPPO J, HUANG X, et al. The Microsoft 2016 conversational speech recognition system[R/OL]. (2016-09-12)[2017-02-25]. https://arxiv.org/abs/1609.03528.
[5] SAON G, SERCU T, RENNIE S, et al. The IBM 2016 English conversational telephone speech recognition system[R/OL]. (2016-04-27)[2017-02-25]. https://arxiv.org/abs/1604.08242.
[6] CAI S, JIN X, GAO S X, et al. Noise robust speech recognition based on sub-band energy warping perception linear prediction coefficient[J]. Chinese Journal of Acoustics, 2012(6):667-672. (in Chinese)
[7] HU X Y, ZOU Y X, WANG W M. Robust noise feature compensation method for speech recognition based on missing data technology[J]. Journal of Tsinghua University (Science and Technology), 2013(6):753-756. (in Chinese)
[8] GALES M J F, PYE D, WOODLAND P C. Variance compensation within the MLLR framework for robust speech recognition and speaker adaptation[C]//International Conference on Spoken Language. Philadelphia, USA, 1996:1832-1835.
[9] SIOHAN O, CHESTA C, LEE C H. Hidden Markov model adaptation using maximum a posteriori linear regression[C]//Workshop on Robust Methods for Speech Recognition in Adverse Conditions. Tampere, Finland, 1999:147-150.
[10] TRAN D T, DELCROIX M, OGAWA A, et al. Factorized linear input network for acoustic model adaptation in noisy conditions[C]//Conference of the International Speech Communication Association. San Francisco, USA, 2016:3813-3817.
[11] SELTZER M L, YU D, WANG Y. An investigation of deep neural networks for noise robust speech recognition[C]//IEEE International Conference on Acoustics, Speech and Signal Processing. Vancouver, Canada, 2013:7398-7402.
[12] YU D, SELTZER M L, LI J, et al. Feature learning in deep neural networks:Studies on speech recognition tasks[J]. Computer Science, 2013(2):329-338.
[13] LI B, SIM K C. A spectral masking approach to noise-robust speech recognition using deep neural networks[J]. IEEE/ACM Transactions on Audio, Speech & Language Processing, 2014, 22(8):1296-1305.
[14] WANG Q, WU X, DU J, et al. DNN based feature fusion for noise robust speech recognition[C]//National Conference on Man-Machine Speech Communication. Tianjin: Tianjin University, 2015:23-29. (in Chinese)
[15] ABE A, YAMAMOTO K, NAKAGAWA S. Robust speech recognition using DNN-HMM acoustic model combining noise-aware training with spectral subtraction[C]//Conference of the International Speech Communication Association. Dresden, Germany, 2015:2849-2853.
[16] XU Y, DU J, DAI L, et al. Dynamic noise aware training for speech enhancement based on deep neural networks[C]//Conference of the International Speech Communication Association. Singapore, 2014:2670-2674.
[17] VINCENT P, LAROCHELLE H, BENGIO Y, et al. Extracting and composing robust features with denoising autoencoders[C]//International Conference on Machine Learning. Helsinki, Finland, 2008:1096-1103.
[18] KANG H L, KANG S J, KANG W H, et al. Two-stage noise aware training using asymmetric deep denoising autoencoder[C]//IEEE International Conference on Acoustics, Speech and Signal Processing. Shanghai, 2016:5765-5769.
[19] MIMURA M, SAKAI S, KAWAHARA T. Joint optimization of denoising autoencoder and DNN acoustic model based on multi-target learning for noisy speech recognition[C]//Conference of the International Speech Communication Association. Dresden, Germany, 2016:3803-3807.
[20] QIAN Y, TAN T, YU D. An investigation into using parallel data for far-field speech recognition[C]//IEEE International Conference on Acoustics, Speech and Signal Processing. Shanghai, 2016:5725-5729.
[21] BUCILUĂ C, CARUANA R, NICULESCU-MIZIL A. Model compression[C]//ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Philadelphia, USA, 2006:535-541.
[22] HINTON G, VINYALS O, DEAN J. Distilling the knowledge in a neural network[R/OL]. (2015-03-09). https://arxiv.org/abs/1503.02531.
[23] LI J. Learning small-size DNN with output-distribution-based criteria[C]//Conference of the International Speech Communication Association, Singapore, 2014:2650-2654.
[24] CHAN W, KE N R, LANE I. Transferring knowledge from a RNN to a DNN[J]. Computer Science, 2015(7):138-143.
[25] CHEBOTAR Y, WATERS A. Distilling knowledge from ensembles of neural networks for speech recognition[C]//Conference of the International Speech Communication Association. Dresden, Germany, 2016:3439-3443.
[26] VINCENT E, BARKER J, WATANABE S, et al. The second "CHiME" speech separation and recognition challenge:Datasets, tasks and baselines[C]//IEEE International Conference on Acoustics, Speech and Signal Processing. Vancouver, Canada, 2013:126-130.
[27] POVEY D, GHOSHAL A, BOULIANNE G, et al. The Kaldi speech recognition toolkit[C]//IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. Big Island, USA, 2011.
[28] TACHIOKA Y. Discriminative methods for noise robust speech recognition:A CHiME challenge benchmark[C]//CHiME Workshop. Vancouver, Canada, 2013:6935-6939.