Transfer learning for acoustic modeling of noise robust speech recognition
YI Jiangyan1,2, TAO Jianhua1,2,3, LIU Bin1, WEN Zhengqi1
1. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China;
2. School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100190, China;
3. CAS Center for Excellence in Brain Science and Intelligence Technology, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
Abstract: Speech recognition in noisy environments was improved by using transfer learning to train acoustic models. The training of an acoustic model on noisy data (the student model) is guided by an acoustic model trained on clean data (the teacher model). This training process forces the posterior probability distribution of the student model to approach that of the teacher model by minimizing the Kullback-Leibler (KL) divergence between the two posterior probability distributions. Tests on the CHiME-2 dataset show that this method gives a 7.29% absolute average word error rate (WER) improvement over the baseline model and a 3.92% absolute average WER improvement over the best CHiME-2 system.
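The teacher–student objective described in the abstract can be sketched as follows. This is a minimal NumPy illustration of the frame-level KL-divergence loss, not the authors' implementation; the array shapes, the senone count, and the function name are assumptions made for the example.

```python
import numpy as np

def kl_divergence(p_teacher, p_student, eps=1e-12):
    """KL(teacher || student), averaged over frames.

    p_teacher, p_student: arrays of shape (num_frames, num_senones)
    holding posterior probability distributions (each row sums to 1).
    eps guards against log(0).
    """
    p_t = np.clip(p_teacher, eps, 1.0)
    p_s = np.clip(p_student, eps, 1.0)
    # Per-frame KL divergence, then mean over frames.
    return float(np.mean(np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=1)))

# Toy example: teacher posteriors over 3 senones for 2 frames (clean-data
# model), and a student (noisy-data model) that has not yet matched them.
teacher = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.8, 0.1]])
student = np.array([[0.5, 0.3, 0.2],
                    [0.2, 0.6, 0.2]])

loss = kl_divergence(teacher, student)
```

Minimizing this quantity with respect to the student's parameters pulls the student's posteriors toward the teacher's; when the two distributions coincide, the loss is zero.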
YI Jiangyan, TAO Jianhua, LIU Bin, WEN Zhengqi. Transfer learning for acoustic modeling of noise robust speech recognition[J]. Journal of Tsinghua University (Science and Technology), 2018, 58(1): 55-60.