Computer Science and Technology

Long short-term memory with attention and multitask learning for distant speech recognition

  • ZHANG Yu ,
  • ZHANG Pengyuan ,
  • YAN Yonghong
  • 1. Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China;
    2. University of Chinese Academy of Sciences, Beijing 100049, China;
    3. Xinjiang Laboratory of Minority Speech and Language Information Processing, Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumchi 830011, China
ZHANG Yu (1991-), female, Ph.D. candidate.

Received date: 2017-09-20

Online published: 2018-03-15

Funding

Supported by the National Natural Science Foundation of China (U1536117, 11590770, 11590771, 11590772, 11590773, 11590774) and the National Key Research and Development Program of China (2016YFB0801203, 2016YFB0801200).



Cite this article

ZHANG Yu, ZHANG Pengyuan, YAN Yonghong. Long short-term memory with attention and multitask learning for distant speech recognition[J]. Journal of Tsinghua University (Science and Technology), 2018, 58(3): 249-253. DOI: 10.16511/j.cnki.qhdxxb.2018.25.016

Abstract

Distant speech recognition remains a challenging task owing to background noise, reverberation, and competing acoustic sources. This work describes a long short-term memory (LSTM) acoustic model with an attention mechanism and a multitask learning architecture for distant speech recognition. The attention mechanism is embedded in the acoustic model so that it automatically tunes its attention over the spliced context input, which significantly improves the model's ability to model distant speech. A multitask learning architecture, trained to jointly predict the acoustic states and the clean features, further improves robustness. Evaluations on the AMI meeting corpus show that the model reduces the word error rate (WER) by 1.5% absolute over the baseline model.
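The two ideas in the abstract, attention weights over a spliced context window and two task heads (senone classification plus clean-feature regression) sharing one representation, can be sketched as follows. This is a minimal NumPy illustration under assumed dimensions, not the authors' implementation: the scoring function `window @ w_att`, the context half-width, and the head sizes are all illustrative choices.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend_context(frames, t, context, w_att):
    """Attention-weighted combination of the spliced context window
    around frame t, instead of a plain concatenation of the frames."""
    lo, hi = max(t - context, 0), min(t + context + 1, len(frames))
    window = frames[lo:hi]          # (W, D) context frames
    scores = window @ w_att         # (W,) one score per context frame
    alpha = softmax(scores)         # (W,) attention weights, sum to 1
    return alpha @ window, alpha    # (D,) fused feature, plus the weights

rng = np.random.default_rng(0)
D, C, T = 40, 5, 100                # feature dim, context half-width, frames
frames = rng.standard_normal((T, D))
w_att = rng.standard_normal(D)

fused, alpha = attend_context(frames, t=50, context=C, w_att=w_att)

# Multitask heads sharing the fused representation: one predicts
# acoustic states, the other regresses toward the clean features.
n_states = 300
w_state = rng.standard_normal((D, n_states))  # state classification head
w_clean = rng.standard_normal((D, D))         # clean-feature regression head
state_post = softmax(fused @ w_state)         # (n_states,) state posteriors
clean_est = fused @ w_clean                   # (D,) clean-feature estimate
```

In training, the classification head would receive a cross-entropy loss against the state targets and the regression head a mean-squared-error loss against the clean features; the recurrent (LSTM) layers between the attention and the heads are omitted here for brevity.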
