Journal of Tsinghua University (Science and Technology)  2018, Vol. 58 Issue (3): 249-253    DOI: 10.16511/j.cnki.qhdxxb.2018.25.016
Computer Science and Technology
Long short-term memory with attention and multitask learning for distant speech recognition
ZHANG Yu1,2, ZHANG Pengyuan1,2, YAN Yonghong1,2,3
1. Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China;
2. University of Chinese Academy of Sciences, Beijing 100049, China;
3. Xinjiang Laboratory of Minority Speech and Language Information Processing, Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumchi 830011, China
Abstract: Distant speech recognition remains a challenging task owing to background noise, reverberation, and competing acoustic sources. This work describes a long short-term memory (LSTM) acoustic model with an attention mechanism and a multitask learning architecture for distant speech recognition. The attention mechanism is embedded in the acoustic model so that the model automatically tunes its attention to the spliced context input, which significantly improves its ability to model distant speech. A multitask learning architecture, trained to jointly predict the acoustic states and the clean features, further improves robustness. Evaluations on the AMI meeting corpus show that the model reduces the word error rate (WER) by 1.5% absolute over the baseline model.
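The two ideas in the abstract can be sketched together: an attention weighting over the spliced context frames replaces a flat concatenation, and two output heads share the same hidden representation (the primary senone-classification task plus an auxiliary clean-feature regression). This is a minimal NumPy sketch, not the paper's implementation; all dimensions, weights, and the loss weight `lam` are illustrative assumptions, and a single tanh layer stands in for the LSTM.

```python
import numpy as np

# Hypothetical dimensions (illustrative only; not from the paper).
rng = np.random.default_rng(0)
n_ctx, feat_dim, hid_dim, n_states = 11, 40, 32, 100

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Spliced context input: n_ctx frames of feat_dim-dim features
# around the current frame.
context = rng.standard_normal((n_ctx, feat_dim))

# Attention: score each context frame against a query vector (in the paper,
# derived from the recurrent state), then take a weighted sum instead of
# feeding the flat concatenation forward.
W_score = rng.standard_normal((hid_dim, feat_dim)) * 0.1
query = rng.standard_normal(hid_dim)
scores = query @ W_score @ context.T     # one relevance score per frame
alpha = softmax(scores)                  # attention weights, sum to 1
attended = alpha @ context               # weighted combination, feat_dim

# Stand-in for the LSTM layer: one shared hidden representation.
h = np.tanh(rng.standard_normal((hid_dim, feat_dim)) * 0.1 @ attended)

# Multitask heads: one predicts acoustic-state posteriors, the other
# reconstructs the clean features from the same hidden layer.
W_state = rng.standard_normal((n_states, hid_dim)) * 0.1
W_clean = rng.standard_normal((feat_dim, hid_dim)) * 0.1
state_post = softmax(W_state @ h)        # primary task: acoustic states
clean_pred = W_clean @ h                 # auxiliary task: clean features

# Joint loss: cross-entropy on the state target plus a weighted MSE
# on the clean-feature target (lam is an assumed interpolation weight).
target_state = 3
clean_target = rng.standard_normal(feat_dim)
lam = 0.1
loss = (-np.log(state_post[target_state])
        + lam * np.mean((clean_pred - clean_target) ** 2))
print(float(loss))
```

At training time both heads are optimized jointly; at decoding time the clean-feature head is discarded and only the state posteriors are used, which is what makes the auxiliary task a regularizer rather than an extra runtime cost.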
Key words: speech recognition; long short-term memory; acoustic model; attention mechanism; multitask learning
Received: 2017-09-20      Published online: 2018-03-14
CLC number: TN912.34
Supported by: the National Natural Science Foundation of China (U1536117, 11590770, 11590771, 11590772, 11590773, 11590774) and the National Key Research and Development Program of China (2016YFB0801203, 2016YFB0801200)
Corresponding author: ZHANG Pengyuan, professor, E-mail: zhangpengyuan@hccl.ioa.ac.cn
About the author: ZHANG Yu (1991-), female, Ph.D. candidate.
Cite this article:
ZHANG Yu, ZHANG Pengyuan, YAN Yonghong. Long short-term memory with attention and multitask learning for distant speech recognition[J]. Journal of Tsinghua University (Science and Technology), 2018, 58(3): 249-253.
Link to this article:
http://jst.tsinghuajournals.com/CN/10.16511/j.cnki.qhdxxb.2018.25.016  or  http://jst.tsinghuajournals.com/CN/Y2018/V58/I3/249
Fig. 1 LSTM memory block
Fig. 2 LSTM acoustic model with the attention mechanism and the multitask learning framework
Table 1 Word error rates of the LSTM and ALSTM models
Table 2 Word error rates after adding the multitask learning framework
Fig. 3 Frame-level log-likelihood of the models
Fig. 4 Frame accuracy of the models
[1] HINTON G, DENG L, YU D, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups[J]. IEEE Signal Processing Magazine, 2012, 29(6): 82-97.
[2] SAK H, SENIOR A, BEAUFAYS F. Long short-term memory recurrent neural network architectures for large scale acoustic modeling[C]//15th Annual Conference of the International Speech Communication Association. Singapore: IEEE, 2014: 338-342.
[3] SWIETOJANSKI P, GHOSHAL A, RENALS S. Hybrid acoustic models for distant and multichannel large vocabulary speech recognition[C]//IEEE Workshop on Automatic Speech Recognition and Understanding. Olomouc, Czech Republic: IEEE, 2013: 285-290.
[4] BAHDANAU D, CHOROWSKI J, SERDYUK D, et al. End-to-end attention-based large vocabulary speech recognition[C]//IEEE International Conference on Acoustics, Speech and Signal Processing. Shanghai, China: IEEE, 2016: 4945-4949.
[5] LU L, ZHANG X, CHO K, et al. A study of the recurrent neural network encoder-decoder for large vocabulary speech recognition[C]//16th Annual Conference of the International Speech Communication Association. Dresden, Germany: IEEE, 2015: 3249-3253.
[6] YU D, XIONG W, DROPPO J, et al. Deep convolutional neural networks with layer-wise context expansion and attention[C]//17th Annual Conference of the International Speech Communication Association. San Francisco, CA, USA: IEEE, 2016: 17-21.
[7] CARLETTA J. Unleashing the killer corpus: Experiences in creating the multi-everything AMI meeting corpus[J]. Language Resources and Evaluation, 2007, 41(2): 181-190.
[8] BENGIO Y, SIMARD P, FRASCONI P. Learning long-term dependencies with gradient descent is difficult[J]. IEEE Transactions on Neural Networks, 1994, 5(2): 157-166.
[9] BELL P, RENALS S. Regularization of context-dependent deep neural networks with context-independent multi-task training[C]//IEEE International Conference on Acoustics, Speech and Signal Processing. Brisbane, Australia: IEEE, 2015: 4290-4294.
[10] HUANG J T, LI J, YU D, et al. Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers[C]//IEEE International Conference on Acoustics, Speech and Signal Processing. Vancouver, Canada: IEEE, 2013: 7304-7308.
[11] GAO T, DU J, DAI L, et al. Joint training of front-end and back-end deep neural networks for robust speech recognition[C]//IEEE International Conference on Acoustics, Speech and Signal Processing. South Brisbane, Australia: IEEE, 2015: 4375-4379.
[12] POVEY D, GHOSHAL A, BOULIANNE G, et al. The Kaldi speech recognition toolkit[C]//IEEE Workshop on Automatic Speech Recognition and Understanding. Hawaii, USA: IEEE, 2011.
Copyright © Editorial Office of the Journal of Tsinghua University (Science and Technology)