Journal of Tsinghua University(Science and Technology), 2018, Vol. 58, Issue 3: 249-253. DOI: 10.16511/j.cnki.qhdxxb.2018.25.016
COMPUTER SCIENCE AND TECHNOLOGY
Long short-term memory with attention and multitask learning for distant speech recognition
ZHANG Yu1,2, ZHANG Pengyuan1,2, YAN Yonghong1,2,3
1. Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China;
2. University of Chinese Academy of Sciences, Beijing 100049, China;
3. Xinjiang Laboratory of Minority Speech and Language Information Processing, Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumchi 830011, China
Abstract  Distant speech recognition remains a challenging task owing to background noise, reverberation, and competing acoustic sources. This work describes a long short-term memory (LSTM) acoustic model with an attention mechanism and a multitask learning architecture for distant speech recognition. The attention mechanism is embedded in the acoustic model to automatically tune its attention over the spliced context input, which significantly improves the model's ability to represent distant speech. A multitask learning architecture, trained to jointly predict the acoustic model states and the clean features, further improves robustness. Evaluations on the AMI meeting corpus show that the model reduces the word error rate (WER) by 1.5% compared with the baseline model.
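To make the architecture described above concrete, the following is a minimal sketch (in PyTorch) of an LSTM acoustic model that applies attention over a spliced context window and is trained with two heads: acoustic-state classification and clean-feature regression. All layer sizes, the context width, the number of states, and the loss weighting are illustrative assumptions; the abstract does not specify the paper's actual configuration.

import torch
import torch.nn as nn


class AttentionMultitaskLSTM(nn.Module):
    """LSTM acoustic model with attention over a spliced context window
    and two output heads trained jointly (multitask learning)."""

    def __init__(self, feat_dim=40, hidden=512, num_states=4000):
        super().__init__()
        # Attention scores one weight per frame of the spliced context.
        self.attn = nn.Linear(feat_dim, 1)
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.state_head = nn.Linear(hidden, num_states)  # acoustic-state posteriors
        self.clean_head = nn.Linear(hidden, feat_dim)    # clean-feature regression

    def forward(self, x):
        # x: (batch, time, context, feat_dim) -- each frame arrives with its
        # spliced left/right context frames.
        weights = torch.softmax(self.attn(x).squeeze(-1), dim=-1)   # (b, t, c)
        pooled = torch.einsum("btc,btcd->btd", weights, x)          # weighted sum over context
        h, _ = self.lstm(pooled)                                    # (b, t, hidden)
        return self.state_head(h), self.clean_head(h)


if __name__ == "__main__":
    model = AttentionMultitaskLSTM()
    feats = torch.randn(8, 100, 11, 40)                 # dummy spliced features
    state_logits, clean_pred = model(feats)

    # Joint objective: cross-entropy on the state labels plus an MSE term
    # toward parallel clean features, weighted by an interpolation factor
    # (0.1 here is an arbitrary placeholder, not the paper's value).
    states = torch.randint(0, 4000, (8, 100))
    clean = torch.randn(8, 100, 40)
    ce = nn.CrossEntropyLoss()(state_logits.reshape(-1, 4000), states.reshape(-1))
    mse = nn.MSELoss()(clean_pred, clean)
    loss = ce + 0.1 * mse
    loss.backward()

In this sketch the attention pooling stands in for the usual flat concatenation of the spliced frames, which is the idea the abstract describes; the auxiliary clean-feature head acts as a denoising regularizer on the shared LSTM representation.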
Keywords: speech recognition; long short-term memory; acoustic model; attention mechanism; multitask learning
CLC number: TN912.34
Issue Date: 15 March 2018
Cite this article:   
ZHANG Yu, ZHANG Pengyuan, YAN Yonghong. Long short-term memory with attention and multitask learning for distant speech recognition[J]. Journal of Tsinghua University(Science and Technology), 2018, 58(3): 249-253.
URL: http://jst.tsinghuajournals.com/EN/10.16511/j.cnki.qhdxxb.2018.25.016 or http://jst.tsinghuajournals.com/EN/Y2018/V58/I3/249