Long short-term memory with attention and multitask learning for distant speech recognition

doi:10.16511/j.cnki.qhdxxb.2018.25.016


Abstract: Distant speech recognition remains challenging owing to background noise, reverberation, and competing speakers. This work proposes a long short-term memory (LSTM) recurrent neural network acoustic model that combines an attention mechanism with a multitask learning framework for distant speech recognition. The attention mechanism embedded in the model lets it learn automatically how much weight to give each part of the spliced (extended-context) feature input, which significantly improves its ability to model distant speech. To further improve robustness, a multitask learning framework trains the model to jointly predict the acoustic states and the clean features. Experiments on the AMI meeting corpus show that the LSTM model with the attention mechanism and the multitask learning framework achieves an absolute word error rate (WER) reduction of 1.5% compared with the baseline model.
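The two ideas named in the abstract, attention over a spliced context window and a joint objective over acoustic states and clean features, can be illustrated with a minimal sketch. It is not the authors' implementation: the dot-product scoring vector `score_weights`, the cross-entropy plus weighted mean-squared-error form of the loss, and the `aux_weight` parameter are all assumptions made for illustration, since the abstract does not specify them.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_context(frames, score_weights):
    """Reweight the frames of a spliced context window.

    frames: list of feature vectors, one per frame in the window.
    score_weights: hypothetical scoring vector (an assumption); each
        frame is scored by its dot product with this vector.
    Returns the attention weights and the reweighted frames.
    """
    scores = [sum(w * x for w, x in zip(score_weights, frame))
              for frame in frames]
    weights = softmax(scores)
    reweighted = [[a * x for x in frame]
                  for a, frame in zip(weights, frames)]
    return weights, reweighted


def multitask_loss(state_posteriors, target_state,
                   predicted_clean, clean_target, aux_weight=0.1):
    """Joint loss for one frame (assumed form, not taken from the paper).

    Primary task: cross-entropy on the acoustic state posterior.
    Auxiliary task: mean squared error against the clean features.
    aux_weight is a hypothetical interpolation weight.
    """
    cross_entropy = -math.log(state_posteriors[target_state])
    mse = sum((p - c) ** 2
              for p, c in zip(predicted_clean, clean_target)) / len(clean_target)
    return cross_entropy + aux_weight * mse


# Toy usage: a three-frame window of two-dimensional features.
window = [[0.2, 0.1], [0.9, 0.4], [0.3, 0.2]]
weights, reweighted = attend_context(window, score_weights=[1.0, 0.5])
loss = multitask_loss([0.7, 0.3], 0, reweighted[1], [0.5, 0.2])
```

In a real acoustic model the scoring vector would be learned jointly with the LSTM, and the reweighted window would be fed to the recurrent layers; here the weights are fixed only to keep the sketch self-contained.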

Keywords: speech recognition, long short-term memory, acoustic model, attention mechanism, multitask learning

Received: 2017-09-20 Published: 2018-03-15

CLC number (ZTFLH): TN912.34

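Funding: Supported by the National Natural Science Foundation of China (U1536117, 11590770, 11590771, 11590772, 11590773, 11590774) and the National Key Research and Development Program of China, Key Special Project (2016YFB0801203, 2016YFB0801200)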

Corresponding author: ZHANG Pengyuan, research professor, E-mail: zhangpengyuan@hccl.ioa.ac.cn

About the author: ZHANG Yu (b. 1991), female, Ph.D. candidate.

Cite this article:

ZHANG Yu, ZHANG Pengyuan, YAN Yonghong. Long short-term memory with attention and multitask learning for distant speech recognition[J]. Journal of Tsinghua University (Science and Technology), 2018, 58(3): 249-253.

Link to this article:

http://jst.tsinghuajournals.com/CN/10.16511/j.cnki.qhdxxb.2018.25.016 or http://jst.tsinghuajournals.com/CN/Y2018/V58/I3/249

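Fig. 1 LSTM memory block

Fig. 2 LSTM acoustic model based on the attention mechanism and the multitask learning framework

Table 1 Comparison of the word error rates of the LSTM and ALSTM models

Table 2 Comparison of the word error rates of the models after adding the multitask learning framework

Fig. 3 Comparison of the changes in frame-level log-likelihood of the models

Fig. 4 Comparison of the changes in frame accuracy of the models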


