Journal of Tsinghua University (Science and Technology)  2018, Vol. 58 Issue (5): 509-515    DOI: 10.16511/j.cnki.qhdxxb.2018.25.022
Computer Science and Technology
DNN-LSTM based VAD algorithm
ZHANG Xueying (张雪英), NIU Puhua (牛溥华), GAO Fan (高帆)
College of Information Engineering, Taiyuan University of Technology, Taiyuan 030024, China
Abstract: Voice activity detection (VAD) algorithms based on deep neural networks (DNN) ignore the temporal correlation of acoustic features across speech frames, so their performance degrades markedly in noisy environments. This paper proposes a hybrid network structure that combines a DNN with long short-term memory (LSTM) units for VAD. The method further analyzes and exploits the dynamic information of the speech frames, and the DNN-LSTM network is trained with a cost function based on context information. The experimental corpus is built from the TIDIGITS speech database, with noise added from the Noisex-92 noise database. The results show that, under different noise conditions, the DNN-LSTM based VAD method outperforms the DNN-based VAD method, and that the new cost function suits the proposed algorithm better than the traditional cost function.
Key words: voice activity detection (VAD); deep neural network (DNN); long short-term memory (LSTM)
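The hybrid structure described in the abstract can be illustrated with a minimal sketch: a feed-forward layer transforms each frame, and an LSTM cell carries state from frame to frame before a sigmoid output gives a per-frame speech probability. Everything specific here is an assumption made for illustration, not the authors' configuration: the names `make_model` and `vad_posteriors`, the layer sizes, the single `tanh` feed-forward layer, and the random untrained weights. Training and the context-based cost function are not shown, since this page does not define them.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, b, x):
    """Compute W @ x + b on plain Python lists."""
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]

def init(rows, cols, rng):
    """Small random weight matrix and a zero bias (hypothetical initialisation)."""
    W = [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return W, [0.0] * rows

def lstm_step(x, h, c, p):
    """One standard LSTM step with input, forget and output gates."""
    z = x + h  # concatenate the current input with the previous hidden state
    i = [sigmoid(v) for v in affine(*p["i"], z)]
    f = [sigmoid(v) for v in affine(*p["f"], z)]
    o = [sigmoid(v) for v in affine(*p["o"], z)]
    g = [math.tanh(v) for v in affine(*p["g"], z)]
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new

def make_model(n_in, n_dnn, n_lstm, seed=0):
    """Untrained DNN layer -> LSTM cell -> single sigmoid output unit."""
    rng = random.Random(seed)
    return {
        "dnn": init(n_dnn, n_in, rng),
        "lstm": {k: init(n_lstm, n_dnn + n_lstm, rng) for k in "ifog"},
        "out": init(1, n_lstm, rng),
    }

def vad_posteriors(frames, model):
    """Return one speech probability per frame, in frame order."""
    n_lstm = len(model["out"][0][0])
    h, c = [0.0] * n_lstm, [0.0] * n_lstm
    probs = []
    for x in frames:
        hidden = [math.tanh(v) for v in affine(*model["dnn"], x)]  # frame-wise DNN layer
        h, c = lstm_step(hidden, h, c, model["lstm"])  # state links this frame to earlier ones
        probs.append(sigmoid(affine(*model["out"], h)[0]))
    return probs

# Toy usage with random 4-dimensional "features" for six frames.
rng = random.Random(1)
frames = [[rng.random() for _ in range(4)] for _ in range(6)]
model = make_model(n_in=4, n_dnn=8, n_lstm=5)
probs = vad_posteriors(frames, model)
labels = [p > 0.5 for p in probs]  # speech / non-speech decision per frame
```

Because the LSTM state is carried across frames, the probability assigned to a frame depends on the frames before it, which is the temporal correlation a purely frame-wise DNN cannot use.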
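Received: 2017-09-30      Published: 2018-05-17
CLC number (ZTFLH):  TN912.34
Funding: National Natural Science Foundation of China (61371193); National Undergraduate Innovation and Entrepreneurship Training Program (201610112007)
About the author: ZHANG Xueying (1964-), female, professor. E-mail: zhangxy@tyut.edu.cn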
Cite this article:
ZHANG Xueying, NIU Puhua, GAO Fan. DNN-LSTM based VAD algorithm. Journal of Tsinghua University (Science and Technology), 2018, 58(5): 509-515.
Link to this article:
http://jst.tsinghuajournals.com/CN/10.16511/j.cnki.qhdxxb.2018.25.022  or  http://jst.tsinghuajournals.com/CN/Y2018/V58/I5/509
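  Fig. 1 LSTM model
  Fig. 2 DNN-LSTM model
  Fig. 3 DNN-LSTM+ model
  Table 1 Comparison of three VAD methods under different noise conditions
  Fig. 4 Comparison of DNN-LSTM and BDNN outputs
  Fig. 5 Output obtained with training based on context information
  Table 2 Accuracy of DNN-LSTM networks trained with the two cost functions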
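The noise conditions compared above come from adding Noisex-92 noise to TIDIGITS speech. This page does not state the SNR levels or the mixing procedure, so the following is only a sketch of the common approach: scale the noise so that the speech-to-noise power ratio matches a target SNR, then add it to the clean signal. The function name `mix_at_snr`, the synthetic signals, and the 5 dB target are all hypothetical.

```python
import math
import random

def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the mixture has the requested SNR in dB."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]  # use a noise segment of matching length
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_noise == 0.0:
        raise ValueError("noise segment is silent")
    # Solve 10*log10(p_speech / (gain^2 * p_noise)) = snr_db for gain.
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]

# Synthetic stand-ins: a 440 Hz tone sampled at 8 kHz and uniform random noise.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(800)]
noise = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
noisy = mix_at_snr(speech, noise, snr_db=5.0)
```

In practice the clean and noise signals would be read from the respective corpora, and the same clean utterance would be mixed at several SNRs to produce the different test conditions.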