Journal of Tsinghua University (Science and Technology)  2018, Vol. 58 Issue (3): 254-259    DOI: 10.16511/j.cnki.qhdxxb.2018.25.015
Computer Science and Technology
Expanding the length of short utterances for short-duration language recognition
MIAO Xiaoxiao1,2, ZHANG Jian1,2, SUO Hongbin1,2, ZHOU Ruohua1,2, YAN Yonghong1,2,3
1. Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China;
2. University of Chinese Academy of Sciences, Beijing 100190, China;
3. Xinjiang Laboratory of Minority Speech and Language Information Processing, Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Xinjiang 830011, China
Abstract: The language recognition (LR) accuracy is often significantly reduced when the test utterance duration is as short as 10 s or less. This paper describes a method to extend the utterance length using time-scale modification (TSM), which changes the speech rate without changing the spectral information. The algorithm first converts an utterance to several time-stretched or time-compressed versions using TSM. These modified versions with different speech rates are concatenated together with the original one to form a long-duration signal, which is subsequently fed into the LR system. Tests demonstrate that this duration modification method dramatically improves the performance for short utterances.
Key words: language recognition; short-duration; time-scale modification; speech rate
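The expansion procedure described in the abstract can be sketched in a few lines of Python. The fragment below is an illustration only, not the authors' code: librosa's phase-vocoder time stretching stands in for the RTISI-based TSM implementation used in the paper, the rate values 0.8 and 1.2 are arbitrary examples, and recognize_language() is a hypothetical placeholder for an existing language-recognition back-end.

# Minimal sketch of the duration-expansion idea, assuming numpy and librosa are available.
import numpy as np
import librosa

def expand_utterance(y, rates=(0.8, 1.2)):
    # Build time-scale-modified copies of the utterance: rate < 1 slows the speech
    # down (longer signal), rate > 1 speeds it up (shorter signal), while the
    # phase vocoder keeps the spectral content essentially unchanged.
    versions = [y]
    for r in rates:
        versions.append(librosa.effects.time_stretch(y, rate=r))
    # Concatenate the original and the modified versions into one long utterance.
    return np.concatenate(versions)

# Usage with a dummy 3 s signal at 8 kHz standing in for a short test utterance.
sr = 8000
y = np.random.randn(3 * sr).astype(np.float32)
y_long = expand_utterance(y, rates=(0.8, 1.2))
# score = recognize_language(y_long, sr)  # hypothetical call into the LR system

Concatenating both a slowed-down and a sped-up copy with the original is the same kind of combination evaluated in Tables 3-5.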
Received: 2017-09-26      Published: 2018-03-15
CLC number: TN912.3
Supported by the National Key Research and Development Program of China (Nos. 2016YFB0801203 and 2016YFB0801200) and the National Natural Science Foundation of China (Nos. 11590770-4, U1536117, 11504406, and 11461141004)
Corresponding author: ZHOU Ruohua, professor, E-mail: zhouruohua@hccl.ioa.ac.cn
About the author: MIAO Xiaoxiao (b. 1994), female, Ph.D. candidate.
Cite this article:
MIAO Xiaoxiao, ZHANG Jian, SUO Hongbin, ZHOU Ruohua, YAN Yonghong. Expanding the length of short utterances for short-duration language recognition. Journal of Tsinghua University (Science and Technology), 2018, 58(3): 254-259.
Link to this article:
http://jst.tsinghuajournals.com/CN/10.16511/j.cnki.qhdxxb.2018.25.015  or  http://jst.tsinghuajournals.com/CN/Y2018/V58/I3/254
Fig. 1 Illustration of the RTISI algorithm
Fig. 2 Waveform of an example utterance processed by the RTISI algorithm
Fig. 3 Spectrogram of an example utterance processed by the RTISI algorithm
Table 1 Recognition performance of the original utterances and the TSM-modified utterances (%)
Table 2 Recognition performance when the original utterance is concatenated with its TSM-modified versions (%)
Table 3 Performance of concatenating one slowed-down and one sped-up utterance (%)
Table 4 Performance of concatenating four utterances with different speech rates (%)
Table 5 Performance of concatenating six and eight utterances with different speech rates (%)
[1] LI H, MA B, LEE K. Spoken language recognition:From fundamentals to practice[J]. Proceedings of the IEEE, 2013, 101(5):1136-1159.
[2] REYNOLDS D A, QUATIERI T F, DUNN R B. Speaker verification using adapted Gaussian mixture models[J]. Digital Signal Processing, 2000, 10(1-3):19-23.
[3] DEHAK N, TORRES-CARRASQUILLO P A, REYNOLDS D A, et al. Language recognition via i-vectors and dimensionality reduction[C]//Proceedings of the 12th Annual Conference of the International Speech Communication Association. Florence, Italy:International Speech and Communication Association, 2011:857-860.
[4] CAMPBELL W M, STURIM D E, REYNOLDS D A. Support vector machines using GMM supervectors for speakers verification[J]. IEEE Signal Process Letters, 2006, 13(5):308-311.
[5] HINTON G E, SALAKHUTDINOV R R. Reducing the dimensionality of data with neural networks[J]. Science, 2006, 313(5786):504-507.
[6] YU D, SELTZER M L. Improved bottleneck features using pretrained deep neural networks[C]//Proceedings of the 12th Annual Conference of the International Speech Communication Association. Florence, Italy:International Speech and Communication Association, 2011:237-240.
[7] LEI Y, SCHEFFER N, FERRER L, et al. A novel scheme for speaker recognition using a phonetically-aware deep neural network[C]//IEEE International Conference on Acoustics, Speech and Signal Processing. Florence, Italy:IEEE, 2014:1695-1699.
[8] LOPEZ-MORENO I, GONZALEZ-DOMINGUEZ J, PLCHOT O, et al. Automatic language identification using deep neural networks[C]//IEEE International Conference on Acoustics, Speech and Signal Processing. Florence, Italy:IEEE, 2014:5337-5341.
[9] GENG W, WANG W, ZHAO Y, et al. End-to-end language identification using attention-based recurrent neural networks[C]//Proceedings of the 17th Annual Conference of the International Speech Communication Association. San Francisco, CA, USA:International Speech and Communication Association, 2016:2944-2948.
[10] YUAN J, LIBERMAN M, CIERI C. Towards an integrated understanding of speaking rate in conversation[C]//Proceedings of the 9th International Conference on Spoken Language Processing. Pittsburgh, Pennsylvania:International Speech Communication Association, 2006:541-544.
[11] GOLDWATER S, JURAFSKY D, MANNING C D. Which words are hard to recognize? prosodic, lexical, and disfluency factors that increase speech recognition error rates[J]. Speech Communication, 2010, 52(3):181-200.
[12] WANG Z Y, LI J. Speech rate adaptive algorithm for Chinese continuous speech recognition[J]. Journal of Acoustics, 2003, 28(3):229-234. (in Chinese)
[13] HEERDEN C J, BARNARD E. Speech rate normalization used to improve speaker verification[J]. SAIEE Africa Research Journal, 2006, 98(4):129-135.
[14] WANG D, NARAYANAN S S. Robust speech rate estimation for spontaneous speech[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2007, 15(8):2190-2201.
[15] NEJIME Y, ARITSUKA T, IMAMURA T, et al. A portable digital speech-rate converter for hearing impairment[J]. IEEE Transactions on Rehabilitation Engineering, 1996, 4(2):73-83.
[16] CHAMI M, IMMASSI M, MARTINO J D. An architectural comparison of signal reconstruction algorithms from short-time Fourier transform magnitude spectra[J]. International Journal of Speech Technology, 2015, 18(3):433-441.
[17] CHAMI M, MARTINO J D, PIERRON L, et al. Real-time signal reconstruction from short-time Fourier transform magnitude spectra using FPGAs[C]//Proceedings of the 5th International Conference on Information Systems and Economic Intelligence. Djerba, Tunisia, 2012.
[18] DORRAN D, LAWLOR R, COYLE E. High quality time-scale modification of speech using a peak alignment overlap-add algorithm (PAOLA)[C]//IEEE International Conference on Acoustics, Speech, and Signal Processing. Hong Kong, China:IEEE, 2003:700-703.
[19] DRIEDGER J, MULLER M, EWERT S. Improving time-scale modification of music signals using harmonic-percussive separation[J]. IEEE Signal Processing Letters, 2014, 21(1):105-109.
[20] BEAUREGARD G T, ZHU X, WYSE L. An efficient algorithm for real-time spectrogram inversion[C]//Proceedings of the 8th International Conference on Digital Audio Effects. Madrid, Spain:Universidad Politecnica de Madrid, 2005:116-121.
[21] ZHU X, BEAUREGARD G T, WYSE L L. Real-time signal estimation from modified short-time Fourier transform magnitude spectra[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2007, 15(5):1645-1653.
[22] SARKAR A K, MATROUF D, BOUSQUET P, et al. Study of the effect of i-vector modeling on short and mismatch utterance duration for speaker verification[C]//Proceedings of the 13th Annual Conference of the International Speech Communication Association. Portland, OR, USA:International Speech and Communication Association, 2012:2661-2664.
[23] WANG M G, SONG Y, JIANG B, et al. Exemplar based language recognition method for short-duration speech segments[C]//IEEE International Conference on Acoustics, Speech and Signal Processing. Vancouver, Canada:IEEE, 2013:7354-7358.
[24] CUMANI S, PLCHOT O, FÉR R. Exploiting i-vector posterior covariances for short-duration language recognition[C]//Proceedings of the 16th Annual Conference of the International Speech Communication Association. Dresden, Germany:International Speech Communication Association, 2015:1002-1006.
[25] LOZANO-DIEZ A, ZAZO-CANDIL R, GONZALEZ-DOMINGUEZ J, et al. An end-to-end approach to language identification in short utterances using convolutional neural networks[C]//Proceedings of the 16th Annual Conference of the International Speech Communication Association. Dresden, Germany:International Speech Communication Association, 2015:403-407.
[26] TORRES-CARRASQUILLO P A, SINGER E, KOHLER M A, et al. Approaches to language identification using Gaussian mixture models and shifted delta cepstral features[C]//Proceedings of the 7th International Conference on Spoken Language Processing. Denver, Colorado, USA:International Speech Communication Association, 2002:89-92.
[27] CallFriend Corpus[S]. Linguistic Data Consortium, 1996. http://www.ldc.upenn/ldc/about/callfriend.html.
[28] MARTIN A F, LE A N. NIST 2007 language recognition evaluation[C]//Odyssey 2008:The Speaker and Language Recognition Workshop. Stellenbosch, South Africa:IEEE, 2008:16.
[29] WANG X L, WU Z G, YANG J C, et al. A language recognition method based on SVM one-to-one classification[J]. Journal of Tsinghua University (Science and Technology), 2013, 53(6):808-812. (in Chinese)
[1] ZHANG Xueying, NIU Puhua, GAO Fan. VAD algorithm based on DNN-LSTM[J]. Journal of Tsinghua University (Science and Technology), 2018, 58(5): 509-515.
[2] Askar Rouzi, WANG Dong, LI Lantian, ZHENG Fang, ZHANG Xiaodong, JIN Panshi. Score-domain speech rate normalization for speaker recognition[J]. Journal of Tsinghua University (Science and Technology), 2018, 58(4): 337-341.
[3] ZHANG Yu, ZHANG Pengyuan, YAN Yonghong. Far-field speech recognition based on attention LSTM and multi-task learning[J]. Journal of Tsinghua University (Science and Technology), 2018, 58(3): 249-253.
[4] LI Yinghao, KONG Jiangping. Effect of speech rate on segment production in Mandarin Chinese[J]. Journal of Tsinghua University (Science and Technology), 2017, 57(9): 963-969.
[5] CAO Honglin, WANG Yujing, LI Jingyang. Effect of speech rate on the dynamic characteristics of triphthong formants[J]. Journal of Tsinghua University (Science and Technology), 2017, 57(9): 958-962.
[6] YANG Shan, FAN Bo, XIE Lei, WANG Lijuan, SOONG Frank K. Speech-driven realistic facial animation synthesis based on BLSTM-RNN[J]. Journal of Tsinghua University (Science and Technology), 2017, 57(3): 250-256.
[7] ZHANG Jian, XU Jie, BAO Xiuguo, ZHOU Ruohua, YAN Yonghong. Weighted phone log-likelihood ratio features for language recognition[J]. Journal of Tsinghua University (Science and Technology), 2017, 57(10): 1038-1041, 1047.