Journal of Tsinghua University(Science and Technology)    2018, Vol. 58 Issue (3) : 254-259     DOI: 10.16511/j.cnki.qhdxxb.2018.25.015
COMPUTER SCIENCE AND TECHNOLOGY
Expanding the length of short utterances for short-duration language recognition
MIAO Xiaoxiao1,2, ZHANG Jian1,2, SUO Hongbin1,2, ZHOU Ruohua1,2, YAN Yonghong1,2,3
1. Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China;
2. University of Chinese Academy of Sciences, Beijing 100190, China;
3. Xinjiang Laboratory of Minority Speech and Language Information Processing, Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Xinjiang 830011, China
Abstract  Language recognition (LR) accuracy often drops significantly when the test utterance is 10 s or shorter. This paper describes a method that extends the utterance length using time-scale modification (TSM), which changes the speech rate without changing the spectral information. The algorithm first uses TSM to convert an utterance into several time-stretched or time-compressed versions. These versions, each with a different speech rate, are then concatenated with the original utterance to form a long-duration signal, which is fed into the LR system. Tests show that this duration modification method substantially improves LR performance on short utterances.
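The expansion step described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the abstract does not say which TSM algorithm or which speech rates were used, so the plain overlap-add (OLA) routine `ola_tsm`, the helper `expand_utterance`, the frame length of 256 samples, and the rates 0.8 and 1.2 are all assumptions made here. Plain OLA is the simplest member of the TSM family and preserves spectral content only approximately; a waveform-similarity or phase-aware method would likely be closer to what the paper needs.

```python
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def ola_tsm(signal, rate, frame_len=256):
    """Plain overlap-add TSM (illustrative stand-in, not the paper's method).

    rate > 1 compresses the signal (faster speech),
    rate < 1 stretches it (slower speech).
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    syn_hop = frame_len // 2                  # fixed hop in the output
    ana_hop = max(1, round(syn_hop * rate))   # hop in the input depends on rate
    window = hann(frame_len)
    n_frames = max(1, (len(signal) - frame_len) // ana_hop + 1)
    out_len = (n_frames - 1) * syn_hop + frame_len
    out = [0.0] * out_len
    norm = [0.0] * out_len
    for k in range(n_frames):
        a = k * ana_hop   # where the frame is read from
        s = k * syn_hop   # where the frame is written to
        for i in range(frame_len):
            idx = a + i
            sample = signal[idx] if idx < len(signal) else 0.0
            out[s + i] += window[i] * sample
            norm[s + i] += window[i]
    # Divide by the summed window so overlapping frames do not change the gain.
    return [o / w if w > 1e-8 else 0.0 for o, w in zip(out, norm)]


def expand_utterance(signal, rates=(0.8, 1.2)):
    """Concatenate the original utterance with its rate-modified versions.

    The default rates are hypothetical; the abstract does not list them.
    """
    expanded = list(signal)
    for r in rates:
        expanded.extend(ola_tsm(signal, r))
    return expanded


# Usage: a 1 s, 200 Hz tone sampled at 8 kHz stands in for a short utterance.
fs = 8000
utterance = [math.sin(2.0 * math.pi * 200.0 * n / fs) for n in range(fs)]
long_signal = expand_utterance(utterance)
print(len(utterance), len(long_signal))
```

The expanded signal is roughly as long as the original plus its duration divided by each rate, so with the assumed rates a 1 s utterance becomes about 3 s of input for the LR front end.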
Keywords: language recognition; short-duration; time-scale modification; speech rate
ZTFLH:  TN912.3  
Issue Date: 15 March 2018
Cite this article:   
MIAO Xiaoxiao, ZHANG Jian, SUO Hongbin, et al. Expanding the length of short utterances for short-duration language recognition[J]. Journal of Tsinghua University (Science and Technology), 2018, 58(3): 254-259.
URL:  
http://jst.tsinghuajournals.com/EN/10.16511/j.cnki.qhdxxb.2018.25.015     OR     http://jst.tsinghuajournals.com/EN/Y2018/V58/I3/254
Copyright © Journal of Tsinghua University(Science and Technology), All Rights Reserved.