Expanding the length of short utterances for short-duration language recognition
MIAO Xiaoxiao1,2, ZHANG Jian1,2, SUO Hongbin1,2, ZHOU Ruohua1,2, YAN Yonghong1,2,3
1. Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China; 2. University of Chinese Academy of Sciences, Beijing 100190, China; 3. Xinjiang Laboratory of Minority Speech and Language Information Processing, Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Xinjiang 830011, China
Abstract: Language recognition (LR) accuracy often drops significantly when the test utterance is 10 s or shorter. This paper describes a method that extends the utterance length using time-scale modification (TSM), which changes the speech rate without changing the spectral information. The algorithm first converts an utterance into several time-stretched or time-compressed versions using TSM. These modified versions, each with a different speech rate, are then concatenated with the original utterance to form a long-duration signal, which is fed into the LR system. Tests demonstrate that this duration modification method substantially improves LR performance on short utterances.
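The pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not say which TSM algorithm the paper uses, so the code below substitutes a plain overlap-add (OLA) scheme as a stand-in. The function names `ola_time_stretch` and `expand_utterance`, the frame and hop sizes, the rate values, and the concatenation order are all assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def ola_time_stretch(x, rate, frame_len=400, synth_hop=100):
    """Plain overlap-add TSM (a stand-in, not necessarily the paper's method).

    rate > 1 shortens the signal (faster speech);
    rate < 1 lengthens it (slower speech).
    """
    x = np.asarray(x, dtype=float)
    if len(x) < frame_len:
        raise ValueError("signal is shorter than one analysis frame")

    # Frames are read every ana_hop samples and written every synth_hop samples.
    ana_hop = max(1, int(round(synth_hop * rate)))
    n_frames = 1 + (len(x) - frame_len) // ana_hop
    window = np.hanning(frame_len)

    out_len = (n_frames - 1) * synth_hop + frame_len
    y = np.zeros(out_len)
    norm = np.zeros(out_len)
    for i in range(n_frames):
        a = i * ana_hop      # read position in the input
        s = i * synth_hop    # write position in the output
        y[s:s + frame_len] += window * x[a:a + frame_len]
        norm[s:s + frame_len] += window

    # Divide by the summed window so overlapping frames keep unit gain.
    return y / np.maximum(norm, 1e-8)


def expand_utterance(x, rates=(0.8, 1.2)):
    """Concatenate the original utterance with its rate-modified versions.

    The rate set and the ordering (original first) are assumptions.
    """
    x = np.asarray(x, dtype=float)
    parts = [x] + [ola_time_stretch(x, r) for r in rates]
    return np.concatenate(parts)


if __name__ == "__main__":
    sr = 8000                                   # assumed telephone-band sample rate
    t = np.arange(sr) / sr                      # 1 s synthetic test tone
    utterance = np.sin(2 * np.pi * 220.0 * t)
    expanded = expand_utterance(utterance)
    print(len(utterance) / sr, "s ->", len(expanded) / sr, "s")
```

Plain OLA is the simplest TSM variant and can introduce audible artifacts on voiced speech; waveform-similarity or phase-vocoder methods are the usual choices when output quality matters. The expanded signal would then go through the same feature extraction as any other test utterance.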
MIAO Xiaoxiao, ZHANG Jian, SUO Hongbin, ZHOU Ruohua, YAN Yonghong. Expanding the length of short utterances for short-duration language recognition[J]. Journal of Tsinghua University (Science and Technology), 2018, 58(3): 254-259. (in Chinese)