COMPUTER SCIENCE AND TECHNOLOGY

Expanding the length of short utterances for short-duration language recognition
MIAO Xiaoxiao1,2, ZHANG Jian1,2, SUO Hongbin1,2, ZHOU Ruohua1,2, YAN Yonghong1,2,3
1. Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China
2. University of Chinese Academy of Sciences, Beijing 100190, China
3. Xinjiang Laboratory of Minority Speech and Language Information Processing, Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Xinjiang 830011, China

Abstract
Language recognition (LR) accuracy often drops significantly when the test utterance is 10 s or shorter. This paper describes a method that extends the utterance length using time-scale modification (TSM), which changes the speech rate without changing the spectral information. The algorithm first uses TSM to convert an utterance into several time-stretched or time-compressed versions. These versions, each with a different speech rate, are concatenated with the original utterance to form a long-duration signal, which is then fed into the LR system. Experiments show that this duration-modification method substantially improves LR performance on short utterances.
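The expansion step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say which TSM algorithm, rate factors, or concatenation order were used, so the naive overlap-add (OLA) stretcher, the function names `ola_time_stretch` and `expand_utterance`, the rates 0.8 and 1.2, and the frame parameters below are all assumptions chosen for illustration.

```python
import numpy as np


def ola_time_stretch(signal, rate, frame_len=400, synth_hop=100):
    """Naive overlap-add TSM (illustrative stand-in, not the paper's algorithm).

    rate > 1 compresses the signal (faster speech); rate < 1 stretches it.
    Frames are read every synth_hop * rate samples and written every
    synth_hop samples, so the output is roughly len(signal) / rate long.
    """
    signal = np.asarray(signal, dtype=np.float64)
    if rate <= 0:
        raise ValueError("rate must be positive")
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")

    window = np.hanning(frame_len)
    analysis_hop = synth_hop * rate
    n_frames = int(np.floor((len(signal) - frame_len) / analysis_hop)) + 1
    out_len = (n_frames - 1) * synth_hop + frame_len

    out = np.zeros(out_len)
    norm = np.zeros(out_len)
    for i in range(n_frames):
        a = int(round(i * analysis_hop))  # read position in the input
        s = i * synth_hop                 # write position in the output
        out[s:s + frame_len] += signal[a:a + frame_len] * window
        norm[s:s + frame_len] += window

    # Avoid division by zero where the window sum vanishes (the edges).
    norm[norm < 1e-8] = 1.0
    return out / norm


def expand_utterance(signal, rates=(0.8, 1.2)):
    """Concatenate the original signal with its time-scaled versions.

    The rates and the ordering (original first) are assumptions.
    """
    original = np.asarray(signal, dtype=np.float64)
    parts = [original] + [ola_time_stretch(original, r) for r in rates]
    return np.concatenate(parts)
```

Calling `expand_utterance(x)` on a one-dimensional waveform `x` returns a longer signal made of the original followed by a slowed-down and a sped-up copy. Plain OLA tends to introduce phase discontinuities in voiced speech, so a waveform-similarity method such as WSOLA would likely be preferred in practice; the sketch only shows the data flow of stretching, compressing, and concatenating.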
Keywords
language recognition
short-duration
time-scale modification
speech rate

Issue Date: 15 March 2018