应用于短时语音语种识别的时长扩展方法

苗晓晓, 张健, 索宏彬, 周若华, 颜永红

清华大学学报(自然科学版) ›› 2018, Vol. 58 ›› Issue (3) : 254-259.

PDF(3208 KB)
PDF(3208 KB)
清华大学学报(自然科学版) ›› 2018, Vol. 58 ›› Issue (3) : 254-259. DOI: 10.16511/j.cnki.qhdxxb.2018.25.015
计算机科学与技术

应用于短时语音语种识别的时长扩展方法

  • 苗晓晓1,2, 张健1,2, 索宏彬1,2, 周若华1,2, 颜永红1,2,3
作者信息 +

Expanding the length of short utterances for short-duration language recognition

  • MIAO Xiaoxiao1,2, ZHANG Jian1,2, SUO Hongbin1,2, ZHOU Ruohua1,2, YAN Yonghong1,2,3
Author information +
文章历史 +

摘要

为解决待识别语音时长小于10 s时,语种识别性能急剧下降的问题,该文提出应用语音时域伸缩(time-scale modification,TSM)技术改变语音的长度(从而改变了语速),并保持其他频域信息不变。首先,对一段待识别语音,应用TSM技术转换为多条时域压缩和时域拉伸后的语音;其次,将这些不同语速的语音与原语音拼接起来,生成一个时长较长的语音;最后,送入语种识别系统进行识别。实验结果表明:所提出的语音时长扩展算法可以显著提升短时语音的语种识别性能。

Abstract

The language recognition (LR) accuracy is often significantly reduced when the test utterance duration is as short as 10 s or less. This paper describes a method to extend the utterance length using time-scale modification (TSM) which changes the speech rate without changing the spectral information. The algorithm first converts an utterance to several time-stretched or time-compressed versions using TSM. These modified versions with different speech rates are concatenated together with the original one to form a long-duration signal, which is subsequently fed into the LR system. Tests demonstrate that this duration modification method dramatically improves the performance for short utterances.

关键词

语种识别 / 短时 / 时域伸缩 / 语速

Key words

language recognition / short-duration / time-scale modification / speech rate

引用本文

导出引用
苗晓晓, 张健, 索宏彬, 周若华, 颜永红. 应用于短时语音语种识别的时长扩展方法[J]. 清华大学学报(自然科学版). 2018, 58(3): 254-259 https://doi.org/10.16511/j.cnki.qhdxxb.2018.25.015
MIAO Xiaoxiao, ZHANG Jian, SUO Hongbin, ZHOU Ruohua, YAN Yonghong. Expanding the length of short utterances for short-duration language recognition[J]. Journal of Tsinghua University(Science and Technology). 2018, 58(3): 254-259 https://doi.org/10.16511/j.cnki.qhdxxb.2018.25.015
中图分类号: TN912.3   

参考文献

[1] LI H, MA B, LEE K. Spoken language recognition:From fundamentals to practice[J]. Proceedings of the IEEE, 2013, 101(5):1136-1159.
[2] REYNOLDS D A, QUATIERI T F, DUNN R B. Speaker verification using adapted Gaussian mixture models[J]. Digital Signal Process, 2000, 10(1-3):19-23.
[3] DEHAK N, TORRES-CARRASQUILLO P A, REYNOLDS D A, et al. Language recognition via i-vectors and dimensionality reduction[C]//Proceedings of the 12th Annual Conference of the International Speech Communication Association. Florence, Italy:International Speech and Communication Association, 2011:857-860.
[4] CAMPBELL W M, STURIM D E, REYNOLDS D A. Support vector machines using GMM supervectors for speakers verification[J]. IEEE Signal Process Letters, 2006, 13(5):308-311.
[5] HINTON G E, SALAKHUTDINOV R R. Reducing the dimensionality of data with neural networks[J]. Science, 2006, 313(5786):504-507.
[6] YU D, SELTZER M L. Improved bottleneck features using pretrained deep neural networks[C]//Proceedings of the 12th Annual Conference of the International Speech Communication Association. Florence, Italy:International Speech and Communication Association, 2011:237-240.
[7] LEI Y, SCHEFFER N, FERRER L, et al. A novel scheme for speaker recognition using a phonetically-aware deep neural network[C]//IEEE International Conference on Acoustics, Speech and Signal Processing. Florence, Italy:IEEE, 2014:1695-1699.
[8] LOPEZ-MORENO I, GONZALEZ-DOMINGUEZ J, PLCHOT O, et al. Automatic language identification using deep neural networks[C]//IEEE International Conference on Acoustics, Speech and Signal Processing. Florence, Italy:IEEE, 2014:5337-5341.
[9] GENG W, WANG W, ZHAO Y, et al. End-to-end language identification using attention-based recurrent neural networks[C]//Proceedings of the 17th Annual Conference of the International Speech Communication Association. San Francisco, CA, USA:International Speech and Communication Association, 2016:2944-2948.
[10] YUAN J, LIBERMAN M, CIERI C. Towards an integrated understanding of speaking rate in conversation[C]//Proceedings of the 9th International Conference on Spoken Language Processing. Pittsburgh, Pennsylvania:International Speech Communication Association, 2006:541-544.
[11] GOLDWATER S, JURAFSKY D, MANNING C D. Which words are hard to recognize? prosodic, lexical, and disfluency factors that increase speech recognition error rates[J]. Speech Communication, 2010, 52(3):181-200.
[12] 王作英, 李健. 汉语连续语音识别的语速自适应算法[J]. 声学学报, 2003, 28(3):229-234.WANG Z Y, LI J. Speech rate adaptive algorithm for Chinese contin uous speech recognition[J]. Journal of Acoustics, 2003, 28(3):229-234. (in Chinese)
[13] HEERDEN C J, BARNARD E. Speech rate normalization used to improve speaker verification[J]. SAIEE Africa Research Journal, 2006, 98(4):129-135.
[14] WANG D, NARAYANAN S S. Robust speech rate estimation for spontaneous speech[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2007, 15(8):2190-2201.
[15] NEJIME Y, ARITSUKA T, IMAMURA T, et al. A portable digital speech-rate converter for hearing impairment[J]. IEEE Transactions on Rehabilitation Engineering, 1996, 4(2):73-83.
[16] CHAMI M, IMMASSI M, MARTINO J D. An architectural comparison of signal reconstruction algorithms from short-time Fourier transform magnitude spectra[J]. International Journal of Speech Technology, 2015, 18(3):433-441.
[17] CHAMI M, MARTINO J D, PIERRON L, et al. Real-time signal reconstruction from short-time Fourier transform magnitude spectra using FPGAs[C]//Proceedings of the 5th International Conference on Information Systems and Economic Intelligence. Djerba, Tunisia, 2012.
[18] DORRAN D, LAWLOR R, COYLE E. High quality time-scale modification of speech using a peak alignment overlap-add algorithm (PAOLA)[C]//IEEE International Conference on Acoustics, Speech, and Signal Processing. Hong Kong, China:IEEE, 2003:700-703
[19] DRIEDGER J, MULLER M, EWERT S. Improving time-scale modification of music signals using harmonic-percussive separation[J]. IEEE Signal Processing Letters, 2014, 21(1):105-109.
[20] BEAUREGARD G T, ZHU X, WYSE L. An efficient algorithm for real-time spectrogram inversion[C]//Proceedings of the 8th International Conference on Digital Audio Effects. Madrid, Spain:Universidad Politecnica de Madrid, 2005:116-121.
[21] ZHU X, BEAUREGARD G T, WYSE L L. Real-time signal estimation from modified short-time Fourier transform magnitude spectra[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2007, 15(5):1645-1653.
[22] SARKAR A K, MATROUF D, BOUSQUET P, et al. Study of the effect of i-vector modeling on short and mismatch utterance duration for speaker verification[C]//Proceedings of the 13th Annual Conference of the International Speech Communication Association. Portland, OR, USA:International Speech and Communication Association, 2012:2661-2664.
[23] WANG M G, SONG Y, JIANG B, et al. Exemplar based language recognition method for short-duration speech segments[C]//IEEE International Conference on Acoustics, Speech and Signal Processing. Vancouver, Canada:IEEE, 2013:7354-7358.
[24] CUMANI S, PLCHOT O, F'ER R. Exploiting i-vector posterior covariances for short-duration language recognition[C]//Proceedings of the 16th Annual Conference of the International Speech Communication Association. Dresden, Germany:International Speech and Communication Association, 2015:1002-1006.
[25] LOZANO-DIEZ A, ZAZO-CANDⅡ R, GONZALEZ-DOMINGUEZ J, et al. An end-to-end approach to language identification in short utterances using convolutional neural networks[C]//Proceedings of the 16th Annual Conference of the International Speech Communication Association. Dresden, Germany:International Speech and Communication Association, 2015:403-407.
[26] TORRES-CARRASQUILLO P A, SINGER E, KOHLERR M A, et al. Approaches to language identification using Gaussian mixture models and shifted delta cepstral features[C]//Proceedings of the 7th International Conference on Spoken Language Processing. Denver, Colorado, USA:International Speech Communication Association, 2002:89-92.
[27] CallFriend Corpus. Linguistic data consortium[S]. (1996) http://www.ldc.upenn/ldc/about/callfriend.html.
[28] MARTIN A F, LE A N. NIST 2007 language recognition evaluation[C]//Odyssey 2008:The Speaker and Language Recognition Workshop. Stellenbosch, South Africa:IEEE, 2008:16.
[29] 王宪亮, 吴志刚, 杨金超, 等. 基于SVM一对一分类的语种识别方法[J]. 清华大学学报(自然科学版), 2013, 53(6):808-812.WANG X L, WU Z G, YANG J C, et al. A language recognition method based on SVM one to one classification[J]. Journal of Tsinghua University (Science and Technology), 2013, 53(6):808-812. (in Chinese)

基金

国家重点研发计划重点专项(2016YFB0801203,2016YFB0801200);国家自然科学基金资助项目(11590770-4,U1536117,11504406,11461141004)

PDF(3208 KB)

Accesses

Citation

Detail

段落导航
相关文章

/