Journal of Tsinghua University (Science and Technology), 2018, Vol. 58, Issue (4): 337-341    DOI: 10.16511/j.cnki.qhdxxb.2018.25.028
Computer Science and Technology
Score domain speaking rate normalization for speaker recognition
AISIKAER Rouzi1, WANG Dong1, LI Lantian1, ZHENG Fang1, ZHANG Xiaodong2, JIN Panshi2
1. Center for Speech and Language Technologies, Division of Technical Innovation and Development, Tsinghua National Laboratory for Information Science and Technology; Center for Speech and Language Technologies, Research Institute of Information Technology, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China;
2. Information Technology Management Department, China Construction Bank, Beijing 100000, China
Abstract: Speaking rate variations seriously degrade speaker recognition accuracy. This paper presents a score-domain normalization approach that reduces the impact of speaking rate variations. A cohort set of reference utterances recorded at different speaking rates is constructed, and for each enrolled speaker the score distribution against each class of reference speech in the cohort is estimated; the local cohort set is obtained by splitting the utterances in the global cohort set according to their relative speaking rates. The scores of the test speech are then normalized within a GMM-UBM (Gaussian mixture model-universal background model) framework on a self-recorded speaking rate database, with the data sparsity problem handled by augmenting the training data, giving a final relative EER (equal error rate) reduction of 33.33%. This study shows that both global and local score normalization effectively reduce the impact of speaking rate variations on speaker recognition.
Key words: speaker recognition; score domain; speaking rate normalization; relative speaking rate; GMM-UBM
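The cohort-based score normalization described in the abstract can be pictured with a short sketch. The code below is only an illustration of the general idea, not the paper's implementation: the function names, the ±band rule for selecting a rate-matched local cohort, and the fallback to the global cohort are assumptions introduced here for clarity.

```python
import numpy as np

def normalize_score(raw_score, cohort_scores):
    # Shift and scale the raw verification score by the statistics of the
    # enrolled speaker's scores against a cohort of reference utterances.
    scores = np.asarray(cohort_scores, dtype=float)
    mu, sigma = scores.mean(), scores.std()
    return (raw_score - mu) / max(sigma, 1e-8)

def select_local_cohort(cohort_scores, cohort_rates, test_rate, band=0.1):
    # Keep only the cohort utterances whose relative speaking rate lies within
    # +/- band of the test utterance's rate (an illustrative selection rule;
    # the paper splits the global cohort by relative speaking rate).
    scores = np.asarray(cohort_scores, dtype=float)
    rates = np.asarray(cohort_rates, dtype=float)
    mask = np.abs(rates - test_rate) <= band
    # Fall back to the global cohort when the local subset is empty (data sparsity).
    return scores[mask] if mask.any() else scores

# Illustrative usage with hypothetical numbers for one verification trial.
raw = 2.3                                   # raw GMM-UBM log-likelihood-ratio score
cohort = [1.1, 0.7, 1.8, 0.2, 1.4, 0.9]     # enrolled speaker scored against cohort utterances
rates = [0.8, 0.9, 1.0, 1.1, 1.2, 1.0]      # relative speaking rates of those utterances
print(normalize_score(raw, cohort))                                    # global normalization
print(normalize_score(raw, select_local_cohort(cohort, rates, 1.05)))  # local normalization
```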
Received: 2017-09-29      Published: 2018-04-15
CLC number: TP391.4
Supported by the National Natural Science Foundation of China (Nos. 61271389 and 61371136) and the National Basic Research Program of China ("973" Program) (No. 2013CB329302)
Corresponding author: ZHENG Fang, professor, E-mail: zheng@tsinghua.edu.cn
About the first author: AISIKAER Rouzi (1978-), male, Ph.D. candidate.
Cite this article:
AISIKAER Rouzi, WANG Dong, LI Lantian, ZHENG Fang, ZHANG Xiaodong, JIN Panshi. Score domain speaking rate normalization for speaker recognition. Journal of Tsinghua University(Science and Technology), 2018, 58(4): 337-341.
Link to this article:
http://jst.tsinghuajournals.com/CN/10.16511/j.cnki.qhdxxb.2018.25.028  or  http://jst.tsinghuajournals.com/CN/Y2018/V58/I4/337
Figure 1 Comparison of acoustic features of speech at different speaking rates
Figure 2 Relationship between speaking rate and equal error rate
Table 1 Recognition results of the baseline system
Table 2 Comparison of the speaking rate normalization methods with the baseline system
Table 3 Normalization results based on the extended cohort set
[1] CAMPBELL W M, CAMPBELL J P, REYNOLDS D A, et al. Support vector machines for speaker and language recognition[J]. Computer Speech & Language, 2006, 20(2):210-229.
[2] BIMBOT F, BONASTRE J F, FREDOUILLE C. A tutorial on text-independent speaker verification[J]. EURASIP Journal on Applied Signal Processing, 2004(1):430-451.
[3] CHU M S, POVEY D. Speaking rate adaptation using continuous frame rate normalization[C]//Proceedings of the International Conference on Acoustics, Speech and Signal Processing. Dallas, TX, USA:IEEE, 2010:4306-4309.
[4] XU M X, ZHANG L P, WANG L L. Database collection for study on speech variation robust speaker recognition[C]//Proceedings of the Oriental Chapter of the International Committee for the Co-ordination and Standardization of Speech Databases and Assessment Techniques. Kyoto, Japan:IEEE, 2008.
[5] MARCO G, CUMMINS F. Speech style and speaker recognition:A case study[C]//Proceedings of the Interspeech. Brighton, UK:IEEE, 2009.
[6] ASKAR R, LI L T, WANG D, et al. Feature transformation for speaker verification under speaking rate mismatch condition[C]//Proceedings of the Asia-Pacific Signal and Information Processing Association. Jeju, Korea:IEEE, 2016.
[7] VAN HEERDEN C J, BARNARD E. Speech rate normalization used to improve speaker verification[J]. SAIEE Africa Research Journal, 2007, 98(4):129-135.
[8] BEIGI H. Fundamentals of speaker recognition[M]. New York, USA:Springer, 2011.
[9] VAN DER MAATEN L, HINTON G. Visualizing data using t-SNE[J]. Journal of Machine Learning Research, 2008, 9:2579-2605.
[10] VAN DER MAATEN L, HINTON G. Visualizing non-metric similarities in multiple maps[J]. Machine Learning, 2012, 87(1):33-55.
[11] CUMMINS F, GRIMALDI M, LEONARD T, et al. The CHAINS corpus:Characterizing individual speakers[C]//Proceedings of the International Conference on Speech and Computer (SPECOM). St. Petersburg, Russia:Springer, 2006:431-435.
[12] POVEY D, GHOSHAL A, BOULIANNE G, et al. The KALDI speech recognition toolkit[C]//Proceedings of the Automatic Speech Recognition and Understanding (ASRU). Hawaii, HI, USA:IEEE, 2011.