Journal of Tsinghua University (Science and Technology)  2018, Vol. 58 Issue (1): 61-66,74    DOI: 10.16511/j.cnki.qhdxxb.2018.21.003
Automation
Automatic prosodic boundary labeling based on fusing the silence duration with the lexical features
FU Ruibo1,2, TAO Jianhua1,2,3, LI Ya1, WEN Zhengqi1
1. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China;
2. School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100190, China;
3. CAS Center for Excellence in Brain Science and Intelligence Technology, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
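Abstract (translated from the Chinese original): Prosodic boundary labeling is essential for corpus construction and speech synthesis, and automatic prosodic labeling avoids the time cost and inconsistency of manual labeling. Following the manual labeling workflow, this paper uses recurrent neural networks to train separate sub-models for the text and audio channels, and applies model fusion to the sub-model outputs to obtain the optimal labels. Silence duration is extracted per word; compared with traditional frame-level acoustic features, it has a clearer physical meaning and is more closely tied to prosodic boundaries. Experimental results show that, relative to traditional acoustic features, the silence duration feature improves automatic prosodic labeling performance, and that, relative to direct feature-level fusion, decision fusion combines the acoustic and text features more effectively and further improves labeling performance.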
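Keywords (translated from the Chinese original): prosodic boundary labeling; decision fusion; silence duration; corpus construction; speech synthesis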
Abstract: Automatic prosodic boundary labeling is important in the construction of speech corpora for speech synthesis. It gives more consistent results than manual labeling, which is time-consuming and inconsistent. The manual labeling process is modeled here with recurrent neural networks: two sub-models are trained, one on lexical features and one on acoustic features, to label the prosodic boundaries. Model fusion then combines the outputs of the two sub-models to obtain the optimal labeling results. Silence durations extracted for each word have a clearer physical meaning and correlate better with prosodic boundaries than the frame-by-frame acoustic features used in traditional methods. Tests show that the silence duration feature improves prosodic boundary labeling over traditional acoustic features, and that model fusion improves it further over feature-level fusion.
Key words: prosodic boundary labeling; ensemble strategy; silence duration; corpus construction; speech synthesis
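The abstract describes combining the outputs of a text sub-model and an audio sub-model at the decision level. A minimal sketch of one such rule is given here, assuming each sub-model emits a posterior distribution over boundary classes after every word. The linear interpolation, the function names `fuse_posteriors` and `label_boundaries`, the `text_weight` parameter, and the three-class label set are all illustrative assumptions; the abstract does not specify the fusion rule actually used.

```python
from typing import List, Sequence


def fuse_posteriors(
    text_probs: Sequence[Sequence[float]],
    audio_probs: Sequence[Sequence[float]],
    text_weight: float = 0.5,
) -> List[List[float]]:
    """Interpolate per-word boundary posteriors from two sub-models.

    Hypothetical helper: linear interpolation is one common decision-level
    fusion rule, not necessarily the one used in the paper.
    """
    if len(text_probs) != len(audio_probs):
        raise ValueError("both sub-models must score the same number of words")
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must lie in [0, 1]")

    fused = []
    for p_text, p_audio in zip(text_probs, audio_probs):
        if len(p_text) != len(p_audio):
            raise ValueError("both sub-models must use the same label set")
        mixed = [text_weight * t + (1.0 - text_weight) * a
                 for t, a in zip(p_text, p_audio)]
        total = sum(mixed)
        if total <= 0.0:
            raise ValueError("posteriors must have positive mass")
        fused.append([m / total for m in mixed])  # renormalise to sum to 1
    return fused


def label_boundaries(fused: Sequence[Sequence[float]]) -> List[int]:
    """Pick the most probable boundary class after each word."""
    return [max(range(len(p)), key=lambda k: p[k]) for p in fused]


# Made-up posteriors for two words over three hypothetical classes:
# 0 = no boundary, 1 = prosodic word, 2 = prosodic phrase.
text_probs = [[0.7, 0.2, 0.1], [0.4, 0.5, 0.1]]
audio_probs = [[0.6, 0.3, 0.1], [0.1, 0.2, 0.7]]
print(label_boundaries(fuse_posteriors(text_probs, audio_probs)))  # prints [0, 2]
```

In this toy input the text sub-model alone would label the second word as class 1, while the audio evidence moves the fused decision to class 2. In practice `text_weight` would be tuned on held-out data.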
Received: 2017-09-29      Published: 2018-01-15
CLC number: H116.4; TP181
Corresponding author: TAO Jianhua, Research Professor, E-mail: jhtao@nlpr.ia.ac
Cite this article:
FU Ruibo, TAO Jianhua, LI Ya, WEN Zhengqi. Automatic prosodic boundary labeling based on fusing the silence duration with the lexical features. Journal of Tsinghua University (Science and Technology), 2018, 58(1): 61-66,74.
Link to this article:
http://jst.tsinghuajournals.com/CN/10.16511/j.cnki.qhdxxb.2018.21.003  or  http://jst.tsinghuajournals.com/CN/Y2018/V58/I1/61
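  Fig. 1 Overall system framework
  Fig. 2 Analogy between the stages of prosodic labeling
  Fig. 3 Overall silence duration feature extraction pipeline
  Table 1 Evaluation of each feature with a one-dimensional linear classifier (prosodic phrase)
  Table 2 Hyperparameter settings used in the experiments
  Table 3 F-score evaluation of automatic prosodic labeling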
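The word-level silence duration feature described in the abstract could be computed along the following lines. This is only a sketch under stated assumptions: it supposes a forced alignment is already available as word-level start and end times in seconds, and it defines the feature as the gap between the end of one word and the start of the next. The `AlignedWord` type, the `silence_after_words` function, and the handling of the final word are hypothetical, and the extraction pipeline in the paper may differ.

```python
from typing import List, NamedTuple, Sequence


class AlignedWord(NamedTuple):
    """One word from a forced alignment (hypothetical structure)."""
    word: str
    start: float  # seconds
    end: float    # seconds


def silence_after_words(
    words: Sequence[AlignedWord],
    utterance_end: float,
) -> List[float]:
    """Return the silence duration (seconds) following each word.

    Assumption: silence after word i is the gap up to the start of word
    i + 1; for the last word it is the gap up to the end of the utterance.
    Negative gaps from overlapping alignments are clipped to zero.
    """
    durations = []
    for i, current in enumerate(words):
        if i + 1 < len(words):
            next_start = words[i + 1].start
        else:
            next_start = utterance_end
        durations.append(max(0.0, next_start - current.end))
    return durations


# Made-up alignment for a three-word utterance lasting 1.80 s.
alignment = [
    AlignedWord("jintian", 0.10, 0.45),
    AlignedWord("tianqi", 0.45, 0.80),
    AlignedWord("henhao", 1.05, 1.50),
]
for item, gap in zip(alignment, silence_after_words(alignment, 1.80)):
    print(f"{item.word}: {gap:.2f} s")
```

Unlike frame-level acoustic features, this yields one value per word, so it lines up directly with the word-level positions where prosodic boundaries are labeled.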