Automatic prosodic boundary labeling based on fusing the silence duration with the lexical features
FU Ruibo1,2, TAO Jianhua1,2,3, LI Ya1, WEN Zhengqi1
1. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China;
2. School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100190, China;
3. CAS Center for Excellence in Brain Science and Intelligence Technology, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
Abstract: Automatic prosodic boundary labeling is important for constructing speech corpora for speech synthesis. Manual labeling of prosodic boundaries is time consuming and inconsistent, whereas automatic labeling gives more consistent results. The manual labeling process is modelled here with recurrent neural networks: two sub-models are trained, one on lexical features and one on acoustic features, to label the prosodic boundaries. Model fusion then combines the outputs of the two sub-models to obtain the optimal labeling results. The silence duration of each word has a clearer physical meaning and correlates better with the prosodic boundaries than the acoustic features extracted frame-by-frame in traditional methods. Tests show that the silence duration features used here as acoustic features, together with the model fusion method, improve prosodic boundary labeling compared with previous feature fusion methods.
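The two ideas in the abstract, a per-word silence duration and the fusion of two sub-model outputs, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the paper: the alignment format `(word, start_sec, end_sec)`, the function names, and the fusion rule itself (a weighted average of the two sub-models' class posteriors with a hypothetical `weight` parameter). The paper's actual fusion model may differ.

```python
from typing import List, Sequence, Tuple

# Hypothetical alignment format: (word, start_sec, end_sec), e.g. from a forced aligner.
Alignment = Sequence[Tuple[str, float, float]]


def silence_after_each_word(alignment: Alignment, utterance_end: float) -> List[float]:
    """Return the silent gap in seconds that follows each word."""
    gaps = []
    for i, (_, _, end) in enumerate(alignment):
        # The gap runs to the start of the next word, or to the utterance end.
        if i + 1 < len(alignment):
            next_start = alignment[i + 1][1]
        else:
            next_start = utterance_end
        gaps.append(max(0.0, next_start - end))
    return gaps


def fuse_posteriors(
    lexical: Sequence[Sequence[float]],
    acoustic: Sequence[Sequence[float]],
    weight: float = 0.5,
) -> List[List[float]]:
    """Weighted average of per-word class posteriors from two sub-models.

    This is an assumed late-fusion rule for illustration only.
    """
    if len(lexical) != len(acoustic):
        raise ValueError("both sub-models must score the same number of words")
    fused = []
    for p_lex, p_ac in zip(lexical, acoustic):
        fused.append([weight * a + (1.0 - weight) * b for a, b in zip(p_lex, p_ac)])
    return fused


def label_boundaries(fused: Sequence[Sequence[float]]) -> List[int]:
    """Pick the highest-scoring boundary class index for each word."""
    return [max(range(len(p)), key=p.__getitem__) for p in fused]


if __name__ == "__main__":
    words = [("a", 0.0, 0.5), ("b", 0.5, 1.0), ("c", 1.4, 2.0)]
    print(silence_after_each_word(words, utterance_end=2.3))
    lex = [[0.9, 0.1], [0.6, 0.4]]
    ac = [[0.7, 0.3], [0.2, 0.8]]
    print(label_boundaries(fuse_posteriors(lex, ac)))
```

In this sketch the silence durations would serve as the input to the acoustic sub-model, and `weight` controls how much the lexical sub-model is trusted relative to the acoustic one; in practice such a weight would be tuned on held-out data.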
FU Ruibo, TAO Jianhua, LI Ya, WEN Zhengqi. Automatic prosodic boundary labeling based on fusing the silence duration with the lexical features[J]. Journal of Tsinghua University (Science and Technology), 2018, 58(1): 61-66, 74.