Journal of Tsinghua University (Science and Technology), 2017, Vol. 57, Issue (2): 202-207    DOI: 10.16511/j.cnki.qhdxxb.2017.22.015
Information Engineering
Describing and predicting affective messages for expressive speech synthesis
GAO Yingying, ZHU Weibin
Institute of Information Science, Beijing Jiaotong University, Beijing 100044, China
Full text: PDF (1161 KB)
Abstract: A multi-perspective emotion model is presented to characterize emotions in finer detail for expressive speech synthesis and to support automatic prediction. The model describes how speech emotion arises and evolves in terms of four aspects: cognitive appraisal, psychological feeling, physiological response, and utterance manner. On top of this description, a text-based emotion prediction model is built with a deep stacking network, a deep neural network that supports distributed feature representations and has a stacking structure. Experiments show that adding different emotional components and contextual information as input features improves prediction, validating both the effectiveness of the deep stacking network for emotion prediction and the soundness of the multi-perspective emotion model.
Key words: speech synthesis; emotion description; text-based emotion prediction; deep neural network
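The prediction model named in the abstract is the deep stacking network (DSN) of Deng, Yu, and Platt [10]: a stack of simple modules in which each module receives the raw input features concatenated with the predictions of every module below it, and solves for its linear output weights in closed form. Below is a minimal illustrative sketch of that stacking idea in Python/NumPy. The layer sizes, ridge term, placeholder feature and label arrays, and the use of fixed random hidden weights are assumptions made here for brevity; the actual DSN also trains the hidden weights, typically initialized from restricted Boltzmann machines [14].

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class DSNModule:
    """One stacking module: a sigmoid hidden layer (frozen random weights
    in this sketch) and a linear output layer fitted by ridge regression."""
    def __init__(self, n_in, n_hidden, ridge=1e-3):
        self.W = 0.1 * rng.standard_normal((n_in, n_hidden))
        self.ridge = ridge
        self.U = None  # output weights, solved in closed form

    def fit(self, X, Y):
        H = sigmoid(X @ self.W)                        # hidden activations
        A = H.T @ H + self.ridge * np.eye(H.shape[1])  # regularized normal equations
        self.U = np.linalg.solve(A, H.T @ Y)
        return H @ self.U

    def predict(self, X):
        return sigmoid(X @ self.W) @ self.U

class DSN:
    """Stack of modules: module k sees the raw features concatenated with
    the predictions of modules 1..k-1 (the stacking structure)."""
    def __init__(self, n_modules=3, n_hidden=64):
        self.n_modules = n_modules
        self.n_hidden = n_hidden
        self.modules = []

    def fit(self, X, Y):
        Z = X
        for _ in range(self.n_modules):
            module = DSNModule(Z.shape[1], self.n_hidden)
            pred = module.fit(Z, Y)
            self.modules.append(module)
            Z = np.hstack([Z, pred])  # pass predictions up the stack
        return self

    def predict(self, X):
        Z, pred = X, None
        for module in self.modules:
            pred = module.predict(Z)
            Z = np.hstack([Z, pred])
        return pred

# Hypothetical usage: predict scores on several emotion dimensions
# from per-sentence text feature vectors (shapes are placeholders).
X = rng.standard_normal((200, 50))  # 200 sentences, 50 text features each
Y = rng.standard_normal((200, 4))   # 4 emotion-dimension targets
model = DSN().fit(X, Y)
print(model.predict(X[:5]).shape)   # -> (5, 4)
```

Because each module's output layer has a closed-form solution, modules can be trained one at a time, which is what makes DSN learning easy to parallelize [10, 15].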
Received: 2016-06-29; Published: 2017-02-15
CLC numbers: TN912.33; TP391.1; TP183
Corresponding author: ZHU Weibin, Associate Professor, E-mail: wbzhu@bjtu.edu.cn
Cite this article:
GAO Yingying, ZHU Weibin. Describing and predicting affective messages for expressive speech synthesis[J]. Journal of Tsinghua University (Science and Technology), 2017, 57(2): 202-207.
Link to this article:
http://jst.tsinghuajournals.com/CN/10.16511/j.cnki.qhdxxb.2017.22.015  or  http://jst.tsinghuajournals.com/CN/Y2017/V57/I2/202
Figure 1 Schematic of the speech emotion generation process [9]
Figure 2 Multi-perspective emotion description model [9]
Figure 3 Multi-scale text-based emotion prediction model
Figure 4 Schematic of the deep stacking network module structure and connections
Table 1 Prediction results with different emotional components added
Table 2 Emotion prediction results with discourse-level and paragraph-level emotion information added
Table 3 Emotion prediction results with emotion information from the previous sentence added
[1] Govind D, Prasanna S R M. Expressive speech synthesis: A review[J]. International Journal of Speech Technology, 2013, 16(2): 237-260.
[2] XU Jun, CAI Lianhong. Hierarchical prosody analysis and modeling for emotional conversions[J]. Journal of Tsinghua University (Science and Technology), 2009, 49(S1): 1274-1277. (in Chinese)
[3] TAO Jianhua, KANG Yongguo, LI Aijun. Prosody conversion from neutral speech to emotional speech[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2006, 14(4): 1145-1154.
[4] HAN Jiqing, SHAO Yanqiu. Research progress of emotion processing based on speech signal[J]. Audio Engineering, 2006(5): 58-62. (in Chinese)
[5] Ekman P, Friesen W V, O'Sullivan M, et al. Universals and cultural differences in the judgments of facial expressions of emotion[J]. Journal of Personality and Social Psychology, 1987, 53(4): 712-717.
[6] Cowie R, Douglas-Cowie E, Savvidou S, et al. FEELTRACE: An instrument for recording perceived emotion in real time[C]//ISCA Tutorial and Research Workshop (ITRW) on Speech and Emotion. Newcastle, UK, 2000: 19-24.
[7] Mehrabian A. Pleasure-arousal-dominance: A general framework for describing and measuring individual differences in temperament[J]. Current Psychology, 1996, 14(4): 261-292.
[8] Moors A, Ellsworth P C, Scherer K R, et al. Appraisal theories of emotion: State of the art and future development[J]. Emotion Review, 2013, 5(2): 119-124.
[9] GAO Yingying, ZHU Weibin. The research for the description system of speech emotion[J]. Chinese Journal of Phonetics, 2013, 4: 71-81. (in Chinese)
[10] DENG Li, YU Dong, Platt J. Scalable stacking and learning for building deep architectures[C]//IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Kyoto, Japan, 2012: 2133-2136.
[11] Riedl M, Biemann C. Text segmentation with topic models[J]. Journal for Language Technology and Computational Linguistics, 2012, 27(1): 47-69.
[12] Blei D M, Ng A Y, Jordan M I. Latent Dirichlet allocation[J]. The Journal of Machine Learning Research, 2003, 3: 993-1022.
[13] Hinton G E, Salakhutdinov R R. Reducing the dimensionality of data with neural networks[J]. Science, 2006, 313(5786): 504-507.
[14] Hinton G. A practical guide to training restricted Boltzmann machines[J]. Momentum, 2010, 9(1): 599-619.
[15] YU Dong, DENG Li. Accelerated parallelizable neural network learning algorithm for speech recognition[C]//12th Annual Conference of the International Speech Communication Association (INTERSPEECH). Florence, Italy: ISCA Press, 2011: 2281-2284.