Journal of Tsinghua University (Science and Technology), 2017, Vol. 57, Issue 3: 250-256    DOI: 10.16511/j.cnki.qhdxxb.2017.26.005
Computer Science and Technology
Speech-driven video-realistic talking head synthesis using BLSTM-RNN
YANG Shan1, FAN Bo1, XIE Lei1, WANG Lijuan2, SONG Geping2
1. Shaanxi Provincial Key Laboratory of Speech and Image Information Processing, School of Computer Science, Northwestern Polytechnical University, Xi'an 710072, China;
2. Microsoft Research Asia, Beijing 100080, China
Full text: PDF (1716 KB)
Abstract: This paper describes a deep bidirectional long short-term memory (BLSTM) approach for speech-driven photo-realistic talking head animation. Long short-term memory (LSTM) is a recurrent neural network (RNN) architecture designed to model temporal sequences and their long-range dependencies more accurately than conventional RNNs. The deep BLSTM-RNN model is trained on a speaker's audio-visual bimodal data. An active appearance model (AAM) is used to model the facial movements, with the AAM parameters as the prediction targets of the neural network. This paper studies the impact of different network architectures and acoustic features. Tests on the LIPS2008 audio-visual corpus show that networks with BLSTM layer(s) consistently outperform those with only feed-forward layers. The best architecture on this dataset inserts a feed-forward layer between two 256-node BLSTM layers (BFB256). The combination of FBank, pitch, and energy features gives the best performance for the speech-driven talking head animation task.
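The mapping described above runs acoustic frames through bidirectional recurrent layers and regresses one AAM parameter vector per video frame. The forward pass can be sketched as follows; this is a minimal illustration, not the paper's implementation, and all dimensions, weight initializations, and helper names here are made up for the example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_pass(x, W, U, b, reverse=False):
    """One LSTM direction over a sequence x of shape (T, d_in).

    W: (d_in, 4*d_h), U: (d_h, 4*d_h), b: (4*d_h,) hold the input,
    forget, output, and candidate gates stacked along the last axis.
    Returns the hidden states, shape (T, d_h)."""
    T = x.shape[0]
    d_h = U.shape[0]
    h = np.zeros(d_h)
    c = np.zeros(d_h)
    out = np.zeros((T, d_h))
    order = reversed(range(T)) if reverse else range(T)
    for t in order:
        z = x[t] @ W + h @ U + b
        i, f, o, g = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # memory cell update
        h = sigmoid(o) * np.tanh(c)
        out[t] = h
    return out

def blstm_layer(x, fw_params, bw_params):
    """Concatenate forward and backward passes: (T, d_in) -> (T, 2*d_h)."""
    return np.concatenate(
        [lstm_pass(x, *fw_params), lstm_pass(x, *bw_params, reverse=True)],
        axis=1)

# Illustrative dimensions (not the paper's): 28-dim acoustic frames
# (e.g. FBank + pitch + energy), 8 hidden units per direction,
# 6 AAM parameters per video frame.
rng = np.random.default_rng(0)
d_in, d_h, d_out, T = 28, 8, 6, 25
make = lambda: (rng.normal(0, 0.1, (d_in, 4 * d_h)),
                rng.normal(0, 0.1, (d_h, 4 * d_h)),
                np.zeros(4 * d_h))
W_out = rng.normal(0, 0.1, (2 * d_h, d_out))

frames = rng.normal(size=(T, d_in))       # one utterance of acoustic frames
aam_traj = blstm_layer(frames, make(), make()) @ W_out
print(aam_traj.shape)                     # one AAM vector per input frame
```

Because the backward pass conditions each output on future frames as well as past ones, a BLSTM can exploit the full utterance context, which is what the paper credits for its advantage over purely feed-forward networks.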
Key words: talking avatar; facial animation; bidirectional long short term memory (BLSTM); recurrent neural network (RNN); active appearance model (AAM)
Received: 2016-06-25      Published: 2017-03-15
CLC number: TP391
Corresponding author: XIE Lei, professor, E-mail: lxie@nwpu.edu.cn
Cite this article:
YANG Shan, FAN Bo, XIE Lei, WANG Lijuan, SONG Geping. Speech-driven video-realistic talking head synthesis using BLSTM-RNN. Journal of Tsinghua University (Science and Technology), 2017, 57(3): 250-256.
Link to this article:
http://jst.tsinghuajournals.com/CN/10.16511/j.cnki.qhdxxb.2017.26.005  or  http://jst.tsinghuajournals.com/CN/Y2017/V57/I3/250
Fig. 1 Framework of the deep-neural-network-based facial animation synthesis system
Fig. 2 Facial feature points and triangulation
Fig. 3 LSTM memory cell
Fig. 4 DBLSTM-RNN model for facial animation synthesis
Table 1 Objective evaluation results for various network topologies
Fig. 5 First dimension of the predicted and ground-truth AAM parameters
Table 2 Objective evaluation results for various features with the BFB256 network
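The AAM parameters that the network predicts (Fig. 5) are, at their core, low-dimensional PCA coordinates of aligned face shape/appearance vectors. A minimal sketch of that parameterization follows; the helper names, toy data, and dimensions are invented for illustration and are not from the paper:

```python
import numpy as np

def fit_pca(X, n_components):
    """Fit PCA to row-vector samples X of shape (n_samples, n_features)."""
    mu = X.mean(axis=0)
    # SVD of the centered data; rows of Vt are the principal axes.
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:n_components]

def encode(x, mu, P):
    """Project a shape/appearance vector onto low-dim AAM-style parameters."""
    return (x - mu) @ P.T

def decode(p, mu, P):
    """Reconstruct a shape/appearance vector from the parameters."""
    return mu + p @ P

# Toy data: 100 "faces", each a 40-dim vector lying near a 5-dim subspace.
rng = np.random.default_rng(1)
basis = rng.normal(size=(5, 40))
X = rng.normal(size=(100, 5)) @ basis + rng.normal(0, 0.01, (100, 40))

mu, P = fit_pca(X, n_components=5)
params = encode(X[0], mu, P)      # 5 numbers drive the whole 40-dim vector
err = np.abs(decode(params, mu, P) - X[0]).max()
print(params.shape, err)
```

Predicting in this compact parameter space, rather than pixel space, is what lets the network output a few dozen numbers per frame and still reconstruct a full photo-realistic face image through the AAM decoder.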
[1] XIE Lei, SUN Naicai, FAN Bo. A statistical parametric approach to video-realistic text-driven talking avatar[J]. Multimedia Tools and Applications, 2014, 73(1):377-396.
[2] Berger M A, Hofer G, Shimodaira H. Carnival-combining speech technology and computer animation[J]. IEEE Computer Graphics and Applications, 2011, 31(5):80-89.
[3] YANG Minghao, TAO Jianhua, MU Kaihui, et al. A multimodal approach of generating 3D human-like talking agent[J]. Journal on Multimodal User Interfaces, 2012, 5(1-2):61-68.
[4] Bregler C, Covell M, Slaney M. Video rewrite:Driving visual speech with audio[C]//Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques. Los Angeles, CA, USA:ACM Press, 1997:353-360.
[5] Huang F J, Cosatto E, Graf H P. Triphone based unit selection for concatenative visual speech synthesis[C]//Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing. Orlando, FL, USA:IEEE, 2002:2037-2040.
[6] Ezzat T, Geiger G, Poggio T. Trainable videorealistic speech animation[J]. ACM Transactions on Graphics, 2004, 3(3):57-64.
[7] TAO Jianhua, YIN Panrong. Speech driven face animation based on dynamic concatenation model[J]. J Inf Computat Sci, 2007, 4(1):271-280.
[8] JIA Jia, WU Zhiyong, ZHANG Shen, et al. Head and facial gestures synthesis using PAD model for an expressive talking avatar[J]. Multimedia Tools and Applications, 2014, 73(1):439-461.
[9] ZHAO Kai, WU Zhiyong, JIA Jia, et al. An online speech driven talking head system[C]//Proceedings of the Global High Tech Congress on Electronics. Shenzhen, China:IEEE Press, 2012:186-187.
[10] Sako S, Tokuda K, Masuko T, et al. HMM-based text-to-audio-visual speech synthesis[C]//Proceedings of the International Conference on Spoken Language Processing. Beijing, China:IEEE Press, 2000:25-28.
[11] Eddy S R. Hidden Markov models[J]. Current Opinion in Structural Biology, 1996, 6(3):361-365.
[12] WANG Lijuan, QIAN Xiaojun, HAN Wei, et al. Synthesizing photo-real talking head via trajectory-guided sample selection[C]//Proceedings of the International Speech Communication Association. Makuhari, Japan:IEEE Press, 2010:446-449.
[13] Ze H, Senior A, Schuster M. Statistical parametric speech synthesis using deep neural networks[C]//Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing. Vancouver, Canada:IEEE Press, 2013:7962-7966.
[14] Hinton G, DENG Li, YU Dong, et al. Deep neural networks for acoustic modeling in speech recognition:The shared views of four research groups[J]. IEEE Signal Processing Magazine, 2012, 29(6):82-97.
[15] FAN Yuchen, QIAN Yao, XIE Fenglong, et al. TTS synthesis with bidirectional LSTM based recurrent neural networks[C]//Proceedings of the International Speech Communication Association. Singapore:IEEE Press, 2014:1964-1968.
[16] Kang S Y, Qian X J, Meng H. Multi-distribution deep belief network for speech synthesis[C]//Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing. Vancouver, Canada:IEEE Press, 2013:8012-8016.
[17] Schuster M, Paliwal K K. Bidirectional recurrent neural networks[J]. IEEE Transactions on Signal Processing, 1997, 45(11):2673-2681.
[18] Hochreiter S, Schmidhuber J. Long short-term memory[J]. Neural Computation, 1997, 9(8):1735-1780.
[19] FAN Bo, WANG Lijuan, Song F K, et al. Photo-real talking head with deep bidirectional LSTM[C]//Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing. Brisbane, Australia:IEEE Press, 2015:4884-4888.
[20] Cootes T F, Edwards G J, Taylor C J. Active appearance models[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001, 23(6):681-685.
[21] Werbos P J. Backpropagation through time:What it does and how to do it[J]. Proceedings of the IEEE, 1990, 78(10):1550-1560.
[22] Williams R J, Zipser D. Gradient-based learning algorithms for recurrent networks and their computational complexity[J]. Back-propagation:Theory, Architectures and Applications, 1995:433-486.
[23] Pérez P, Gangnet M, Blake A. Poisson image editing[C]//Proceedings of the ACM Transactions on Graphics. New York, NY, USA:ACM, 2003:313-318.
[24] WANG Qiang, ZHANG Weiwei, TANG Xiaoou, et al. Real-time Bayesian 3-D pose tracking[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2006, 16(12):1533-1541.
[25] Jolliffe I T. Principal component analysis[J]. Springer Berlin, 1986, 87(100):41-64.
[26] Stegmann M B. Active appearance models:Theory extensions and cases[J]. Informatics & Mathematical Modelling, 2000, 1(6):748-754.
[27] Roweis S. EM algorithms for PCA and SPCA[J]. Advances in Neural Information Processing Systems, 1999, 10:626-632.
[28] Cootes T F, Kittipanya-ngam P. Comparing variations on the active appearance model algorithm[C]//Proceedings of the 13th British Machine Vision Conference. Cardiff, Wales, UK:BMVA, 2002:1-10.
[29] Graves A, Mohamed A R, Hinton G. Speech recognition with deep recurrent neural networks[C]//Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing. Vancouver, Canada:IEEE Press, 2013:6645-6649.
[30] Hochreiter S, Schmidhuber J. Long short-term memory[J]. Neural Computation, 1997, 9(8):1735-1780.
[31] Schuster M, Paliwal K K. Bidirectional recurrent neural networks[J]. IEEE Transactions on Signal Processing, 1997, 45(11):2673-2681.
[32] Theobald B J, Fagel S, Bailly G, et al. LIPS2008:Visual speech synthesis challenge[C]//Proceedings of the International Speech Communication Association. Brisbane, Australia:IEEE Press, 2008:2310-2313.
[33] Young S, Evermann G, Gales M, et al. The HTK book[M]. Cambridge:Cambridge University Engineering Department, 2002.
Copyright © Editorial Office of Journal of Tsinghua University (Science and Technology)