Journal of Tsinghua University(Science and Technology), 2017, Vol. 57, Issue (3): 250-256. DOI: 10.16511/j.cnki.qhdxxb.2017.26.005
COMPUTER SCIENCE AND TECHNOLOGY
Speech-driven video-realistic talking head synthesis using BLSTM-RNN
YANG Shan1, FAN Bo1, XIE Lei1, WANG Lijuan2, SONG Geping2
1. Shaanxi Provincial Key Laboratory of Speech and Image Information Processing, School of Computer Science, Northwestern Polytechnical University, Xi'an 710072, China;
2. Microsoft Research Asia, Beijing 100080, China
Abstract: This paper describes a deep bidirectional long short-term memory (BLSTM) approach to speech-driven photo-realistic talking head animation. Long short-term memory (LSTM) is a recurrent neural network (RNN) architecture designed to model temporal sequences and their long-range dependencies more accurately than conventional RNNs. A deep BLSTM-RNN is trained on a speaker's audio-visual bimodal data, with the facial movements modeled by an active appearance model (AAM) and the AAM parameters serving as the prediction targets of the network. The paper studies the impact of different network architectures and acoustic features. Tests on the LIPS2008 audio-visual corpus show that networks with BLSTM layer(s) consistently outperform those with only feed-forward layers. On this dataset, the best network inserts a feed-forward layer between two BLSTM layers of 256 nodes each (BFB256). The combination of filter-bank (FBank) features, pitch, and energy gives the best-performing feature set for the speech-driven talking head animation task.
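To make the winning architecture concrete, here is a minimal sketch of the BFB256 topology: a feed-forward layer sandwiched between two 256-node BLSTM layers, regressing per-frame acoustic features to AAM parameters. It is written in PyTorch purely for illustration; the input and output dimensions, the activation, and every detail beyond the BLSTM-feedforward-BLSTM ordering are assumptions, not taken from the paper.

    # Minimal sketch of a BFB256-style network: BLSTM -> feed-forward -> BLSTM,
    # regressing per-frame acoustic features to AAM parameters.
    # Dimensions (43-dim input, 60-dim AAM output) are illustrative assumptions.
    import torch
    import torch.nn as nn

    class BFB256(nn.Module):
        def __init__(self, acoustic_dim=43, aam_dim=60, hidden=256):
            super().__init__()
            # First BLSTM layer, 256 nodes per direction (assumed).
            self.blstm1 = nn.LSTM(acoustic_dim, hidden,
                                  bidirectional=True, batch_first=True)
            # Feed-forward layer inserted between the two BLSTM layers.
            self.ff = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.Tanh())
            # Second BLSTM layer.
            self.blstm2 = nn.LSTM(hidden, hidden,
                                  bidirectional=True, batch_first=True)
            # Linear output: one AAM parameter vector per speech frame.
            self.out = nn.Linear(2 * hidden, aam_dim)

        def forward(self, x):          # x: (batch, frames, acoustic_dim)
            h, _ = self.blstm1(x)
            h = self.ff(h)
            h, _ = self.blstm2(h)
            return self.out(h)         # (batch, frames, aam_dim)

    model = BFB256()
    frames = model(torch.randn(2, 100, 43))   # two utterances, 100 frames each
    print(frames.shape)                       # torch.Size([2, 100, 60])

In a regression setup of this kind the network would typically be trained with a frame-level mean squared error loss against the ground-truth AAM trajectories.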
Keywords: talking avatar; facial animation; bidirectional long short-term memory (BLSTM); recurrent neural network (RNN); active appearance model (AAM)
CLC number: TP391
Issue Date: 15 March 2017
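The abstract reports that FBank features combined with pitch and energy form the best input feature set. The following is a hedged sketch of how such per-frame features could be assembled with librosa; the sampling rate, frame and hop sizes, the 40-band FBank dimension, and the choice of the YIN pitch tracker are illustrative assumptions, as the paper's exact acoustic front end is not given here.

    # Hedged sketch: per-frame FBank + pitch + energy features with librosa.
    # Sampling rate, frame/hop sizes, and band count are assumptions.
    import numpy as np
    import librosa

    def acoustic_features(wav_path, sr=16000, n_mels=40, hop=160, n_fft=400):
        y, _ = librosa.load(wav_path, sr=sr)
        # 40-dim log mel filter-bank (FBank) features, one column per frame.
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                             hop_length=hop, n_mels=n_mels)
        fbank = np.log(mel + 1e-10)
        # Frame-level F0 via the YIN estimator (longer window for low pitch).
        f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr,
                         frame_length=1024, hop_length=hop)
        # Frame-level energy (RMS).
        energy = librosa.feature.rms(y=y, frame_length=n_fft,
                                     hop_length=hop)[0]
        # Trim to a common frame count and stack to (frames, 42).
        n = min(fbank.shape[1], len(f0), len(energy))
        return np.vstack([fbank[:, :n], f0[None, :n], energy[None, :n]]).T

The resulting 42-dimensional frame sequence is the kind of input the BFB256 sketch above would consume.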
Cite this article:   
YANG Shan, FAN Bo, XIE Lei, et al. Speech-driven video-realistic talking head synthesis using BLSTM-RNN[J]. Journal of Tsinghua University(Science and Technology), 2017, 57(3): 250-256.
URL: http://jst.tsinghuajournals.com/EN/10.16511/j.cnki.qhdxxb.2017.26.005 or http://jst.tsinghuajournals.com/EN/Y2017/V57/I3/250