Automatic speech recognition by a Kinect sensor for a robot under ego noises

  • WANG Jianrong,
  • GAO Yongchun,
  • ZHANG Ju,
  • WEI Jianguo,
  • DANG Jianwu
  • 1. School of Computer Science and Technology, Tianjin University, Tianjin 300350, China;
    2. School of Computer Software, Tianjin University, Tianjin 300350, China

Received date: 2016-06-27

Online published: 2017-09-15

Cite this article

WANG Jianrong, GAO Yongchun, ZHANG Ju, WEI Jianguo, DANG Jianwu. Automatic speech recognition by a Kinect sensor for a robot under ego noises[J]. Journal of Tsinghua University (Science and Technology), 2017, 57(9): 921-925. DOI: 10.16511/j.cnki.qhdxxb.2017.26.041

Abstract

Audio-visual integration can effectively improve automatic speech recognition (ASR) for robots operating under their own ego noises. However, head rotations, differences in lip size and movement, varying camera-subject distances, and lighting changes prevent the lip information from being fully characterized and degrade ASR accuracy. This paper describes a multi-modal system that combines a robot with a Kinect sensor. The Kinect provides both 3-D data and visual information, and the lip profiles are reconstructed from the 3-D data to supplement the audio-visual streams. A range of feature-fusion and decision-fusion methods were investigated to incorporate the available multi-modal information. Tests under the robot's ego noises demonstrate that the proposed multi-modal system outperforms traditional audio-only and audio-visual speech recognition systems, improving speech recognition robustness.
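The two fusion strategies the abstract contrasts can be sketched as follows. This is an illustrative Python sketch only: the feature dimensions, stream weight, and candidate scores are hypothetical placeholders, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical per-frame features (the paper's actual dimensions are not given here):
audio_feat = rng.normal(size=(100, 39))   # e.g. 39-dim acoustic frames
visual_feat = rng.normal(size=(100, 20))  # e.g. 20-dim lip-profile frames

# Feature-level fusion: concatenate time-synchronized audio and visual
# frames into one observation vector, then train a single recognizer on it.
fused = np.concatenate([audio_feat, visual_feat], axis=1)
assert fused.shape == (100, 59)

def decision_fuse(audio_loglik, visual_loglik, lam=0.7):
    """Decision-level fusion: combine per-candidate log-likelihoods from
    separate audio and visual models; lam weights the audio stream."""
    return lam * audio_loglik + (1.0 - lam) * visual_loglik

# Hypothetical scores for three candidate words from each single-stream model:
audio_ll = np.array([-10.0, -12.0, -8.0])
visual_ll = np.array([-9.0, -7.0, -11.0])
combined = decision_fuse(audio_ll, visual_ll)
best = int(np.argmax(combined))  # index of the word the fused system picks
```

Under this weighting the audio stream's preference (candidate 2) wins, illustrating how decision fusion lets the more reliable stream dominate while the other stream still contributes.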
