Automatic speech recognition by a Kinect sensor for a robot under ego noises
WANG Jianrong1, GAO Yongchun1, ZHANG Ju1, WEI Jianguo2, DANG Jianwu1
1. School of Computer Science and Technology, Tianjin University, Tianjin 300350, China;
2. School of Computer Software, Tianjin University, Tianjin 300350, China
Abstract: Audio-visual integration can effectively improve automatic speech recognition (ASR) for robots operating under ego noise. However, head rotations, differences in lip movements, camera-subject distance, and lighting variations all degrade ASR accuracy. This paper describes a robot equipped with a Kinect sensor in a multi-modal recognition system. The Kinect provides both 3-D data and visual information, and the lip profiles are reconstructed from the 3-D data to extract more accurate information from the video. Different fusion methods were investigated to integrate the available multi-modal information. Tests under the robot's ego noise demonstrate that the multi-modal system outperforms traditional audio-only and audio-visual speech recognition, with improved recognition robustness.
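One common way to integrate audio and visual streams, as a point of comparison for the fusion methods mentioned above, is decision-level (late) fusion: each stream produces per-class log-likelihood scores, and a weighted sum selects the recognized word. The sketch below is illustrative only; the function `late_fusion` and the stream weight `alpha` are assumptions for exposition, not the specific method of this paper.

```python
import numpy as np

def late_fusion(audio_loglik, visual_loglik, alpha=0.7):
    """Weighted decision-level fusion of per-class log-likelihoods.

    alpha weights the audio stream and (1 - alpha) the visual stream;
    both inputs are arrays with one log-likelihood per word class.
    Returns the index of the winning class. (Illustrative sketch,
    not the paper's actual fusion scheme.)
    """
    audio_loglik = np.asarray(audio_loglik, dtype=float)
    visual_loglik = np.asarray(visual_loglik, dtype=float)
    fused = alpha * audio_loglik + (1.0 - alpha) * visual_loglik
    return int(np.argmax(fused))

# Three word classes: the audio stream favors class 1, the visual
# stream favors class 2; with equal weights the visual evidence wins.
print(late_fusion([-10.0, -2.0, -5.0], [-8.0, -6.0, -1.0], alpha=0.5))  # 2
```

In noisy conditions such as robot ego noise, `alpha` is typically lowered so that the visual stream, which is unaffected by acoustic noise, carries more weight.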