Journal of Tsinghua University (Science and Technology), 2017, Vol. 57, Issue 9: 921-925    DOI: 10.16511/j.cnki.qhdxxb.2017.26.041
Computer Science and Technology
Automatic speech recognition by a Kinect sensor for a robot under ego noises
WANG Jianrong1, GAO Yongchun1, ZHANG Ju1, WEI Jianguo2, DANG Jianwu1
1. School of Computer Science and Technology, Tianjin University, Tianjin 300350, China;
2. School of Computer Software, Tianjin University, Tianjin 300350, China
Full text: PDF (1454 KB)
Abstract (Chinese): Audio-visual fusion can improve a robot's speech recognition performance in noisy environments. However, the lip information cannot be fully and effectively represented because of factors such as the speaker's head rotation, differing lip sizes, varying distance from the camera, and lighting. This paper proposes a multi-modal system that integrates a robot with a Kinect sensor. The system uses the Kinect to acquire 3-D data and visual information, and reconstructs the lip profile from the 3-D data to supplement the audio-visual information. Results of a series of feature-fusion and decision-fusion methods show that the proposed multi-modal system outperforms audio-visual single-stream and dual-stream speech recognition systems, and can assist robot speech recognition under the robot's own (ego) noise.
Keywords: humanoid robot; ego noise; automatic speech recognition; Kinect; multi-modal system
Abstract: Audio-visual integration can effectively improve automatic speech recognition (ASR) for robots under ego noise. However, head rotations, lip movement differences, camera-subject distance, and lighting variations all degrade ASR accuracy. This paper describes a multi-modal system combining a robot with a Kinect sensor. The Kinect provides 3-D data and visual information, and the lip profiles are reconstructed from the 3-D data to obtain more accurate information from the video. Different fusion methods were investigated to incorporate the available multi-modal information. Tests under the robot's ego noise demonstrate that the multi-modal system is superior to traditional audio-only and audio-visual speech recognition, with improved recognition robustness.
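One of the fusion families the abstract mentions is feature-level fusion, which in audio-visual ASR typically means concatenating the per-frame feature vectors of the streams before decoding. The sketch below is an illustrative assumption, not the authors' implementation; the helper name, feature dimensions, and nearest-neighbour alignment of the slower video stream are all hypothetical.

```python
import numpy as np

def feature_fusion(audio_feats, visual_feats):
    """Feature-level fusion: align the visual stream to the audio frame
    rate, then concatenate feature vectors frame by frame.
    Hypothetical helper for illustration only."""
    n = audio_feats.shape[0]
    # Upsample the (typically slower) visual stream by nearest-neighbour
    # index repetition so both streams have one vector per audio frame.
    idx = np.linspace(0, visual_feats.shape[0] - 1, n).round().astype(int)
    return np.concatenate([audio_feats, visual_feats[idx]], axis=1)

# Toy data: 100 audio frames of 39-dim MFCCs, 25 video frames of
# 30-dim lip features (dimensions are assumptions, not from the paper).
audio = np.random.randn(100, 39)
visual = np.random.randn(25, 30)
fused = feature_fusion(audio, visual)
print(fused.shape)  # (100, 69)
```

The fused vectors would then be fed to a single recognizer, in contrast to decision fusion, where each stream is decoded separately and the scores are combined afterwards.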
Key words: humanoid robot; ego noises; automatic speech recognition; Kinect multi-sensor; multi-modal system
Received: 2016-06-27      Published: 2017-09-15
CLC numbers: TP242; TN912.34
Corresponding author: WEI Jianguo, Professor, E-mail: jianguo@tju.edu.cn
Cite this article:
WANG Jianrong, GAO Yongchun, ZHANG Ju, WEI Jianguo, DANG Jianwu. Automatic speech recognition by a Kinect sensor for a robot under ego noises. Journal of Tsinghua University (Science and Technology), 2017, 57(9): 921-925.
Link to this article:
http://jst.tsinghuajournals.com/CN/10.16511/j.cnki.qhdxxb.2017.26.041  or  http://jst.tsinghuajournals.com/CN/Y2017/V57/I9/921
Fig. 1  Multi-modal system integrating the Kinect and the robot
Fig. 2  Reconstruction of the right lip profile from the 3-D data
Table 1  Fusion methods used in this paper
Fig. 3  Speech recognition results for the video single stream and for dual-stream feature fusion of video and 3-D features
Table 2  Optimal stream weights under different SNRs
Fig. 4  Recognition results with feature fusion
Fig. 5  Recognition results with decision fusion
Fig. 6  Recognition results with feature fusion and decision fusion
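Table 2's per-SNR "optimal stream weights" suggest the standard weighted log-linear form of decision fusion: each stream is decoded independently and its per-class log-likelihoods are combined with SNR-dependent exponent weights. The sketch below is a minimal illustration under that assumption; the function name, the three-stream setup, and all numeric values are hypothetical, not taken from the paper.

```python
import numpy as np

def decision_fusion(stream_loglikes, weights):
    """Decision-level fusion: weighted log-linear combination of
    per-class log-likelihoods from several streams, then argmax.
    Weights are assumed to sum to 1 (as stream exponents usually do)."""
    weights = np.asarray(weights, dtype=float)
    assert np.isclose(weights.sum(), 1.0)
    combined = sum(w * ll for w, ll in zip(weights, stream_loglikes))
    return int(np.argmax(combined))

# Toy scores: three streams (audio, video, 3-D lip) over 4 candidate words.
audio_ll = np.array([-12.0, -9.5, -11.0, -14.0])
video_ll = np.array([-8.0, -10.0, -7.5, -9.0])
lip3d_ll = np.array([-9.0, -9.2, -8.0, -10.5])

# At low SNR the audio stream would plausibly receive a smaller weight,
# mirroring the idea behind Table 2's SNR-dependent optimal weights.
best = decision_fusion([audio_ll, video_ll, lip3d_ll], [0.2, 0.4, 0.4])
print(best)  # → 2
```

Retuning the weight vector per noise condition is what lets such a system lean on the visual and 3-D streams when the robot's ego noise corrupts the audio.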
Copyright © Editorial Office of the Journal of Tsinghua University (Science and Technology)