Estimation algorithm of driver's gaze zone based on lightweight spatial feature encoding network
ZHANG Mingfang1, LI Guilin1, WU Chuna2, WANG Li1, TONG Lianghao1
1. Beijing Key Laboratory of Urban Road Intelligent Traffic Control Technology, North China University of Technology, Beijing 100144, China; 2. Key Laboratory of Operation Safety Technology on Transport Vehicles, Research Institute of Highway, Ministry of Transport, Beijing 100088, China
Abstract: [Objective] Real-time monitoring of a driver's gaze region is essential for human-machine shared driving vehicles to understand and predict the driver's intentions. Because in-vehicle platforms have limited computational resources and storage capacity, existing gaze region estimation algorithms often struggle to balance accuracy and real-time performance, and they ignore temporal information. [Methods] This paper therefore proposes a lightweight spatial feature encoding network (LSFENet) for driver gaze region estimation. First, an RGB camera captures an image sequence of the driver's upper body. To handle cluttered backgrounds and facial occlusions in the captured images, preprocessing steps (face alignment and glasses removal) produce left- and right-eye images and facial keypoint coordinates. Face alignment uses the multi-task cascaded convolutional network (MTCNN) algorithm, and glasses are removed using the cycle-consistent adversarial network (CycleGAN) algorithm. Second, because the inverted residual structure of MobileNetV2 requires considerable memory and floating-point operations and ignores the redundancy and correlation among feature maps, we build the LSFENet feature extraction network by improving the MobileNetV2 architecture with the GCSbottleneck module. We embed a ghost module to reduce memory consumption and integrate channel and spatial attention modules to extract cross-channel and spatial information from the feature maps. Next, the Kronecker product is used to fuse the eye features with the facial keypoint features, which reduces the impact of the imbalance in information complexity between them. Then, the fused features from consecutive frames are fed into a recurrent neural network to estimate the gaze zone for the image sequence. Finally, the proposed network is evaluated on the public Driver Gaze in the Wild (DGW) dataset and a self-collected dataset.
The evaluation metrics include the number of parameters, the number of floating-point operations (FLOPs), the frames per second (FPS), and the F1 score. [Results] The experimental results showed the following: (1) The gaze region estimation accuracy of the proposed algorithm was 97.08%, approximately 7% higher than that of the original MobileNetV2. Additionally, both the number of parameters and the FLOPs were reduced by 22.5%, and the FPS was improved by 36.43%. The proposed network ran at approximately 103 FPS and satisfied the computational efficiency and accuracy requirements of in-vehicle environments. (2) The estimation accuracies for gaze regions 1, 2, 3, 4, and 9 exceeded 85% with the proposed algorithm. The macro-average and micro-average precisions on the DGW dataset reached 74.32% and 76.01%, respectively. (3) The proposed algorithm provided high classification accuracy for fine-grained eye images with small intra-class differences. (4) The class activation mapping visualizations demonstrated that the proposed algorithm adapted well to various lighting conditions and to occlusion by glasses. [Conclusions] These results are significant for recognizing a driver's visual distraction states.
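The Kronecker-product fusion step named in the abstract can be illustrated with a minimal sketch. The abstract does not give the feature dimensions or say exactly how the product is applied inside LSFENet, so the function name `kronecker_fuse` and the toy vectors below are hypothetical. The sketch only shows the operation itself on two 1-D feature vectors, where the result matches what `numpy.kron` returns for 1-D inputs.

```python
def kronecker_fuse(eye_features, keypoint_features):
    """Kronecker product of two 1-D feature vectors (illustrative only).

    Every eye feature is multiplied by every keypoint feature, so the
    fused vector has len(eye_features) * len(keypoint_features) entries
    and encodes all pairwise interactions between the two inputs.
    """
    return [e * k for e in eye_features for k in keypoint_features]


# Hypothetical toy vectors; in practice these would be network outputs.
eye = [1.0, 2.0, 3.0]
keypoints = [0.5, -1.0]

fused = kronecker_fuse(eye, keypoints)
print(len(fused), fused)
```

Because the fused length is the product of the two input lengths, a fusion of this kind would typically be applied to compact feature vectors rather than to full feature maps.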
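The difference between the macro-average and micro-average precisions reported above can be made concrete with a small sketch. The confusion matrix below is hypothetical and is not data from the paper. The helper name `macro_micro_precision` is likewise an assumption introduced for illustration. Macro averaging takes the mean of the per-zone precisions, whereas micro averaging pools the counts over all zones first.

```python
def macro_micro_precision(confusion):
    """Return (macro, micro) precision from a square confusion matrix.

    confusion[i][j] is the number of samples whose true zone is i and
    whose predicted zone is j.
    """
    n = len(confusion)
    per_class = []
    total_tp = 0
    total_predicted = 0
    for j in range(n):
        tp = confusion[j][j]
        predicted = sum(confusion[i][j] for i in range(n))  # column sum
        per_class.append(tp / predicted if predicted else 0.0)
        total_tp += tp
        total_predicted += predicted
    macro = sum(per_class) / n
    micro = total_tp / total_predicted if total_predicted else 0.0
    return macro, micro


# Hypothetical 3-zone confusion matrix, not results from the paper.
confusion = [
    [8, 2, 0],
    [1, 1, 0],
    [0, 1, 7],
]
macro, micro = macro_micro_precision(confusion)
print(f"macro={macro:.4f} micro={micro:.4f}")
```

In this toy case the rarely predicted middle zone has low precision, which pulls the macro average below the micro average. A gap in the same direction between the two reported DGW figures would be consistent with some zones being harder than others, although the abstract itself does not state the cause.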