Abstract: Big data processing systems are widely used in academia and industry to handle DNN-based inference workloads in fields such as video analytics. In such systems, multiple parallel inference tasks repeatedly load the same read-only DNN model, so GPU resources are not fully utilized, creating a bottleneck that limits inference performance. This paper presents a model sharing technique that allows multiple DNN inference tasks on a single GPU card to share one copy of the model, together with an allocator that extends the technique to every GPU in a distributed environment. The method was implemented on a Spark-based GPU platform to support large-scale distributed inference workloads. Experiments with video analytics on the YOLO-v3 model show that model sharing reduces GPU memory overhead and improves system throughput by up to 136% compared with the same system without model sharing.
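As a rough illustration of the idea (not the paper's actual implementation), the PySpark sketch below caches one model instance per executor process so that every inference task scheduled on that executor reuses it instead of loading its own copy; load_yolo_model, model.detect, and the HDFS paths are hypothetical stand-ins for a real loader, inference call, and data layout.

from pyspark import SparkContext

_MODEL = None  # one cached, read-only model per executor process

def get_shared_model():
    # Load the model on first use, then reuse it for every later task
    # scheduled on this executor (the "model sharing" idea).
    global _MODEL
    if _MODEL is None:
        _MODEL = load_yolo_model("yolov3.weights")  # hypothetical loader
    return _MODEL

def infer_partition(frames):
    model = get_shared_model()  # shared across tasks, loaded only once
    for frame in frames:
        yield model.detect(frame)  # hypothetical inference call

if __name__ == "__main__":
    sc = SparkContext(appName="shared-model-inference")
    # Video frames pre-decoded and stored on HDFS (an assumption).
    frames = sc.binaryFiles("hdfs:///video/frames").values()
    frames.mapPartitions(infer_partition).saveAsTextFile("hdfs:///video/out")

In a real deployment, the paper's allocator would additionally bind each executor's shared model to a specific GPU in the cluster; that device-assignment logic is omitted from this sketch.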