Journal of Tsinghua University (Science and Technology), 2022, Vol. 62, Issue 9: 1435-1441    DOI: 10.16511/j.cnki.qhdxxb.2022.25.006
Model sharing for GPU-accelerated DNN inference in big data processing systems
DING Guangyao, CHEN Qihang, XU Chen, QIAN Weining, ZHOU Aoying
School of Data Science and Engineering, East China Normal University, Shanghai 200333, China
Abstract: Big data processing systems are widely used in academia and industry to handle inference workloads based on deep neural networks (DNNs), for example in video analytics. In such workloads, multiple parallel inference tasks repeatedly load the same read-only DNN model, so the system cannot fully utilize its GPU resources, which becomes the bottleneck for inference performance. This paper presents a model sharing technique for a single GPU card that lets concurrent DNN inference tasks share one copy of the model data. On top of this, an allocator is designed so that model sharing applies to every GPU card in a distributed environment. These optimizations were integrated into Spark running on a GPU platform to build a distributed prototype system that supports large-scale inference workloads. Experiments on a YOLO-v3 based traffic video processing workload show that model sharing reduces GPU memory overhead and improves system throughput by up to 136% compared with a system without model sharing.
Key words: big data processing system; DNN inference; GPU; GPU memory; model sharing
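To make the mechanism described in the abstract concrete, the following is a minimal sketch of the two ideas, not the paper's Spark-integrated implementation: (1) parallel inference tasks on one GPU reuse a single loaded copy of the read-only model instead of each loading its own, and (2) a simple allocator pins tasks to GPUs on a multi-GPU node so that every GPU holds exactly one shared copy. The sketch assumes a PyTorch stack, substitutes a tiny stand-in network for YOLO-v3, and simulates parallel inference tasks with Python threads rather than Spark tasks; the names get_shared_model, pick_device, and inference_task are illustrative only.

import threading
from concurrent.futures import ThreadPoolExecutor

import torch

_lock = threading.Lock()
_models = {}  # device -> the single shared, read-only model on that device


def _build_model() -> torch.nn.Module:
    # Stand-in for the read-only DNN; the paper evaluates YOLO-v3.
    return torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3), torch.nn.ReLU())


def get_shared_model(device: str) -> torch.nn.Module:
    # Load the model at most once per device; every task reuses that copy.
    with _lock:
        if device not in _models:
            _models[device] = _build_model().to(device).eval()
        return _models[device]


def pick_device(worker_id: int) -> str:
    # Toy allocator: spread workers round-robin over the node's GPUs.
    if torch.cuda.is_available():
        return "cuda:%d" % (worker_id % torch.cuda.device_count())
    return "cpu"


def inference_task(worker_id: int, frames: list) -> int:
    device = pick_device(worker_id)
    model = get_shared_model(device)  # shared copy, not re-loaded per task
    with torch.no_grad():
        for frame in frames:
            model(frame.unsqueeze(0).to(device))
    return len(frames)


if __name__ == "__main__":
    # Eight parallel "tasks", each processing a small batch of random frames.
    batches = [[torch.rand(3, 64, 64) for _ in range(4)] for _ in range(8)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        processed = list(pool.map(inference_task, range(8), batches))
    print("frames processed per task:", processed)

In the paper's setting the same principle is applied inside the Spark-based prototype on GPU platforms, so each GPU keeps a single model copy regardless of how many inference tasks it serves.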
Received: 2021-07-15; Published online: 2022-08-18
Corresponding author: XU Chen, associate professor, E-mail: cxu@dase.ecnu.edu.cn
Cite this article:
DING Guangyao, CHEN Qihang, XU Chen, QIAN Weining, ZHOU Aoying. Model sharing for GPU-accelerated DNN inference in big data processing systems. Journal of Tsinghua University (Science and Technology), 2022, 62(9): 1435-1441.
Article links:
http://jst.tsinghuajournals.com/CN/10.16511/j.cnki.qhdxxb.2022.25.006  or  http://jst.tsinghuajournals.com/CN/Y2022/V62/I9/1435