Journal of Tsinghua University (Science and Technology), 2022, Vol. 62, Issue 9: 1417-1425    DOI: 10.16511/j.cnki.qhdxxb.2021.22.041
PS-Hybrid: Hybrid communication framework for large recommendation model training
MIAO Xupeng1, ZHANG Minxu1, SHAO Yingxia2, CUI Bin1
1. School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China;
2. School of Computer Science, Beijing University of Posts and Telecommunications, Beijing 100876, China
Abstract: Most traditional distributed deep learning training systems are built on either the parameter server (PS) or the AllReduce communication framework, and their drawbacks are becoming increasingly apparent. Because recommendation models have huge numbers of parameters, the decentralized AllReduce architecture cannot be used, since no single node can store the entire model; because the communication volume is large, the centralized parameter server architecture faces a severe communication bottleneck. To address these problems, this paper proposes PS-Hybrid, a hybrid communication training framework for large-scale deep learning recommendation models, which separates the communication logic of the embedding-layer parameters from that of the other parameters, and implements a PS-Hybrid prototype system. Experimental results show that the proposed hybrid communication scheme achieves better performance than a pure parameter server scheme, with a 48% speedup over TensorFlow-PS on 16 computing nodes.
Key words: recommendation model; distributed deep learning; parameter server; AllReduce
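To make the split concrete, the following is a minimal single-process sketch of the hybrid idea described in the abstract: sparse embedding rows are pulled from and pushed to a parameter server, while the small dense parameters are synchronized with an AllReduce-style average across workers. The names used here (ToyParameterServer, allreduce_average) are hypothetical illustrations under these assumptions, not the authors' PS-Hybrid implementation.

# Minimal sketch of hybrid communication for recommendation model training.
# PS path: only the embedding rows touched by a batch are pulled/pushed.
# AllReduce path: dense parameters are replicated and their gradients averaged.
import numpy as np


class ToyParameterServer:
    """Holds the full embedding table; workers exchange only the rows they touch."""

    def __init__(self, num_embeddings, dim):
        self.table = np.zeros((num_embeddings, dim))

    def pull(self, ids):
        # Return only the requested embedding rows.
        return self.table[ids]

    def push(self, ids, grads, lr=0.1):
        # Sparse SGD update; duplicate ids accumulate correctly.
        np.subtract.at(self.table, ids, lr * grads)


def allreduce_average(grads_per_worker):
    """AllReduce stand-in: average dense gradients across all workers."""
    return sum(grads_per_worker) / len(grads_per_worker)


# One simulated training step with two workers.
ps = ToyParameterServer(num_embeddings=1000, dim=4)
dense_weights = np.ones(4)                       # replicated dense parameters

worker_batches = [np.array([3, 17, 3]),          # embedding ids touched by worker 0
                  np.array([42, 17])]            # embedding ids touched by worker 1

dense_grads = []
for ids in worker_batches:
    emb = ps.pull(ids)                           # PS path: pull needed rows only
    emb_grad = np.ones_like(emb)                 # placeholder embedding gradient
    ps.push(ids, emb_grad)                       # PS path: sparse push
    dense_grads.append(np.full(4, 0.5))          # placeholder dense-layer gradient

dense_weights -= 0.1 * allreduce_average(dense_grads)   # AllReduce path
print(dense_weights)
print(ps.table[[3, 17, 42]])

In a real deployment the PS path would transmit only the touched embedding rows over the network, while the AllReduce path would use a collective communication library such as NCCL; this division is what avoids both full replication of the embedding table and routing the dense-layer traffic through a central server.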
Received: 2021-07-22      Published: 2022-08-18
Corresponding author: CUI Bin, professor, E-mail: bin.cui@pku.edu.cn
Cite this article:
MIAO Xupeng, ZHANG Minxu, SHAO Yingxia, CUI Bin. PS-Hybrid: Hybrid communication framework for large recommendation model training[J]. Journal of Tsinghua University (Science and Technology), 2022, 62(9): 1417-1425.
Link to this article:
http://jst.tsinghuajournals.com/CN/10.16511/j.cnki.qhdxxb.2021.22.041  or  http://jst.tsinghuajournals.com/CN/Y2022/V62/I9/1417