Paper Title

Towards Scalable Distributed Training of Deep Learning on Public Cloud Clusters

Paper Authors

Shaohuai Shi, Xianhao Zhou, Shutao Song, Xingyao Wang, Zilin Zhu, Xue Huang, Xinan Jiang, Feihu Zhou, Zhenyu Guo, Liqiang Xie, Rui Lan, Xianbin Ouyang, Yan Zhang, Jieqian Wei, Jing Gong, Weiliang Lin, Ping Gao, Peng Meng, Xiaomin Xu, Chenyang Guo, Bo Yang, Zhibo Chen, Yongjian Wu, Xiaowen Chu

Paper Abstract

Distributed training techniques have been widely deployed for training large-scale deep neural networks (DNNs) on dense-GPU clusters. However, on public cloud clusters, due to the moderate inter-connection bandwidth between instances, traditional state-of-the-art distributed training systems do not scale well when training large models. In this paper, we propose a new computation- and communication-efficient top-k sparsification communication library for distributed training. To further improve system scalability, we optimize I/O with a simple yet efficient multi-level data caching mechanism and optimize the update operation with a novel parallel tensor operator. Experimental results on a 16-node Tencent Cloud cluster (each node with 8 Nvidia Tesla V100 GPUs) show that our system is 25%-40% faster than existing state-of-the-art systems on CNNs and Transformer. Finally, we break the DAWNBench record for training ResNet-50 to 93% top-5 accuracy on ImageNet.
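To make the core idea concrete, below is a minimal sketch of top-k gradient sparsification with local error feedback, the general technique the abstract refers to. This is not the authors' communication library or its API; the class name `TopKCompressor`, the method names, and the 1% density value are illustrative assumptions only.

```python
# Minimal sketch of top-k gradient sparsification with error feedback.
# Illustrative only: names and the default density are assumptions, not the paper's API.
import torch


class TopKCompressor:
    """Transmit only the k largest-magnitude gradient entries per tensor;
    accumulate the untransmitted remainder locally as a residual so that
    gradient information is not permanently dropped."""

    def __init__(self, density=0.01):
        self.density = density      # fraction of entries to communicate (assumed 1%)
        self.residuals = {}         # per-tensor local error feedback

    def compress(self, name, grad):
        flat = grad.flatten()
        # Error feedback: fold in the residual left over from the previous step.
        if name in self.residuals:
            flat = flat + self.residuals[name]
        k = max(1, int(flat.numel() * self.density))
        _, indices = torch.topk(flat.abs(), k)
        values = flat[indices]
        # Keep the unsent entries as the residual for the next iteration.
        residual = flat.clone()
        residual[indices] = 0.0
        self.residuals[name] = residual
        return values, indices      # only these (values + indices) are communicated

    def decompress(self, values, indices, shape):
        # Scatter the received sparse entries back into a dense tensor.
        out = torch.zeros(shape, device=values.device).flatten()
        out[indices] = values
        return out.reshape(shape)


# Illustrative usage on a single gradient tensor:
grad = torch.randn(1024, 1024)
comp = TopKCompressor(density=0.01)
vals, idx = comp.compress("layer1.weight", grad)
restored = comp.decompress(vals, idx, grad.shape)
```

In a real distributed setting, the (values, indices) pairs would be exchanged among workers instead of the full dense gradients, which is what reduces communication volume on bandwidth-limited public cloud instances; the paper's contribution is making this exchange computation- and communication-efficient, which the sketch above does not attempt to reproduce.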
