Paper Title

DAPPLE: A Pipelined Data Parallel Approach for Training Large Models

Authors

Shiqing Fan, Yi Rong, Chen Meng, Zongyan Cao, Siyu Wang, Zhen Zheng, Chuan Wu, Guoping Long, Jun Yang, Lixue Xia, Lansong Diao, Xiaoyong Liu, Wei Lin

Abstract

It is a challenging task to train large DNN models on sophisticated GPU platforms with diversified interconnect capabilities. Recently, pipelined training has been proposed as an effective approach to improve device utilization. However, several tricky issues remain: improving computing efficiency while ensuring convergence, and reducing memory usage without incurring additional computing cost. We propose DAPPLE, a synchronous training framework that combines data parallelism and pipeline parallelism for large DNN models. It features a novel parallelization strategy planner that solves the partition and placement problems and explores the optimal hybrid strategy of data and pipeline parallelism. We also propose a new runtime scheduling algorithm that reduces device memory usage, is orthogonal to the re-computation approach, and does not come at the expense of training throughput. Experiments show that the DAPPLE planner consistently outperforms the strategies generated by PipeDream's planner by up to 3.23x under synchronous training scenarios, and the DAPPLE runtime outperforms GPipe with a 1.6x training-throughput speedup while reducing memory consumption by 12%.
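The memory-saving claim in the abstract comes from scheduling backward passes early, so that a micro-batch's activations are freed as soon as its backward pass can run, instead of being held until all forward passes finish as in a GPipe-style schedule. Below is a minimal, self-contained sketch (not the authors' implementation) that counts live activations on a single pipeline stage under the two schedules; the function names, the warmup parameter, and the choice of 8 micro-batches are illustrative assumptions.

```python
# Toy comparison of two micro-batch schedules on one pipeline stage:
#   * GPipe-style: run all forward passes first, then all backward passes;
#   * early-backward (in the spirit of DAPPLE): after a short warm-up,
#     alternate backward and forward so activations are released early.
# We only track how many micro-batches have "live" activations at any point,
# a rough proxy for peak activation memory on that stage.

def gpipe_schedule(num_micro_batches):
    """All forwards, then all backwards."""
    ops = [("F", i) for i in range(num_micro_batches)]
    ops += [("B", i) for i in range(num_micro_batches)]
    return ops

def early_backward_schedule(num_micro_batches, warmup):
    """Inject `warmup` forwards, then strictly alternate backward/forward."""
    ops = [("F", i) for i in range(warmup)]
    next_fwd, next_bwd = warmup, 0
    while next_bwd < num_micro_batches:
        ops.append(("B", next_bwd))
        next_bwd += 1
        if next_fwd < num_micro_batches:
            ops.append(("F", next_fwd))
            next_fwd += 1
    return ops

def peak_live_activations(ops):
    """Peak number of micro-batches whose activations are held at once."""
    live, peak = 0, 0
    for kind, _ in ops:
        if kind == "F":
            live += 1   # a forward pass stores activations for its backward
        else:
            live -= 1   # a backward pass consumes and frees them
        peak = max(peak, live)
    return peak

if __name__ == "__main__":
    M = 8  # micro-batches per mini-batch (hypothetical value)
    print("GPipe-style peak:", peak_live_activations(gpipe_schedule(M)))                        # -> 8
    print("Early-backward peak:", peak_live_activations(early_backward_schedule(M, warmup=2)))  # -> 2
```

In the toy model, the GPipe-style peak grows with the number of micro-batches, while the early-backward peak is bounded by the warm-up depth. In DAPPLE's actual schedule the warm-up depth depends on a stage's position in the pipeline and forward/backward passes are interleaved across stages; the sketch only captures why releasing activations early bounds peak memory independently of the micro-batch count.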
