Paper Title
DS-Sync: Addressing Network Bottlenecks with Divide-and-Shuffle Synchronization for Distributed DNN Training
Paper Authors
Paper Abstract
Bulk synchronous parallel (BSP) is the de-facto paradigm for distributed DNN training in today's production clusters. However, due to its global synchronization nature, its performance can be significantly degraded by network bottlenecks caused by either static topology heterogeneity or dynamic bandwidth contention. Existing solutions, whether system-level optimizations strengthening BSP (e.g., Ring or Hierarchical All-reduce) or algorithmic optimizations replacing BSP (e.g., ASP or SSP, which relax the global barriers), do not completely solve the problem, as they may still suffer from communication inefficiency or risk convergence inaccuracy. In this paper, we present a novel divide-and-shuffle synchronization (DS-Sync) to realize communication efficiency without sacrificing convergence accuracy for distributed DNN training. At its heart, DS-Sync takes network bottlenecks into account and improves communication efficiency by dividing workers into non-overlapping groups that synchronize independently in a bottleneck-free manner. Meanwhile, it maintains convergence accuracy by iteratively shuffling workers among different groups to ensure a global consensus. We theoretically prove that DS-Sync converges properly under non-convex and smooth conditions, as found in DNNs. We further implement DS-Sync and integrate it with PyTorch, and our testbed experiments show that DS-Sync can achieve up to $94\%$ improvement in end-to-end training time over existing solutions while maintaining the same accuracy.
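To make the divide-and-shuffle idea concrete, below is a minimal, self-contained Python sketch (not the authors' implementation, which integrates with PyTorch's distributed runtime): at each step, workers are partitioned into non-overlapping groups that average parameters only within their own group, and the partition is re-shuffled every iteration so that consensus spreads globally over time. All names here (`shuffle_partition`, `group_average`, the shared-seed shuffling scheme) are illustrative assumptions, not the paper's exact mechanism.

```python
import random

def shuffle_partition(num_workers, group_size, seed):
    """Randomly partition worker ranks into non-overlapping groups.

    A shared seed lets every worker derive the same partition locally,
    without extra coordination (an illustrative choice for this sketch).
    """
    rng = random.Random(seed)
    ranks = list(range(num_workers))
    rng.shuffle(ranks)
    return [ranks[i:i + group_size] for i in range(0, num_workers, group_size)]

def group_average(params, groups):
    """Average each worker's parameter vector within its own group only."""
    new_params = {}
    for group in groups:
        dim = len(params[group[0]])
        avg = [sum(params[r][i] for r in group) / len(group) for i in range(dim)]
        for r in group:
            new_params[r] = list(avg)
    return new_params

# Toy run: 8 workers with scalar "models", synchronized in groups of 4
# that are re-shuffled each step. Although every individual step only
# communicates within a group, the values drift toward the global mean.
params = {r: [float(r)] for r in range(8)}
for step in range(20):
    groups = shuffle_partition(num_workers=8, group_size=4, seed=step)
    params = group_average(params, groups)
print({r: round(v[0], 3) for r, v in params.items()})  # all values near 3.5
```

In a real deployment, each group would run its own collective (e.g., an all-reduce over a process subgroup) so that no group's synchronization is gated by a bottlenecked link between groups, which is the communication-efficiency argument the abstract makes.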