Paper Title
Distributed SLIDE: Enabling Training Large Neural Networks on Low Bandwidth and Simple CPU-Clusters via Model Parallelism and Sparsity
Paper Authors
Paper Abstract
More than 70% of cloud computing is paid for but sits idle. A large fraction of this idle compute consists of cheap CPUs with only a few cores that go unused during off-peak hours. This paper aims to enable those CPU cycles to train heavyweight AI models. Our goal runs counter to mainstream frameworks, which focus on leveraging expensive, specialized, ultra-high-bandwidth interconnects to address the communication bottleneck in distributed neural network training. This paper presents a distributed model-parallel training framework that enables training large neural networks on small CPU clusters with low Internet bandwidth. We build upon the adaptive sparse training framework introduced by the SLIDE algorithm. By carefully deploying sparsity over distributed nodes, we demonstrate model-parallel training that is several orders of magnitude faster than Horovod, the main engine behind most commercial distributed-training software. We show that, with the reduced communication afforded by sparsity, we can train a model with close to a billion parameters on simple 4-16 core CPU nodes connected by basic low-bandwidth interconnect. Moreover, the training time is on par with some of the best hardware accelerators.
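The abstract describes the core idea only at a high level: hash-based adaptive sparsity selects a tiny set of active neurons per input, so a layer sharded across nodes computes and communicates only a sparse slice of its output. The toy sketch below (Python/NumPy, written for this summary and not taken from the authors' SLIDE implementation; ShardedSparseLayer and all parameter values are hypothetical) illustrates why this cuts communication volume: with a SimHash-style table over each weight shard, a node returns a handful of (neuron index, activation) pairs instead of its full dense output.

```python
import numpy as np


class ShardedSparseLayer:
    """One node's shard of a huge fully connected layer, with a SimHash
    table that retrieves only a few 'active' neurons per input.
    Illustrative sketch only, not the SLIDE implementation."""

    def __init__(self, in_dim, neurons_per_shard, num_hash_bits=10, seed=0):
        rng = np.random.default_rng(seed)
        # Weight shard held by this node: one column per neuron.
        self.W = rng.standard_normal((in_dim, neurons_per_shard)) * 0.01
        # Random hyperplanes shared by the hash of weights and inputs.
        self.planes = rng.standard_normal((num_hash_bits, in_dim))
        self.num_hash_bits = num_hash_bits
        self._rebuild_table()

    def _rebuild_table(self):
        # Bucket neurons by the sign pattern of their hyperplane projections.
        codes = (self.planes @ self.W > 0).astype(np.int64)       # (bits, neurons)
        keys = codes.T @ (1 << np.arange(self.num_hash_bits))     # one int key per neuron
        self.table = {}
        for neuron, key in enumerate(keys):
            self.table.setdefault(int(key), []).append(neuron)

    def active_forward(self, x):
        # Hash the input with the same hyperplanes and touch only the
        # neurons in its bucket (a real system like SLIDE uses several
        # tables to boost recall; one is enough for this argument).
        code = (self.planes @ x > 0).astype(np.int64)
        key = int(code @ (1 << np.arange(self.num_hash_bits)))
        active = self.table.get(key, [])
        # This sparse list of (neuron index, activation) pairs is all that
        # must cross the network to the owner of the next layer.
        return [(j, float(x @ self.W[:, j])) for j in active]


if __name__ == "__main__":
    in_dim, neurons_per_shard, num_nodes = 128, 25_000, 4
    shards = [ShardedSparseLayer(in_dim, neurons_per_shard, seed=s)
              for s in range(num_nodes)]
    x = np.random.default_rng(42).standard_normal(in_dim)

    sparse_out = [shard.active_forward(x) for shard in shards]
    dense_activations = num_nodes * neurons_per_shard
    sparse_activations = sum(len(out) for out in sparse_out)
    print("activations a dense model-parallel layer would ship:", dense_activations)
    print("activations the LSH-sparse layer actually ships    :", sparse_activations)
```

Under these assumed settings, each shard returns only a few dozen activations out of 25,000, which is the communication reduction the abstract attributes to sparsity; the paper's actual system additionally handles backpropagation, multiple hash tables per layer, and asynchronous updates, none of which are modeled here.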