Paper Title

HetSeq: Distributed GPU Training on Heterogeneous Infrastructure

Paper Authors

Yifan Ding, Nicholas Botzer, Tim Weninger

Paper Abstract

Modern deep learning systems like PyTorch and TensorFlow are able to train enormous models with billions (or trillions) of parameters on a distributed infrastructure. These systems require that the internal nodes have the same memory capacity and compute performance. Unfortunately, most organizations, especially universities, have a piecemeal approach to purchasing computer systems, resulting in a heterogeneous infrastructure that cannot be used to train large models. The present work describes HetSeq, a software package adapted from the popular PyTorch package that provides the capability to train large neural network models on heterogeneous infrastructure. Experiments with transformer translation and the BERT language model show that HetSeq scales over heterogeneous systems. HetSeq can be easily extended to other models such as image classification. The package, with supporting documentation, is publicly available at https://github.com/yifding/hetseq.
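The abstract does not show HetSeq's actual entry points (see the GitHub repository for those). The sketch below only illustrates the underlying idea in plain torch.distributed terms: nodes with unequal GPU counts can join a single data-parallel process group if each process's global rank is derived from a per-node device count. The helper names (init_heterogeneous, _worker) and the gpus_per_node list are hypothetical, not part of HetSeq's API.

```python
# Minimal sketch (not HetSeq's API): one data-parallel job spanning nodes
# with DIFFERENT numbers of GPUs, using plain torch.distributed primitives.
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def _worker(global_rank, local_rank, world_size):
    """One process per local GPU; joins the shared process group."""
    torch.cuda.set_device(local_rank)
    dist.init_process_group("nccl", rank=global_rank, world_size=world_size)
    # ... build the model, wrap it in DistributedDataParallel, train ...
    dist.destroy_process_group()


def init_heterogeneous(node_rank, gpus_per_node, master_addr, master_port=29500):
    """Launch workers on this node.

    gpus_per_node: list with one GPU count per node (assumption: gathered
    out of band, e.g. from a cluster config). Because global ranks are
    offset by the GPU counts of all preceding nodes, the nodes need not
    be identical -- the core requirement for heterogeneous training.
    """
    world_size = sum(gpus_per_node)
    rank_offset = sum(gpus_per_node[:node_rank])

    # Rendezvous info is inherited by the spawned child processes.
    os.environ["MASTER_ADDR"] = master_addr
    os.environ["MASTER_PORT"] = str(master_port)

    ctx = mp.get_context("spawn")
    procs = []
    for local_rank in range(gpus_per_node[node_rank]):
        p = ctx.Process(
            target=_worker,
            args=(rank_offset + local_rank, local_rank, world_size),
        )
        p.start()
        procs.append(p)
    for p in procs:
        p.join()


if __name__ == "__main__":
    # Example: node 0 of a cluster where node 0 has 4 GPUs and node 1 has 2.
    init_heterogeneous(node_rank=0, gpus_per_node=[4, 2], master_addr="10.0.0.1")
```

The same script would run on every node with only node_rank changed; each node contributes as many workers as it has GPUs, so mixed hardware can participate in one synchronized training run.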
