Paper Title

Training Large Neural Networks with Constant Memory using a New Execution Algorithm

Paper Authors

Bharadwaj Pudipeddi, Maral Mesmakhosroshahi, Jinwen Xi, Sujeeth Bharadwaj

Paper Abstract

Widely popular transformer-based NLP models such as BERT and Turing-NLG have enormous capacity, trending toward billions of parameters. Current execution methods demand brute-force resources such as HBM devices and high-speed interconnects for data parallelism. In this paper, we introduce a new relay-style execution technique called L2L (layer-to-layer), in which at any given moment the device memory is primarily populated only with the footprint of the executing layer(s). The model resides in DRAM attached to either a CPU or an FPGA, an entity we call the eager param-server (EPS). To overcome the bandwidth cost of shuttling parameters to and from the EPS, the model is executed one layer at a time across many micro-batches, instead of the conventional method of running minibatches over the whole model. L2L is implemented on 16GB V100 devices for BERT-Large, running it with a device batch size of up to 256. Our results show a 45% reduction in memory and a 40% increase in throughput compared to the state-of-the-art baseline. L2L is also able to fit models of up to 50 billion parameters on a machine with a single 16GB V100 and 512GB of CPU memory, without requiring any model partitioning. L2L scales to arbitrary depth, allowing researchers to develop on affordable devices, which is a big step toward democratizing AI. By running the optimizer in the host EPS, we show a new form of mixed precision for faster throughput and convergence. In addition, the EPS enables dynamic neural architecture approaches by varying layers across iterations. Finally, we also propose and demonstrate a constant-memory variation of L2L, along with future enhancements. This work was first performed on GPUs but also targets all high-TFLOPS/Watt accelerators.
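To make the relay-style idea concrete, below is a minimal PyTorch sketch of L2L's forward pass: the host keeps every layer in CPU DRAM (standing in for the EPS), and only one layer at a time is shuttled to the device, where it is applied to all micro-batches before being swapped back. The names `l2l_forward`, `layers`, and `micro_batches` are illustrative assumptions, not the authors' implementation, and the sketch omits the backward pass and the host-side optimizer that the full algorithm relays in the same layer-by-layer fashion.

```python
import torch
import torch.nn as nn

def l2l_forward(layers: nn.ModuleList, micro_batches, device):
    """Relay-style forward pass: the device holds one layer's parameters
    at a time, applied across all micro-batches (inference-only sketch)."""
    activations = [mb.to(device) for mb in micro_batches]
    for layer in layers:
        layer.to(device)        # shuttle the layer in from host DRAM (the "EPS")
        with torch.no_grad():   # sketch omits backward; full L2L relays it too
            activations = [layer(act) for act in activations]
        layer.to("cpu")         # return the layer to host memory
    return activations

if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # Toy stack of layers; a real run would use transformer blocks.
    layers = nn.ModuleList([nn.Linear(1024, 1024) for _ in range(8)])
    micro_batches = [torch.randn(32, 1024) for _ in range(4)]
    outputs = l2l_forward(layers, micro_batches, device)
    print(outputs[0].shape)  # torch.Size([32, 1024])
```

Because only one layer's weights reside on the device at any moment, device memory stays roughly constant with model depth; the trade-off is the per-layer host-device transfer, which L2L amortizes by reusing each resident layer across many micro-batches.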
