Paper Title
Shifted and Squeezed 8-bit Floating Point format for Low-Precision Training of Deep Neural Networks
Paper Authors
Paper Abstract
Training with a larger number of parameters while keeping iterations fast is an increasingly adopted strategy for developing better-performing Deep Neural Network (DNN) models. This increases the memory footprint and computational requirements of training. Here we introduce a novel methodology for training deep neural networks using 8-bit floating point (FP8) numbers. Reduced bit precision allows for a larger effective memory capacity and increased computational speed. We name this method Shifted and Squeezed FP8 (S2FP8). We show that, unlike previous 8-bit precision training methods, the proposed method works out of the box for representative models: ResNet-50, Transformer, and NCF. The method maintains model accuracy without requiring fine-tuning of loss scaling parameters or keeping certain layers in single precision. We introduce two learnable statistics of the DNN tensors, the shift and squeeze factors, which are used to optimally adjust the tensors' range to the 8-bit format, thus minimizing the information loss due to quantization.
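To make the idea concrete, the following NumPy sketch illustrates one way a shift factor (beta) and a squeeze factor (alpha) could be computed and applied in the log2-magnitude domain before an 8-bit rounding step. The target statistics (maximum log2-magnitude of 15, mean of 0), the `fake_fp8_round` helper, and all function names are illustrative assumptions for this sketch, not the paper's implementation.

```python
# Illustrative sketch (not the paper's exact algorithm): shift and squeeze a
# tensor's log2-magnitudes into an assumed FP8-friendly range before rounding.
import numpy as np

def shift_and_squeeze(x, target_max=15.0, target_mean=0.0, eps=1e-30):
    """Pick alpha (squeeze) and beta (shift) so the transformed tensor's
    log2-magnitudes hit the assumed target max and mean."""
    log_mag = np.log2(np.abs(x) + eps)
    m, mu = log_mag.max(), log_mag.mean()
    # Solve alpha*m + beta = target_max and alpha*mu + beta = target_mean.
    alpha = (target_max - target_mean) / max(m - mu, 1e-12)
    beta = target_max - alpha * m
    y = np.sign(x) * 2.0 ** (alpha * log_mag + beta)  # shifted & squeezed tensor
    return y, alpha, beta

def fake_fp8_round(y, man_bits=2, eps=1e-30):
    """Crude stand-in for FP8 rounding: keep `man_bits` mantissa bits."""
    exp = np.floor(np.log2(np.abs(y) + eps))
    scale = 2.0 ** (exp - man_bits)
    return np.round(y / scale) * scale

def undo_shift_and_squeeze(y, alpha, beta, eps=1e-30):
    """Map the low-precision values back to the original tensor's range."""
    log_mag = (np.log2(np.abs(y) + eps) - beta) / alpha
    return np.sign(y) * 2.0 ** log_mag

# Example: a tensor whose values span many orders of magnitude.
x = np.random.randn(1024) * np.logspace(-8, 2, 1024)
y, a, b = shift_and_squeeze(x)
x_hat = undo_shift_and_squeeze(fake_fp8_round(y), a, b)
```

Because the squeeze factor compresses the spread of exponents and the shift factor recenters them, a tensor whose values span many orders of magnitude can be represented within the limited dynamic range of an 8-bit float, rather than having its tails clipped or flushed to zero.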