Paper Title
I-ViT: Integer-only Quantization for Efficient Vision Transformer Inference
Paper Authors
Paper Abstract
Vision Transformers (ViTs) have achieved state-of-the-art performance on various computer vision applications. However, these models have considerable storage and computational overheads, making their deployment and efficient inference on edge devices challenging. Quantization is a promising approach to reducing model complexity, and the dyadic arithmetic pipeline can allow quantized models to perform efficient integer-only inference. Unfortunately, dyadic arithmetic is based on the homogeneity condition in convolutional neural networks, which is not applicable to the non-linear components in ViTs, making integer-only inference of ViTs an open issue. In this paper, we propose I-ViT, an integer-only quantization scheme for ViTs, to enable ViTs to perform the entire computational graph of inference with integer arithmetic and bit-shifting, without any floating-point arithmetic. In I-ViT, linear operations (e.g., MatMul and Dense) follow the integer-only pipeline with dyadic arithmetic, and non-linear operations (e.g., Softmax, GELU, and LayerNorm) are approximated by the proposed light-weight integer-only arithmetic methods. More specifically, I-ViT applies the proposed Shiftmax and ShiftGELU, which are designed to use integer bit-shifting to approximate the corresponding floating-point operations. We evaluate I-ViT on various benchmark models, and the results show that integer-only INT8 quantization achieves accuracy comparable to (or even slightly higher than) the full-precision (FP) baseline. Furthermore, we utilize TVM for practical hardware deployment on the GPU's integer arithmetic units, achieving a 3.72$\sim$4.11$\times$ inference speedup compared to the FP model. Code for both PyTorch and TVM is released at https://github.com/zkkli/I-ViT.
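The dyadic arithmetic pipeline mentioned in the abstract can be illustrated with a minimal sketch: a floating-point rescaling factor $m$ (e.g., the product of input and weight scales divided by the output scale) is approximated by a dyadic number $b/2^c$, so requantizing an integer accumulator needs only an integer multiply and a bit-shift. The helper names and example values below are hypothetical, not taken from the I-ViT codebase.

```python
# Minimal sketch of dyadic-arithmetic requantization (illustrative only;
# helper names and constants are hypothetical, not from the I-ViT code).

def to_dyadic(m: float, c: int = 15):
    """Approximate a real rescaling factor m (0 < m < 1) as b / 2**c."""
    b = round(m * (1 << c))
    return b, c

def requantize(acc: int, b: int, c: int) -> int:
    """Integer-only rescale: acc * m ≈ (acc * b) >> c, with rounding."""
    return (acc * b + (1 << (c - 1))) >> c

# Example: an INT32 accumulator rescaled by m ≈ 0.0123 (a made-up scale).
b, c = to_dyadic(0.0123)
y = requantize(100_000, b, c)  # ≈ 100000 * 0.0123 = 1230
```

Because `b` and `c` are fixed at quantization time, the entire rescaling step runs on integer ALUs, which is what lets the linear operations (MatMul, Dense) stay floating-point-free; the paper's contribution is extending this property to Softmax, GELU, and LayerNorm via Shiftmax and ShiftGELU.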