Paper Title

Efficient Execution of Quantized Deep Learning Models: A Compiler Approach

Paper Authors

Animesh Jain, Shoubhik Bhattacharya, Masahiro Masuda, Vin Sharma, Yida Wang

Paper Abstract

A growing number of applications implement predictive functions using deep learning models, which require heavy use of compute and memory. One popular technique for increasing resource efficiency is 8-bit integer quantization, in which 32-bit floating point numbers (fp32) are represented using shorter 8-bit integer numbers. Although deep learning frameworks such as TensorFlow, TFLite, MXNet, and PyTorch enable developers to quantize models with only a small drop in accuracy, they are not well suited to execute quantized models on a variety of hardware platforms. For example, TFLite is optimized to run inference on ARM CPU edge devices but does not have efficient support for Intel CPUs and Nvidia GPUs. In this paper, we address the challenges of executing quantized deep learning models on diverse hardware platforms by proposing an augmented compiler approach. A deep learning compiler such as Apache TVM can enable the efficient execution of models from various frameworks on various targets. Many deep learning compilers today, however, are designed primarily for fp32 computation and cannot optimize a pre-quantized INT8 model. To address this issue, we created a new dialect called Quantized Neural Network (QNN) that extends the compiler's internal representation with a quantization context. With this quantization context, the compiler can generate efficient code for pre-quantized models on various hardware platforms. As implemented in Apache TVM, we observe that the QNN-augmented deep learning compiler achieves speedups of 2.35x, 2.15x, 1.35x, and 1.40x on Intel Xeon Cascade Lake CPUs, Nvidia Tesla T4 GPUs, and ARM Raspberry Pi 3 and Pi 4, respectively, against well-optimized fp32 execution, and achieves performance comparable to state-of-the-art framework-specific solutions.
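For readers unfamiliar with the quantization scheme the abstract refers to, the sketch below illustrates the standard affine fp32-to-int8 mapping (q = round(x / scale) + zero_point) that pre-quantized models rely on. This is a minimal NumPy illustration, not the paper's QNN dialect or TVM API; the function names and the symmetric choice of scale and zero point are assumptions made here for clarity.

import numpy as np

def quantize_int8(x, scale, zero_point):
    # Affine quantization: q = round(x / scale) + zero_point, clipped to the int8 range.
    q = np.round(x / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize_int8(q, scale, zero_point):
    # Approximate recovery of fp32 values: x ~= (q - zero_point) * scale.
    return (q.astype(np.float32) - zero_point) * scale

# Illustrative usage: derive scale from the observed fp32 range (symmetric, zero_point = 0).
x = np.random.uniform(-3.0, 3.0, size=8).astype(np.float32)
scale = np.abs(x).max() / 127.0
q = quantize_int8(x, scale, zero_point=0)
x_hat = dequantize_int8(q, scale, zero_point=0)
print(np.abs(x - x_hat).max())  # quantization error stays within about half a quantization step

A compiler dialect such as QNN carries the (scale, zero_point) pair as a quantization context alongside the int8 tensors, which is what lets it lower pre-quantized operators to efficient integer code on each target.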
