Paper Title

Fast Implementation of 4-bit Convolutional Neural Networks for Mobile Devices

Paper Authors

Anton Trusov, Elena Limonova, Dmitry Slugin, Dmitry Nikolaev, Vladimir V. Arlazarov

Paper Abstract

Quantized low-precision neural networks are popular because they require fewer computational resources for inference while providing high performance, which is vital for real-time and embedded recognition systems. However, their advantages are most apparent on FPGA and ASIC devices, while general-purpose processor architectures cannot always perform low-bit integer computations efficiently. The most frequently used low-precision neural network model for mobile central processors is the 8-bit quantized network. In many cases, however, fewer bits can be used for weights and activations, and the only obstacle is the difficulty of an efficient implementation. We introduce an efficient implementation of 4-bit matrix multiplication for quantized neural networks and perform time measurements on a mobile ARM processor. It shows a 2.9x speedup over standard floating-point multiplication and is 1.5 times faster than its 8-bit quantized counterpart. We also demonstrate a 4-bit quantized neural network for OCR on the MIDV-500 dataset. The 4-bit quantization gives 95.0% accuracy and a 48% overall inference speedup, while the 8-bit quantized network gives 95.4% accuracy and a 39% speedup. These results show that 4-bit quantization suits mobile devices well, yielding sufficient accuracy and low inference time.
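The abstract only names the 4-bit matrix multiplication kernel; the scalar C sketch below is a minimal illustration of the underlying arithmetic, not the authors' ARM implementation. The packed layout (two unsigned 4-bit values per byte along the K dimension, with B stored transposed), the function names, and the omission of zero points and requantization are all assumptions made for brevity.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: pack two unsigned 4-bit values (0..15) into one
 * byte, low nibble first. The paper's actual storage layout may differ. */
static inline uint8_t pack2_u4(uint8_t lo, uint8_t hi)
{
    return (uint8_t)((lo & 0x0F) | ((hi & 0x0F) << 4));
}

/* Scalar reference for a 4-bit quantized GEMM: C = A * B, where A is MxK,
 * Bt is B stored transposed (NxK), and both hold unsigned 4-bit values
 * packed two per byte along K (K assumed even). Products are accumulated
 * in 32-bit integers; zero points and requantization are omitted. */
void gemm_u4_ref(const uint8_t *A, const uint8_t *Bt, int32_t *C,
                 size_t M, size_t N, size_t K)
{
    for (size_t i = 0; i < M; ++i) {
        for (size_t j = 0; j < N; ++j) {
            int32_t acc = 0;
            for (size_t k = 0; k < K; ++k) {
                uint8_t a_byte = A[i * (K / 2) + k / 2];
                uint8_t b_byte = Bt[j * (K / 2) + k / 2];
                /* Odd k reads the high nibble, even k the low nibble. */
                int32_t a = (k & 1) ? (a_byte >> 4) : (a_byte & 0x0F);
                int32_t b = (k & 1) ? (b_byte >> 4) : (b_byte & 0x0F);
                acc += a * b;
            }
            C[i * N + j] = acc;
        }
    }
}

In practice, the reported speedups come from SIMD kernels (e.g., ARM NEON) that unpack nibbles in parallel and accumulate in wider registers; this reference loop deliberately shows only the arithmetic that such kernels vectorize.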
