Paper Title

Efficient Neural Network Deployment for Microcontroller

Authors

Unlu, Hasan

Abstract

Edge computing for neural networks is becoming increasingly important, especially for low-power applications and offline devices. TensorFlow Lite and PyTorch Mobile were released for this purpose, but they mainly target mobile devices rather than microcontrollers; microcontroller support is still an emerging area. There are many approaches to reducing network size and compute load, such as pruning, binarization, and layer manipulation, i.e., operator reordering. This paper explores and generalizes convolutional neural network deployment for microcontrollers with two novel optimization proposals offering memory savings and compute efficiency in 2D convolutions as well as fully connected layers. The first is in-place max-pooling, applicable when the stride is greater than or equal to the pooling kernel size. The second is the use of ping-pong buffers between layers to reduce memory consumption significantly. The memory savings and performance are compared with the CMSIS-NN framework developed for ARM Cortex-M CPUs. The final goal is to develop a tool that consumes a PyTorch model with trained network weights and turns it into an optimized inference engine (forward pass) in C/C++ for microcontrollers with low memory (kilobyte level) and limited computing capability.
