Paper Title

Algorithm-hardware Co-design for Deformable Convolution

Authors

Qijing Huang, Dequan Wang, Yizhao Gao, Yaohui Cai, Zhen Dong, Bichen Wu, Kurt Keutzer, John Wawrzynek

Abstract

FPGAs provide a flexible and efficient platform to accelerate rapidly-changing algorithms for computer vision. The majority of existing work focuses on accelerating image classification, while other fundamental vision problems, including object detection and instance segmentation, have not been adequately addressed. Compared with image classification, detection problems are more sensitive to the spatial variance of objects, and therefore, require specialized convolutions to aggregate spatial information. To address this, recent work proposes dynamic deformable convolution to augment regular convolutions. Regular convolutions process a fixed grid of pixels across all the spatial locations in an image, while dynamic deformable convolutions may access arbitrary pixels in the image and the access pattern is input-dependent and varies per spatial location. These properties lead to inefficient memory accesses of inputs with existing hardware. In this work, we first investigate the overhead of the deformable convolution on embedded FPGA SoCs, and then show the accuracy-latency tradeoffs for a set of algorithm modifications including full versus depthwise, fixed-shape, and limited-range. These modifications benefit the energy efficiency for embedded devices in general as they reduce the compute complexity. We then build an efficient object detection network with modified deformable convolutions and quantize the network using state-of-the-art quantization methods. We implement a unified hardware engine on FPGA to support all the operations in the network. Preliminary experiments show that little accuracy is compromised and speedup can be achieved with our co-design optimization for the deformable convolution.
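To make the contrast between a fixed sampling grid and input-dependent sampling concrete, here is a minimal single-channel sketch in NumPy. It is not the paper's implementation: the function names, the `(H, W, 9, 2)` offset layout, zero padding, and bilinear interpolation are all assumptions chosen for illustration. The optional `max_offset` clipping is only a rough stand-in for the "limited-range" idea named in the abstract, whose exact formulation is not given here.

```python
import numpy as np

def bilinear(img, y, x):
    """Sample a 2-D array at fractional (y, x); out-of-bounds reads count as 0."""
    h, w = img.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    val = 0.0
    # Blend the four integer neighbours around (y, x).
    for yy, wy in ((y0, 1 - (y - y0)), (y0 + 1, y - y0)):
        for xx, wx in ((x0, 1 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yy < h and 0 <= xx < w:
                val += wy * wx * img[yy, xx]
    return val

def deformable_conv3x3(img, weight, offsets, max_offset=None):
    """Single-channel 3x3 deformable convolution (illustrative sketch).

    img:        (H, W) input
    weight:     (9,) kernel weights in row-major tap order
    offsets:    (H, W, 9, 2) per-location (dy, dx) added to each kernel tap
    max_offset: hypothetical bound; offsets are clipped to [-max_offset, max_offset]
    """
    h, w = img.shape
    if max_offset is not None:
        offsets = np.clip(offsets, -max_offset, max_offset)
    taps = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for k, (dy, dx) in enumerate(taps):
                oy, ox = offsets[i, j, k]
                # A regular convolution reads (i + dy, j + dx); the offsets move each read.
                acc += weight[k] * bilinear(img, i + dy + oy, j + dx + ox)
            out[i, j] = acc
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
weight = np.ones(9)
zero_offsets = np.zeros((4, 4, 9, 2))
shifted_offsets = np.zeros((4, 4, 9, 2))
shifted_offsets[..., 1] = 0.5  # shift every tap half a pixel to the right

regular = deformable_conv3x3(img, weight, zero_offsets)
shifted = deformable_conv3x3(img, weight, shifted_offsets)
clipped = deformable_conv3x3(img, weight, shifted_offsets, max_offset=0)
```

With all offsets at zero the function reduces to an ordinary zero-padded 3x3 convolution. Because `offsets` differs per output location and per tap, the set of pixels read cannot be known ahead of time, which is the irregular memory access pattern the abstract describes. Bounding the offsets limits how far from the regular grid any read can land.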
