Paper Title

Glance and Focus Networks for Dynamic Visual Recognition

Paper Authors

Gao Huang, Yulin Wang, Kangchen Lv, Haojun Jiang, Wenhui Huang, Pengfei Qi, Shiji Song

Paper Abstract

Spatial redundancy widely exists in visual recognition tasks, i.e., discriminative features in an image or video frame usually correspond to only a subset of pixels, while the remaining regions are irrelevant to the task at hand. Therefore, static models which process all the pixels with an equal amount of computation result in considerable redundancy in terms of time and space consumption. In this paper, we formulate the image recognition problem as a sequential coarse-to-fine feature learning process, mimicking the human visual system. Specifically, the proposed Glance and Focus Network (GFNet) first extracts a quick global representation of the input image at a low resolution scale, and then strategically attends to a series of salient (small) regions to learn finer features. The sequential process naturally facilitates adaptive inference at test time, as it can be terminated once the model is sufficiently confident about its prediction, avoiding further redundant computation. It is worth noting that the problem of locating discriminative regions in our model is formulated as a reinforcement learning task, thus requiring no additional manual annotations other than classification labels. GFNet is general and flexible as it is compatible with any off-the-shelf backbone model (such as MobileNets, EfficientNets and TSM), which can be conveniently deployed as the feature extractor. Extensive experiments on a variety of image classification and video recognition tasks and with various backbone models demonstrate the remarkable efficiency of our method. For example, it reduces the average latency of the highly efficient MobileNet-V3 on an iPhone XS Max by 1.3x without sacrificing accuracy. Code and pre-trained models are available at https://github.com/blackfeather-wang/GFNet-Pytorch.
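
To make the glance-then-focus procedure concrete, below is a minimal PyTorch sketch of the adaptive inference loop the abstract describes. It is an illustration under assumptions, not the GFNet-Pytorch implementation: `global_encoder`, `local_encoder`, `policy`, `classifier`, and `crop_patch` are hypothetical stand-ins, feature aggregation is reduced to simple addition (the paper aggregates features recurrently), and the patch-proposal policy, which the paper trains with reinforcement learning, is treated as a given callable.

```python
import torch
import torch.nn.functional as F

def glance_and_focus_inference(image, global_encoder, local_encoder,
                               policy, classifier,
                               crop_size=96, max_steps=5, threshold=0.9):
    """Sequential coarse-to-fine inference with confidence-based early exit.

    Step 0 ("glance"): classify a low-resolution view of the whole image.
    Steps 1..max_steps ("focus"): crop a small region proposed by the
    policy, refine the features, and re-classify; stop as soon as the
    softmax confidence exceeds `threshold`.
    """
    # Glance: a cheap global representation at low resolution.
    glance = F.interpolate(image, size=(crop_size, crop_size),
                           mode='bilinear', align_corners=False)
    feature = global_encoder(glance)            # assumed shape (B, D)
    logits = classifier(feature)

    for _ in range(max_steps):
        # Simplified batch-level exit; the paper exits per sample.
        if F.softmax(logits, dim=-1).max().item() >= threshold:
            break
        # Focus: the policy proposes the top-left corner of the next patch
        # in normalized [0, 1] coordinates (learned with RL in the paper;
        # treated here as a given callable).
        xy = policy(feature)                    # assumed shape (B, 2)
        patch = crop_patch(image, xy, crop_size)
        feature = feature + local_encoder(patch)  # crude aggregation;
        logits = classifier(feature)              # the paper uses an RNN
    return logits

def crop_patch(image, xy, size):
    """Crop a (size x size) patch at normalized top-left coords `xy`."""
    b, _, h, w = image.shape
    patches = []
    for i in range(b):
        x0 = int(xy[i, 0].item() * (w - size))
        y0 = int(xy[i, 1].item() * (h - size))
        patches.append(image[i:i + 1, :, y0:y0 + size, x0:x0 + size])
    return torch.cat(patches, dim=0)
```

The `threshold` parameter makes the accuracy-latency trade-off explicit: lowering it lets more inputs terminate at the cheap glance stage, while raising it forces more focus steps on small high-resolution patches, which is how the method saves computation relative to processing every pixel at full resolution.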
