Paper Title
Inception Transformer
Paper Authors
Paper Abstract
Recent studies show that Transformers have a strong capability for building long-range dependencies, yet are incompetent at capturing high frequencies, which predominantly convey local information. To tackle this issue, we present a novel and general-purpose Inception Transformer, or iFormer for short, that effectively learns comprehensive features with both high- and low-frequency information in visual data. Specifically, we design an Inception mixer to explicitly graft the advantages of convolution and max-pooling for capturing high-frequency information onto Transformers. Unlike recent hybrid frameworks, the Inception mixer gains efficiency through a channel-splitting mechanism that adopts parallel convolution/max-pooling paths and a self-attention path as high- and low-frequency mixers, while retaining the flexibility to model discriminative information scattered across a wide frequency range. Considering that bottom layers play a larger role in capturing high-frequency details while top layers contribute more to modeling low-frequency global information, we further introduce a frequency ramp structure, i.e., gradually decreasing the dimensions fed to the high-frequency mixer and increasing those fed to the low-frequency mixer, which effectively trades off high- and low-frequency components across layers. We benchmark iFormer on a series of vision tasks and show that it achieves impressive performance on image classification, COCO detection, and ADE20K segmentation. For example, our iFormer-S hits a top-1 accuracy of 83.4% on ImageNet-1K, 3.6% higher than DeiT-S, and even slightly better than the much bigger Swin-B (83.3%) with only 1/4 of the parameters and 1/3 of the FLOPs. Code and models will be released at https://github.com/sail-sg/iFormer.
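To make the channel-splitting idea concrete, below is a minimal PyTorch sketch of an Inception-style mixer as the abstract describes it: channels are split between a high-frequency branch (max-pooling and depthwise convolution paths) and a low-frequency branch (self-attention), then fused. This is an illustrative approximation only, not the authors' implementation; the class name `InceptionMixer` and the `high_ratio` parameter are hypothetical, and details such as the attention branch's internal downsampling in the real model are omitted. See https://github.com/sail-sg/iFormer for the official code.

```python
# Minimal sketch of an Inception-style mixer (illustrative, not the official API).
import torch
import torch.nn as nn


class InceptionMixer(nn.Module):
    """Split channels into a high-frequency branch (max-pool + conv paths)
    and a low-frequency branch (self-attention), then fuse the outputs."""

    def __init__(self, dim, num_heads=4, high_ratio=0.5):
        super().__init__()
        self.dim_high = int(dim * high_ratio)  # channels for the high-freq mixer
        self.dim_low = dim - self.dim_high     # channels for self-attention
        half = self.dim_high // 2
        # High-frequency path 1: max-pooling branch (sharp local details).
        self.pool_branch = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(half, half, kernel_size=1),
        )
        # High-frequency path 2: depthwise-convolution branch (local texture).
        ch = self.dim_high - half
        self.conv_branch = nn.Sequential(
            nn.Conv2d(ch, ch, kernel_size=1),
            nn.Conv2d(ch, ch, kernel_size=3, padding=1, groups=ch),
        )
        # Low-frequency path: standard multi-head self-attention.
        self.attn = nn.MultiheadAttention(self.dim_low, num_heads, batch_first=True)
        # 1x1 conv fuses the concatenated high- and low-frequency outputs.
        self.fuse = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x):  # x: (B, C, H, W)
        B, C, H, W = x.shape
        x_high, x_low = torch.split(x, [self.dim_high, self.dim_low], dim=1)
        h1, h2 = torch.split(
            x_high, [self.dim_high // 2, self.dim_high - self.dim_high // 2], dim=1
        )
        h1 = self.pool_branch(h1)
        h2 = self.conv_branch(h2)
        # Flatten spatial dims for attention, then restore the feature map.
        low = x_low.flatten(2).transpose(1, 2)          # (B, H*W, dim_low)
        low, _ = self.attn(low, low, low)
        low = low.transpose(1, 2).reshape(B, self.dim_low, H, W)
        return self.fuse(torch.cat([h1, h2, low], dim=1))


# Quick shape check:
mixer = InceptionMixer(dim=64, num_heads=4, high_ratio=0.5)
out = mixer(torch.randn(2, 64, 14, 14))  # -> (2, 64, 14, 14)
```

The frequency ramp structure would then amount to scheduling `high_ratio` across stages, e.g. a decreasing sequence such as 0.75, 0.5, 0.25 from bottom to top blocks, so early layers devote more channels to the convolution/max-pooling paths and later layers more to self-attention; the specific values here are illustrative, not the paper's configuration.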