Paper Title
Inception Transformer
Paper Authors
Paper Abstract
Recent studies show that Transformers have a strong capability for building long-range dependencies, yet are incompetent at capturing high frequencies, which predominantly convey local information. To tackle this issue, we present a novel and general-purpose Inception Transformer, or iFormer for short, that effectively learns comprehensive features with both high- and low-frequency information in visual data. Specifically, we design an Inception mixer to explicitly graft the advantages of convolution and max-pooling for capturing high-frequency information onto Transformers. Unlike recent hybrid frameworks, the Inception mixer gains efficiency through a channel-splitting mechanism that adopts parallel convolution/max-pooling paths and a self-attention path as high- and low-frequency mixers, while retaining the flexibility to model discriminative information scattered across a wide frequency range. Considering that bottom layers play a larger role in capturing high-frequency details while top layers contribute more to modeling low-frequency global information, we further introduce a frequency ramp structure, i.e., gradually decreasing the dimensions fed to the high-frequency mixer and increasing those fed to the low-frequency mixer, which effectively trades off high- and low-frequency components across layers. We benchmark iFormer on a series of vision tasks and show that it achieves impressive performance on image classification, COCO detection, and ADE20K segmentation. For example, our iFormer-S hits a top-1 accuracy of 83.4% on ImageNet-1K, 3.6% higher than DeiT-S, and even slightly better than the much bigger Swin-B (83.3%) with only 1/4 of the parameters and 1/3 of the FLOPs. Code and models will be released at https://github.com/sail-sg/iFormer.
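To make the channel-splitting idea concrete, below is a minimal PyTorch sketch of an Inception-style mixer as the abstract describes it: channels are split between a high-frequency branch (max-pooling and depthwise convolution paths) and a low-frequency branch (self-attention), then fused. This is an illustrative approximation only, not the authors' implementation; the class name `InceptionMixer` and the `high_ratio` parameter are hypothetical, and details such as the attention branch's internal downsampling in the real model are omitted. See https://github.com/sail-sg/iFormer for the official code.

```python
# Minimal sketch of an Inception-style mixer (illustrative, not the official API).
import torch
import torch.nn as nn


class InceptionMixer(nn.Module):
    """Split channels into a high-frequency branch (max-pool + conv paths)
    and a low-frequency branch (self-attention), then fuse the outputs."""

    def __init__(self, dim, num_heads=4, high_ratio=0.5):
        super().__init__()
        self.dim_high = int(dim * high_ratio)  # channels for the high-freq mixer
        self.dim_low = dim - self.dim_high     # channels for self-attention
        half = self.dim_high // 2
        # High-frequency path 1: max-pooling branch (sharp local details).
        self.pool_branch = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(half, half, kernel_size=1),
        )
        # High-frequency path 2: depthwise-convolution branch (local texture).
        ch = self.dim_high - half
        self.conv_branch = nn.Sequential(
            nn.Conv2d(ch, ch, kernel_size=1),
            nn.Conv2d(ch, ch, kernel_size=3, padding=1, groups=ch),
        )
        # Low-frequency path: standard multi-head self-attention.
        self.attn = nn.MultiheadAttention(self.dim_low, num_heads, batch_first=True)
        # 1x1 conv fuses the concatenated high- and low-frequency outputs.
        self.fuse = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x):  # x: (B, C, H, W)
        B, C, H, W = x.shape
        x_high, x_low = torch.split(x, [self.dim_high, self.dim_low], dim=1)
        h1, h2 = torch.split(
            x_high, [self.dim_high // 2, self.dim_high - self.dim_high // 2], dim=1
        )
        h1 = self.pool_branch(h1)
        h2 = self.conv_branch(h2)
        # Flatten spatial dims for attention, then restore the feature map.
        low = x_low.flatten(2).transpose(1, 2)          # (B, H*W, dim_low)
        low, _ = self.attn(low, low, low)
        low = low.transpose(1, 2).reshape(B, self.dim_low, H, W)
        return self.fuse(torch.cat([h1, h2, low], dim=1))


# Quick shape check:
mixer = InceptionMixer(dim=64, num_heads=4, high_ratio=0.5)
out = mixer(torch.randn(2, 64, 14, 14))  # -> (2, 64, 14, 14)
```

The frequency ramp structure would then amount to scheduling `high_ratio` across stages, e.g. a decreasing sequence such as 0.75, 0.5, 0.25 from bottom to top blocks, so early layers devote more channels to the convolution/max-pooling paths and later layers more to self-attention; the specific values here are illustrative, not the paper's configuration.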