Paper Title
Pattern Attention Transformer with Doughnut Kernel
Paper Authors
Paper Abstract
We present in this paper a new architecture, the Pattern Attention Transformer (PAT), built on the new doughnut kernel. Compared with tokens in NLP, Transformers in computer vision face the problem of handling the high resolution of pixels in images. In ViT, an image is cut into square patches. As a follow-up to ViT, Swin Transformer proposes an additional shifting step to reduce the effect of fixed boundaries, which also makes 'two connected Swin Transformer blocks' the minimum unit of the model. Inheriting the patch/window idea, our doughnut kernel further refines the design of patches. It replaces the line-cut boundaries with two types of areas, sensor and updating, based on an interpretation of self-attention that we name the QKVA grid. The doughnut kernel also raises a new topic: kernel shapes beyond the square. To verify its performance on image classification, PAT is designed with Transformer blocks of regular-octagon-shaped doughnut kernels. Its architecture is lighter: only one pattern attention layer is needed per stage. Under similar computational complexity, its performance on ImageNet 1K reaches higher throughput (+10%) and surpasses Swin Transformer (+0.8 acc1).
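The abstract only sketches the doughnut-kernel idea, so the snippet below is a minimal illustrative sketch, not the authors' implementation: it assumes a square inner "updating" area (queries) surrounded by a larger square "sensor" area (keys/values), whereas the paper uses regular-octagon-shaped kernels, and the names `doughnut_attention`, `r_update`, and `r_sensor` are hypothetical. Projection weights are random stand-ins for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def doughnut_attention(feat, center, r_update=1, r_sensor=3):
    """One attention step for a single (square-approximated) doughnut kernel.

    feat:      (H, W, C) feature map
    center:    (y, x) position where the kernel is placed
    r_update:  half-size of the inner updating area (provides queries, receives outputs)
    r_sensor:  half-size of the outer sensor area (provides keys and values)
    """
    H, W, C = feat.shape
    y, x = center

    def crop(r):
        ys = slice(max(y - r, 0), min(y + r + 1, H))
        xs = slice(max(x - r, 0), min(x + r + 1, W))
        return feat[ys, xs].reshape(-1, C)

    q_tokens = crop(r_update)    # inner area: tokens to be updated
    kv_tokens = crop(r_sensor)   # outer area: context the kernel "senses"

    # Hypothetical projections; a real block would learn these weights.
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))

    attn = softmax((q_tokens @ Wq) @ (kv_tokens @ Wk).T / np.sqrt(C))
    return attn @ (kv_tokens @ Wv)   # updated features for the inner area only

# Toy usage: 16x16 map with 8 channels, kernel centred at (8, 8).
out = doughnut_attention(np.random.rand(16, 16, 8).astype(np.float32), (8, 8))
print(out.shape)  # (9, 8): the 3x3 updating area, 8 channels each
```

The point of the sketch is the asymmetry the abstract describes: only the inner updating area is rewritten, while attention draws context from the larger sensor area, so neighbouring kernels can overlap in their sensor regions without the fixed line-cut boundaries of plain window partitioning.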