Paper Title
Lightweight Structure-Aware Attention for Visual Understanding
Paper Authors
Paper Abstract
The attention operator has been widely used as a basic building block in visual understanding since it provides flexibility through its adjustable kernels. However, this operator suffers from inherent limitations: (1) the attention kernel is not discriminative enough, resulting in high redundancy, and (2) its computation and memory complexity is quadratic in the sequence length. In this paper, we propose a novel attention operator, called Lightweight Structure-aware Attention (LiSA), which offers better representation power with log-linear complexity. Our operator transforms the attention kernels to be more discriminative by learning structural patterns. These structural patterns are encoded by exploiting a set of relative position embeddings (RPEs) as multiplicative weights, thereby improving the representation power of the attention kernels. Additionally, the RPEs are approximated to obtain log-linear complexity. Our experiments and analyses demonstrate that the proposed operator outperforms self-attention and other existing operators, achieving state-of-the-art results on ImageNet-1K and other downstream tasks such as video action recognition on Kinetics-400, object detection & instance segmentation on COCO, and semantic segmentation on ADE-20K.
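To make the multiplicative-RPE idea concrete, below is a minimal PyTorch sketch of multi-head self-attention whose kernel is reweighted by learned relative position embeddings used as multiplicative weights, as the abstract describes. This is an illustrative reconstruction from the abstract alone, not the authors' implementation: the class name StructureAwareAttention and all shapes are hypothetical, and the log-linear RPE approximation that gives LiSA its complexity advantage is omitted, so this version remains quadratic in the sequence length.

```python
import torch
import torch.nn as nn

class StructureAwareAttention(nn.Module):
    """Illustrative sketch (not the authors' code): multi-head self-attention
    whose kernel is reweighted multiplicatively by learned relative position
    embeddings (RPEs). The log-linear RPE approximation from the paper is
    omitted, so this version is still quadratic in sequence length."""

    def __init__(self, dim: int, num_heads: int, max_len: int):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.max_len = max_len
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # One multiplicative weight per head and per relative offset,
        # initialized to 1 so training starts from plain self-attention.
        self.rpe = nn.Parameter(torch.ones(num_heads, 2 * max_len - 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape  # requires N <= max_len
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape each to (B, heads, N, head_dim).
        q = q.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        # Standard scaled dot-product attention logits: (B, heads, N, N).
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        # Look up the RPE weight for every (query, key) pair by its
        # relative offset, shifted into the index range [0, 2*max_len - 2].
        idx = torch.arange(N, device=x.device)
        rel = idx[None, :] - idx[:, None] + self.max_len - 1
        attn = attn * self.rpe[:, rel]  # multiplicative structural reweighting
        out = attn.softmax(dim=-1) @ v
        out = out.transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

# Usage (hypothetical shapes): a 14x14 patch grid flattened to 196 tokens.
layer = StructureAwareAttention(dim=64, num_heads=4, max_len=196)
out = layer(torch.randn(2, 196, 64))  # -> (2, 196, 64)
```

Note the design choice this sketch highlights: unlike the additive relative-position bias common in vision transformers, the RPE here scales the attention logits, which matches the abstract's description of structural patterns acting on the kernel as multiplicative weights.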