Parameterization of Cross-Token Relations with Relative Positional Encoding for Vision MLP

Paper title

Parameterization of Cross-Token Relations with Relative Positional Encoding for Vision MLP

Authors

Zhicai Wang, Yanbin Hao, Xingyu Gao, Hao Zhang, Shuo Wang, Tingting Mu, Xiangnan He

Abstract

Vision multi-layer perceptrons (MLPs) have shown promising performance in computer vision tasks and have become a main competitor of CNNs and vision Transformers. They use token-mixing layers to capture cross-token interactions, as opposed to the multi-head self-attention mechanism used by Transformers. However, the heavily parameterized token-mixing layers naturally lack mechanisms to capture local information and multi-granular non-local relations, which restrains their discriminative power. To tackle this issue, we propose a new positional spatial gating unit (PoSGU). It exploits the attention formulations used in classical relative positional encoding (RPE) to efficiently encode the cross-token relations for token mixing. It reduces the current quadratic parameter complexity $O(N^2)$ of vision MLPs to $O(N)$ and $O(1)$. We experiment with two RPE mechanisms and further propose a group-wise extension that improves their expressive power by capturing multi-granular contexts. These then serve as the key building blocks of a new type of vision MLP, referred to as PosMLP. We evaluate the effectiveness of the proposed approach through thorough experiments, demonstrating improved or comparable performance with reduced parameter complexity. For instance, for a model trained on ImageNet1K, we achieve a performance improvement from 72.14\% to 74.02\% and a reduction in learnable parameters from $19.4M$ to $18.2M$. Code is available at https://github.com/Zhicaiwww/PosMLP.
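
To make the parameter-complexity claim concrete, the following is a minimal NumPy sketch of how a shared relative-position table can stand in for a dense $N \times N$ token-mixing matrix. It is not the authors' PoSGU implementation: the function names, the 14 x 14 grid, the channel width of 8, and the softmax row normalization are all assumptions chosen for illustration. It only shows the general idea that a table indexed by relative offsets needs $(2h-1)(2w-1)$ parameters, which grows linearly in $N = hw$, rather than $N^2$.

```python
import numpy as np


def relative_position_index(h, w):
    """Map every (query, key) token pair on an h x w grid to an index
    into a flat table of (2h-1)*(2w-1) relative offsets."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1)  # (N, 2) grid coordinates
    rel = coords[:, None, :] - coords[None, :, :]        # (N, N, 2) signed offsets
    rel[..., 0] += h - 1                                 # shift row offsets to >= 0
    rel[..., 1] += w - 1                                 # shift column offsets to >= 0
    return rel[..., 0] * (2 * w - 1) + rel[..., 1]       # (N, N) flat table indices


def token_mixing_matrix(table, index):
    """Gather an N x N mixing matrix from the shared offset table.
    Row-wise softmax is an assumption of this sketch, not taken from the paper."""
    logits = table[index]
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)


# Hypothetical setup: a 14 x 14 token grid with 8 channels per token.
h, w = 14, 14
n_tokens = h * w
rng = np.random.default_rng(0)

table = rng.normal(size=(2 * h - 1) * (2 * w - 1))  # stands in for learnable weights
index = relative_position_index(h, w)
mix = token_mixing_matrix(table, index)

x = rng.normal(size=(n_tokens, 8))  # token features
y = mix @ x                         # token mixing across spatial positions

dense_params = n_tokens * n_tokens  # parameters of a dense token-mixing layer
rpe_params = table.size             # parameters of the relative-position table
print(dense_params, rpe_params)
```

Because the table is indexed by offset rather than by absolute position, any two token pairs with the same relative displacement share one weight, which is what removes the quadratic parameter growth in this sketch.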
