Paper Title
Your Transformer May Not be as Powerful as You Expect
Paper Authors
Paper Abstract
Relative Positional Encoding (RPE), which encodes the relative distance between any pair of tokens, is one of the most successful modifications to the original Transformer. As far as we know, the theoretical understanding of RPE-based Transformers is largely unexplored. In this work, we mathematically analyze the power of RPE-based Transformers regarding whether the model is capable of approximating any continuous sequence-to-sequence function. One may naturally assume the answer is in the affirmative -- RPE-based Transformers are universal function approximators. However, we present a negative result by showing that there exist continuous sequence-to-sequence functions that RPE-based Transformers cannot approximate no matter how deep and wide the neural network is. One key reason is that most RPEs are placed inside the softmax attention, which always generates a right stochastic matrix. This restricts the network from capturing positional information in the RPEs and limits its capacity. To overcome this problem and make the model more powerful, we first present sufficient conditions for RPE-based Transformers to achieve universal function approximation. With this theoretical guidance, we develop a novel attention module, called Universal RPE-based (URPE) Attention, which satisfies the conditions. Therefore, the corresponding URPE-based Transformers become universal function approximators. Extensive experiments covering typical architectures and tasks demonstrate that our model is parameter-efficient and can achieve performance superior to strong baselines in a wide range of applications. The code will be made publicly available at https://github.com/lsj2408/URPE.
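The abstract describes the mechanism only at a high level, so the following is a minimal, hypothetical sketch (not the released implementation at https://github.com/lsj2408/URPE) of the limitation and the kind of fix the abstract points to: a standard additive RPE bias sits inside the softmax, whose output rows always sum to 1 (a right stochastic matrix), and a learnable relative-position gate is then applied multiplicatively outside the softmax so that positional information is no longer confined to the normalized attention weights. The class and parameter names (`URPEStyleAttention`, `rel_bias`, `rel_gate`) are illustrative assumptions, not identifiers from the paper or repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class URPEStyleAttention(nn.Module):
    """Illustrative single-head attention with a relative-position gate.

    This is a sketch under assumptions, not the official URPE module:
    it keeps the usual additive RPE bias inside the softmax (the scheme
    whose limitation the paper analyzes) and then rescales the resulting
    row-stochastic matrix with a learnable function of relative position.
    """

    def __init__(self, d_model: int, max_len: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # One scalar per relative distance in [-(max_len - 1), max_len - 1];
        # gathering them per (i, j) yields a Toeplitz-structured matrix.
        self.rel_bias = nn.Parameter(torch.zeros(2 * max_len - 1))  # inside softmax
        self.rel_gate = nn.Parameter(torch.ones(2 * max_len - 1))   # outside softmax
        self.max_len = max_len
        self.scale = d_model ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, n, _ = x.shape
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)

        # Relative-distance index matrix: entry (i, j) encodes i - j, shifted to be non-negative.
        idx = torch.arange(n, device=x.device)
        rel = idx[:, None] - idx[None, :] + self.max_len - 1  # (n, n)

        scores = q @ k.transpose(-2, -1) * self.scale + self.rel_bias[rel]
        attn = F.softmax(scores, dim=-1)   # each row sums to 1: a right stochastic matrix
        attn = attn * self.rel_gate[rel]   # multiplicative gate lifts that constraint
        return attn @ v


if __name__ == "__main__":
    torch.manual_seed(0)
    layer = URPEStyleAttention(d_model=16, max_len=8)
    x = torch.randn(2, 8, 16)
    print(layer(x).shape)  # torch.Size([2, 8, 16])
```

The point of the sketch is the last two lines of `forward`: with only the softmax output, every row is a probability distribution regardless of the positional bias, which is the capacity restriction the abstract describes; applying a position-dependent multiplier after the softmax is one way such a restriction can be removed.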