Paper Title

Rethinking Positional Encoding in Language Pre-training

Paper Authors

Guolin Ke, Di He, Tie-Yan Liu

Paper Abstract

In this work, we investigate the positional encoding methods used in language pre-training (e.g., BERT) and identify several problems in the existing formulations. First, we show that in the absolute positional encoding, the addition operation applied on positional embeddings and word embeddings brings mixed correlations between the two heterogeneous information resources. It may bring unnecessary randomness in the attention and further limit the expressiveness of the model. Second, we question whether treating the position of the symbol \texttt{[CLS]} the same as other words is a reasonable design, considering its special role (the representation of the entire sentence) in the downstream tasks. Motivated by the above analysis, we propose a new positional encoding method called \textbf{T}ransformer with \textbf{U}ntied \textbf{P}ositional \textbf{E}ncoding (TUPE). In the self-attention module, TUPE computes the word contextual correlation and positional correlation separately with different parameterizations and then adds them together. This design removes the mixed and noisy correlations over heterogeneous embeddings and offers more expressiveness by using different projection matrices. Furthermore, TUPE unties the \texttt{[CLS]} symbol from other positions, making it easier to capture information from all positions. Extensive experiments and ablation studies on the GLUE benchmark demonstrate the effectiveness of the proposed method. Code and models are released at https://github.com/guolinke/TUPE.
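To make the abstract's description concrete, below is a minimal PyTorch sketch of the untied attention scoring it describes: word-word and position-position correlations are computed with separate projection matrices and then summed, and the positional correlations involving the \texttt{[CLS]} position are replaced by content-free learnable values. This is not the authors' released implementation; the class and parameter names (e.g., `UntiedAttentionScores`, `theta_cls_to_others`), the single-head simplification, and the assumption that position 0 holds \texttt{[CLS]} are all illustrative assumptions.

```python
# A minimal sketch (assumptions noted above) of untied positional encoding
# in self-attention: contextual and positional correlations use different
# parameterizations and are added together, rather than adding positional
# embeddings to word embeddings before a single Q/K projection.
import math
import torch
import torch.nn as nn


class UntiedAttentionScores(nn.Module):
    def __init__(self, d_model: int, max_len: int):
        super().__init__()
        # Separate projections for word (contextual) and positional correlations.
        self.word_q = nn.Linear(d_model, d_model)
        self.word_k = nn.Linear(d_model, d_model)
        self.pos_q = nn.Linear(d_model, d_model)
        self.pos_k = nn.Linear(d_model, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        # Learnable scalars used to untie the [CLS] position (assumed names).
        self.theta_cls_to_others = nn.Parameter(torch.zeros(1))
        self.theta_others_to_cls = nn.Parameter(torch.zeros(1))
        self.d_model = d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); position 0 is assumed to be [CLS].
        bsz, seq_len, _ = x.shape
        scale = math.sqrt(2 * self.d_model)  # scaling shared by the two terms

        # Contextual correlation computed from word representations only.
        context = self.word_q(x) @ self.word_k(x).transpose(-1, -2) / scale

        # Positional correlation computed from position embeddings only
        # (identical across the batch).
        p = self.pos_emb(torch.arange(seq_len, device=x.device))
        positional = self.pos_q(p) @ self.pos_k(p).transpose(-1, -2) / scale

        # Untie [CLS]: replace positional correlations involving position 0
        # with content-free learnable values.
        positional = positional.clone()
        positional[0, :] = self.theta_cls_to_others
        positional[:, 0] = self.theta_others_to_cls

        # Pre-softmax attention scores: (batch, seq_len, seq_len).
        return context + positional.unsqueeze(0).expand(bsz, -1, -1)
```

Keeping the two correlation terms separate avoids the word-to-position cross terms that arise when positional and word embeddings are summed before projection, which is the "mixed and noisy correlations" issue the abstract refers to.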
