Paper Title
Position Prediction as an Effective Pretraining Strategy
Paper Authors
Paper Abstract
Transformers have gained increasing popularity in a wide range of applications, including Natural Language Processing (NLP), Computer Vision, and Speech Recognition, because of their powerful representational capacity. However, harnessing this representational capacity effectively requires a large amount of data, strong regularization, or both, to mitigate overfitting. Recently, the power of the Transformer has been unlocked by self-supervised pretraining strategies based on masked autoencoders, which rely on reconstructing masked inputs, either directly or contrastively, from unmasked content. This pretraining strategy, used in BERT models in NLP, Wav2Vec models in Speech, and, recently, MAE models in Vision, forces the model to learn about relationships between the content in different parts of the input using autoencoding-related objectives. In this paper, we propose a novel but surprisingly simple alternative to content reconstruction: predicting locations from content, without providing positional information for it. Doing so requires the Transformer to understand the positional relationships between different parts of the input from their content alone. This amounts to an efficient implementation where the pretext task is a classification problem among all possible positions for each input token. We experiment on both Vision and Speech benchmarks, where our approach brings improvements over strong supervised training baselines and is comparable to modern unsupervised/self-supervised pretraining methods. Our method also enables Transformers trained without position embeddings to outperform ones trained with full position information.
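To make the pretext task concrete, below is a minimal sketch (not the authors' implementation) of position prediction as an N-way classification problem: content tokens are embedded without any position embedding, passed through a standard Transformer encoder, and a per-token head classifies which of the N possible positions each token occupies. The class name, dimensions, and use of PyTorch's built-in TransformerEncoder are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PositionPredictionPretext(nn.Module):
    """Toy sketch: encode content-only tokens, then classify each token's position."""
    def __init__(self, patch_dim=768, num_positions=196, dim=256, depth=4, heads=8):
        super().__init__()
        self.proj = nn.Linear(patch_dim, dim)             # content embedding; no position embedding added
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, num_positions)         # one logit per possible position

    def forward(self, patches):
        h = self.encoder(self.proj(patches))              # (B, N, dim)
        return self.head(h)                               # (B, N, num_positions)

B, N, patch_dim = 2, 196, 768
model = PositionPredictionPretext(patch_dim=patch_dim, num_positions=N)
patches = torch.randn(B, N, patch_dim)                    # stand-in for image patches / speech frames
targets = torch.arange(N).expand(B, N)                    # ground-truth position of each token
logits = model(patches)
loss = F.cross_entropy(logits.reshape(-1, N), targets.reshape(-1))
```

Because no position embedding is added, the encoder is permutation-equivariant, so the only way for the model to reduce this cross-entropy loss is to infer each token's position from its content and its relationship to the other tokens.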