Paper Title

On the Role of Bidirectionality in Language Model Pre-Training

Paper Authors

Mikel Artetxe, Jingfei Du, Naman Goyal, Luke Zettlemoyer, Ves Stoyanov

Abstract

Prior work on language model pre-training has explored different architectures and learning objectives, but differences in data, hyperparameters and evaluation make a principled comparison difficult. In this work, we focus on bidirectionality as a key factor that differentiates existing approaches, and present a comprehensive study of its role in next token prediction, text infilling, zero-shot priming and fine-tuning. We propose a new framework that generalizes prior approaches, including fully unidirectional models like GPT, fully bidirectional models like BERT, and hybrid models like CM3 and prefix LM. Our framework distinguishes between two notions of bidirectionality (bidirectional context and bidirectional attention) and allows us to control each of them separately. We find that the optimal configuration is largely application-dependent (e.g., bidirectional attention is beneficial for fine-tuning and infilling, but harmful for next token prediction and zero-shot priming). We train models with up to 6.7B parameters, and find differences to remain consistent at scale. While prior work on scaling has focused on left-to-right autoregressive models, our results suggest that this approach comes with some trade-offs, and it might be worthwhile to develop very large bidirectional models.
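To make the "bidirectional attention" notion concrete, below is a minimal sketch (not the authors' code) of the attention masks that distinguish fully unidirectional models like GPT, fully bidirectional models like BERT, and prefix-LM style hybrids. The function names and the `prefix_len` parameter are illustrative assumptions, not part of the paper's framework API.

```python
# Sketch of attention masks, assuming a boolean mask where True means
# "position i may attend to position j".
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Unidirectional (GPT-style): each position attends only to itself
    and earlier positions."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def bidirectional_mask(seq_len: int) -> np.ndarray:
    """Fully bidirectional (BERT-style): every position attends to all positions."""
    return np.ones((seq_len, seq_len), dtype=bool)

def prefix_lm_mask(seq_len: int, prefix_len: int) -> np.ndarray:
    """Hybrid (prefix-LM style): positions inside the prefix attend
    bidirectionally within the prefix; later positions attend causally."""
    mask = causal_mask(seq_len)
    mask[:prefix_len, :prefix_len] = True
    return mask

if __name__ == "__main__":
    # Example: sequence of 4 tokens, the first 2 forming the bidirectional prefix.
    print(causal_mask(4).astype(int))
    print(prefix_lm_mask(4, prefix_len=2).astype(int))
```

This only illustrates the attention-mask axis; the paper's second axis, bidirectional context (whether a token's prediction may condition on tokens to its right, as in masked/infilling objectives), is a property of the training objective rather than of the mask alone.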
