Paper Title
On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines
Paper Authors
Paper Abstract
Fine-tuning pre-trained transformer-based language models such as BERT has become a common practice dominating leaderboards across various NLP benchmarks. Despite the strong empirical performance of fine-tuned models, fine-tuning is an unstable process: training the same model with multiple random seeds can result in a large variance of the task performance. Previous literature (Devlin et al., 2019; Lee et al., 2020; Dodge et al., 2020) identified two potential reasons for the observed instability: catastrophic forgetting and small size of the fine-tuning datasets. In this paper, we show that both hypotheses fail to explain the fine-tuning instability. We analyze BERT, RoBERTa, and ALBERT, fine-tuned on commonly used datasets from the GLUE benchmark, and show that the observed instability is caused by optimization difficulties that lead to vanishing gradients. Additionally, we show that the remaining variance of the downstream task performance can be attributed to differences in generalization where fine-tuned models with the same training loss exhibit noticeably different test performance. Based on our analysis, we present a simple but strong baseline that makes fine-tuning BERT-based models significantly more stable than the previously proposed approaches. Code to reproduce our results is available online: https://github.com/uds-lsv/bert-stable-fine-tuning.
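To make the instability described above concrete, the sketch below fine-tunes a BERT checkpoint on one small GLUE task with several random seeds and compares the resulting validation accuracies. This is a minimal illustration using the Hugging Face Transformers/Datasets APIs, not the authors' exact experimental setup; the choice of task (RTE), checkpoint, and hyperparameters are assumptions for demonstration only.

```python
import numpy as np
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL = "bert-base-uncased"  # assumption: any BERT-family checkpoint could be used

def fine_tune_with_seed(seed: int) -> float:
    """Fine-tune once with a given random seed and return validation accuracy."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

    # RTE: one of the small GLUE datasets on which fine-tuning instability is often reported
    raw = load_dataset("glue", "rte")

    def encode(batch):
        return tokenizer(batch["sentence1"], batch["sentence2"],
                         truncation=True, padding="max_length", max_length=128)

    data = raw.map(encode, batched=True)

    def accuracy(eval_pred):
        logits, labels = eval_pred
        return {"accuracy": (np.argmax(logits, axis=-1) == labels).mean()}

    args = TrainingArguments(
        output_dir=f"rte-seed-{seed}",
        seed=seed,                       # the only thing that differs between runs
        learning_rate=2e-5,              # assumed hyperparameters, for illustration
        num_train_epochs=3,
        per_device_train_batch_size=16,
        report_to=[],
    )
    trainer = Trainer(model=model, args=args,
                      train_dataset=data["train"],
                      eval_dataset=data["validation"],
                      compute_metrics=accuracy)
    trainer.train()
    return trainer.evaluate()["eval_accuracy"]

# Comparing accuracies across seeds exposes the run-to-run variance described in the abstract.
accuracies = [fine_tune_with_seed(seed) for seed in range(5)]
print("min/max accuracy across seeds:", min(accuracies), max(accuracies))
```

Running several such identical jobs that differ only in the seed is the standard way to quantify fine-tuning stability; the spread between the best and worst seed is what the abstract refers to as the large variance in task performance.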