Paper Title
UL2: Unifying Language Learning Paradigms
Paper Authors
Paper Abstract
Existing pre-trained models are generally geared towards a particular class of problems. To date, there still seems to be no consensus on what the right architecture and pre-training setup should be. This paper presents a unified framework for pre-training models that are universally effective across datasets and setups. We begin by disentangling architectural archetypes from pre-training objectives -- two concepts that are commonly conflated. Next, we present a generalized and unified perspective on self-supervision in NLP and show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective. We then propose Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms. We furthermore introduce a notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes. We conduct extensive ablation experiments to compare multiple pre-training objectives and find that our method pushes the Pareto frontier by outperforming T5- and GPT-like models across multiple diverse setups. By scaling our model up to 20B parameters, we achieve SOTA performance on 50 well-established supervised fine-tuning based NLP tasks. Our model also achieves strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization. On 0-shot MMLU, UL2 20B outperforms T0 and T5 models. UL2 20B also works well with chain-of-thought prompting and reasoning, making it an appealing choice for research into reasoning at a small-to-medium scale of 20B parameters. Finally, we apply FLAN instruction tuning to the UL2 20B model, achieving MMLU and Big-Bench scores competitive with FLAN-PaLM 62B. We release Flax-based T5X checkpoints for UL2 20B and Flan-UL2 20B.
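To make the Mixture-of-Denoisers and mode-switching ideas concrete, below is a minimal Python sketch of how a MoD sampler might draw one denoising task per example and prepend a paradigm token. It assumes the R/S/X denoiser families and [R]/[S]/[X] mode tokens described in the UL2 paper; the specific rates, span lengths, sentinel format, and helper names are illustrative assumptions rather than the released training configuration.

```python
import random

# A minimal, hypothetical sketch of the Mixture-of-Denoisers (MoD) idea.
# The three denoiser families (R: regular span corruption, S: sequential /
# prefix-LM denoising, X: extreme denoising) and the [R]/[S]/[X] mode tokens
# follow the UL2 paper; the rates, span lengths, sentinel format, and helper
# names below are illustrative assumptions, not the released configuration.

DENOISERS = [
    # (mode token, span length, corruption rate, sampling weight)
    ("[R]", 3,  0.15, 0.50),  # short spans, low corruption (T5-style)
    ("[S]", 0,  0.25, 0.25),  # prefix-LM: keep a prefix, predict the suffix
    ("[X]", 32, 0.50, 0.25),  # extreme: long spans and/or heavy corruption
]

def corrupt_span(tokens, span, rng):
    """Replace one contiguous span with a sentinel; the target recovers it."""
    start = rng.randrange(0, max(1, len(tokens) - span))
    inputs = tokens[:start] + ["<extra_id_0>"] + tokens[start + span:]
    targets = ["<extra_id_0>"] + tokens[start:start + span]
    return inputs, targets

def mod_example(tokens, rng=random):
    """Sample one denoiser and build a mode-tagged (input, target) pair."""
    weights = [w for *_, w in DENOISERS]
    mode, span, rate, _ = rng.choices(DENOISERS, weights=weights, k=1)[0]
    if mode == "[S]":
        # Sequential denoising: the suffix (here `rate` of the text) is the target.
        cut = int(len(tokens) * (1.0 - rate))
        inputs, targets = tokens[:cut], tokens[cut:]
    else:
        inputs, targets = corrupt_span(tokens, span, rng)
    # Mode switching: a paradigm token marks which denoiser produced the
    # example, and the same token can be prepended again at fine-tuning time.
    return [mode] + inputs, targets

# Usage: repeated calls on the same text draw different denoising tasks.
text = "unified language learning with a mixture of denoisers".split()
print(mod_example(text))
```

Repeated calls on the same text yield a mixture of T5-style span corruption, prefix-LM prediction, and extreme denoising, which is one way to realize the interpolation between objectives that the abstract describes.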