Paper Title
Transformer on a Diet
Paper Authors
Paper Abstract
Transformer has been widely used thanks to its ability to capture sequence information in an efficient way. However, recent developments, such as BERT and GPT-2, deliver only heavy architectures with a focus on effectiveness. In this paper, we explore three carefully designed light Transformer architectures to figure out whether a Transformer with less computation can produce competitive results. Experimental results on language model benchmark datasets hint that such a trade-off is promising: the light Transformer reduces parameters by up to 70% at best while obtaining perplexity competitive with the standard Transformer. The source code is publicly available.
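The abstract does not spell out the three light architectures, but the core idea of trading parameters for comparable perplexity can be illustrated with a generic parameter-sharing variant. The sketch below (PyTorch, an assumption; the paper's released code may use a different framework and different techniques) ties a single encoder layer's weights across all depths, in the spirit of Universal Transformer or ALBERT. It is an illustration of the parameter/quality trade-off, not a reconstruction of the paper's actual method.

```python
# Minimal sketch: a weight-shared Transformer encoder versus a standard
# stacked one. NOT the paper's method; only an illustration of how a
# "light" Transformer can hold far fewer parameters at the same depth.
import torch
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """Applies one encoder layer `num_layers` times, so the parameter
    count stays constant as depth grows."""
    def __init__(self, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, nhead)
        self.num_layers = num_layers

    def forward(self, x):
        for _ in range(self.num_layers):
            x = self.layer(x)
        return x

def count_params(m):
    return sum(p.numel() for p in m.parameters())

standard = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(512, 8), num_layers=6)
shared = SharedLayerEncoder(d_model=512, nhead=8, num_layers=6)

print("standard:", count_params(standard))  # ~6x the per-layer count
print("shared:  ", count_params(shared))    # ~1x the per-layer count

x = torch.randn(10, 2, 512)  # (seq_len, batch, d_model)
y = shared(x)                # same interface, roughly 5/6 fewer weights
```

In this toy comparison the shared encoder drops about 83% of the stack's parameters, which is in the same spirit as (though not the same mechanism behind) the roughly 70% reduction reported in the abstract.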