Paper Title
Addressing Some Limitations of Transformers with Feedback Memory
Paper Authors
Paper Abstract
Transformers have been successfully applied to sequential, auto-regressive tasks despite being feedforward networks. Unlike recurrent neural networks, Transformers use attention to capture temporal relations while processing input tokens in parallel. While this parallelization makes them computationally efficient, it restricts the model from fully exploiting the sequential nature of the input. The representation at a given layer can only access representations from lower layers, rather than the higher level representations already available. In this work, we propose the Feedback Transformer architecture that exposes all previous representations to all future representations, meaning the lowest representation of the current timestep is formed from the highest-level abstract representation of the past. We demonstrate on a variety of benchmarks in language modeling, machine translation, and reinforcement learning that the increased representation capacity can create small, shallow models with much stronger performance than comparable Transformers.
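Below is a minimal sketch of the feedback-memory idea described in the abstract: tokens are processed one timestep at a time, every layer attends over a single shared memory, and that memory slot for each past timestep is a learned merge of all of that timestep's layer representations (so low layers of the current step can read high-level abstractions of the past). The class name `FeedbackTransformerSketch`, the reuse of `nn.TransformerDecoderLayer` as the attention block, the softmax layer-merging weights, and all hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeedbackTransformerSketch(nn.Module):
    """Sketch: each layer at timestep t attends over one memory built from
    ALL layers of previous timesteps (plus the current input embedding)."""

    def __init__(self, d_model=64, n_layers=4, n_heads=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerDecoderLayer(
                d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
            )
            for _ in range(n_layers)
        )
        # Learned weights that merge the input plus every layer's output
        # of a timestep into a single memory vector for that timestep.
        self.layer_weights = nn.Parameter(torch.zeros(n_layers + 1))

    def forward(self, x):
        # x: (batch, seq_len, d_model). Processing is sequential over time,
        # which is the price paid for exposing past high-level states.
        memory, outputs = [], []
        for t in range(x.size(1)):
            h = x[:, t : t + 1, :]                 # current token, (B, 1, D)
            states = [h]
            mem = torch.cat(memory + [h], dim=1)   # merged past states + current input
            for layer in self.layers:
                h = layer(h, mem)                  # every layer sees the same memory
                states.append(h)
            # Collapse all layer states of this timestep into one memory slot.
            w = F.softmax(self.layer_weights, dim=0)
            memory.append(sum(wi * si for wi, si in zip(w, states)))
            outputs.append(h)
        return torch.cat(outputs, dim=1)


# Usage example (toy shapes): 2 sequences of 8 tokens with 64-dim embeddings.
model = FeedbackTransformerSketch()
out = model(torch.randn(2, 8, 64))
print(out.shape)  # torch.Size([2, 8, 64])
```

In a standard Transformer, layer l at step t can only read layer l-1 of past steps; here the single merged memory per timestep is what lets shallow models reuse high-level abstractions of the past, at the cost of losing training-time parallelism across timesteps.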