Paper Title

Memory Transformer

Paper Authors

Burtsev, Mikhail S., Kuratov, Yuri, Peganov, Anton, Sapunov, Grigory V.

Paper Abstract

Transformer-based models have achieved state-of-the-art results in many natural language processing tasks. The self-attention architecture allows a transformer to combine information from all elements of a sequence into context-aware representations. However, information about the context is stored mostly in the same element-wise representations, which may make it harder to process properties related to the sequence as a whole. Adding trainable memory to selectively store local as well as global representations of a sequence is a promising direction for improving the Transformer model. Memory-augmented neural networks (MANNs) extend traditional neural architectures with general-purpose memory for representations. MANNs have demonstrated the capability to learn simple algorithms like Copy or Reverse and can be successfully trained via backpropagation on diverse tasks, from question answering to language modeling, outperforming RNNs and LSTMs of comparable complexity. In this work, we propose and study several extensions of the Transformer baseline: (1) adding memory tokens to store non-local representations, (2) creating a memory bottleneck for the global information, and (3) controlling memory updates with a dedicated layer. We evaluate these memory-augmented Transformers and demonstrate that the presence of memory correlates positively with model performance on machine translation and language modeling tasks. Augmenting a pre-trained masked language model with memory tokens shows mixed results on tasks from the GLUE benchmark. Visualization of attention patterns over the memory suggests that it improves the model's ability to process a global context.
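Below is a minimal PyTorch sketch of the first extension described in the abstract: a small set of trainable memory tokens prepended to the input sequence, so that an otherwise standard Transformer encoder can read from and write to these slots through ordinary self-attention. The class name, hyperparameters, and the use of torch.nn's stock encoder are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MemoryTransformerEncoder(nn.Module):
    """Sketch of the memory-token idea: trainable [mem] embeddings are
    prepended to the input sequence and processed by an unmodified
    Transformer encoder, so self-attention can use them as non-local
    storage at every layer. Illustrative only, not the paper's code."""

    def __init__(self, vocab_size=1000, d_model=128, nhead=4,
                 num_layers=2, num_mem_tokens=10, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        # Trainable memory token embeddings, shared across all examples.
        self.mem_tokens = nn.Parameter(torch.randn(num_mem_tokens, d_model) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len)
        batch, seq_len = token_ids.shape
        positions = torch.arange(seq_len, device=token_ids.device)
        x = self.embed(token_ids) + self.pos(positions)
        # Prepend the memory tokens to every sequence in the batch.
        mem = self.mem_tokens.unsqueeze(0).expand(batch, -1, -1)
        h = self.encoder(torch.cat([mem, x], dim=1))
        # Split the output back into memory and sequence representations.
        mem_out, seq_out = h[:, :mem.size(1)], h[:, mem.size(1):]
        return mem_out, seq_out

# Usage: encode a toy batch and inspect the shapes.
model = MemoryTransformerEncoder()
mem_out, seq_out = model(torch.randint(0, 1000, (2, 16)))
print(mem_out.shape, seq_out.shape)  # torch.Size([2, 10, 128]) torch.Size([2, 16, 128])
```

The other extensions mentioned in the abstract (a memory bottleneck for global information and a dedicated memory-update layer) would change how the sequence and memory attend to each other, not the basic prepend-and-encode interface sketched here.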
