On the Learning of Non-Autoregressive Transformers

Paper title

On the Learning of Non-Autoregressive Transformers

Authors

Fei Huang, Tianhua Tao, Hao Zhou, Lei Li, Minlie Huang

Abstract

Non-autoregressive Transformers (NATs) are a family of text generation models that aim to reduce decoding latency by predicting whole sentences in parallel. However, this latency reduction sacrifices the ability to capture left-to-right dependencies, which makes NAT learning very challenging. In this paper, we present theoretical and empirical analyses to reveal the challenges of NAT learning and propose a unified perspective for understanding existing successes. First, we show that simply training a NAT by maximizing the likelihood can lead to an approximation of the marginal distributions but drops all dependencies between tokens, where the dropped information can be measured by the dataset's conditional total correlation. Second, we formalize many previous objectives in a unified framework and show that their success can be summarized as maximizing the likelihood on a proxy distribution, which leads to reduced information loss. Empirical studies show that our perspective can explain phenomena in NAT learning and guide the design of new training methods.
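The first claim in the abstract can be illustrated with a small numerical sketch. The two-sentence target distribution below is a hypothetical toy example, not data from the paper, and it assumes a single fixed source sentence, so the conditional total correlation reduces to the ordinary total correlation of that one target distribution. Under those assumptions, the best fully factorized model is the product of the per-position marginals, and its KL divergence from the true distribution equals the total correlation.

```python
import math
from collections import defaultdict


def entropy(dist):
    """Shannon entropy in bits of a {outcome: probability} mapping."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def marginals(joint):
    """Per-position marginal distributions of a {token_tuple: probability} mapping."""
    length = len(next(iter(joint)))
    margs = [defaultdict(float) for _ in range(length)]
    for seq, p in joint.items():
        for i, tok in enumerate(seq):
            margs[i][tok] += p
    return margs


def total_correlation(joint):
    """Sum of per-position entropies minus the joint entropy, in bits."""
    return sum(entropy(m) for m in marginals(joint)) - entropy(joint)


def kl_to_product_of_marginals(joint):
    """KL(P || Q) in bits, where Q is the product of P's per-position marginals."""
    margs = marginals(joint)
    kl = 0.0
    for seq, p in joint.items():
        if p > 0:
            q = math.prod(margs[i][tok] for i, tok in enumerate(seq))
            kl += p * math.log2(p / q)
    return kl


# Hypothetical toy target distribution for one source sentence:
# two equally likely two-token translations.
joint = {("vielen", "dank"): 0.5, ("danke", "schön"): 0.5}

tc = total_correlation(joint)
kl = kl_to_product_of_marginals(joint)
print(f"total correlation: {tc:.3f} bits")
print(f"KL to product of marginals: {kl:.3f} bits")
```

In this toy case the factorized model matches each position's marginal exactly, yet it also places probability 0.25 on each mixed output such as ("vielen", "schön"), which is the kind of dropped token dependency the abstract describes.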
