Paper Title

Iterative Refinement in the Continuous Space for Non-Autoregressive Neural Machine Translation

Paper Authors

Jason Lee, Raphael Shu, Kyunghyun Cho

Paper Abstract

We propose an efficient inference procedure for non-autoregressive machine translation that iteratively refines translation purely in the continuous space. Given a continuous latent variable model for machine translation (Shu et al., 2020), we train an inference network to approximate the gradient of the marginal log probability of the target sentence, using only the latent variable as input. This allows us to use gradient-based optimization to find the target sentence at inference time that approximately maximizes its marginal probability. As each refinement step only involves computation in the latent space of low dimensionality (we use 8 in our experiments), we avoid computational overhead incurred by existing non-autoregressive inference procedures that often refine in token space. We compare our approach to a recently proposed EM-like inference procedure (Shu et al., 2020) that optimizes in a hybrid space, consisting of both discrete and continuous variables. We evaluate our approach on WMT'14 En-De, WMT'16 Ro-En and IWSLT'16 De-En, and observe two advantages over the EM-like inference: (1) it is computationally efficient, i.e. each refinement step is twice as fast, and (2) it is more effective, resulting in higher marginal probabilities and BLEU scores with the same number of refinement steps. On WMT'14 En-De, for instance, our approach is able to decode 6.2 times faster than the autoregressive model with minimal degradation to translation quality (0.9 BLEU).
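To make the inference procedure concrete, below is a minimal, self-contained PyTorch sketch of the core idea: a trained inference network predicts the gradient of the target sentence's marginal log-probability with respect to a low-dimensional latent variable z, and decoding performs a few gradient-ascent steps on z before a single non-autoregressive decoding pass. Everything here (the ToyDeltaNet stand-in, the shapes, the step size and step count) is an illustrative assumption for exposition, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

LATENT_DIM = 8  # the paper reports using an 8-dimensional latent space


class ToyDeltaNet(nn.Module):
    """Stand-in for the trained inference network: maps z (and, in the real
    model, source-side context) to an estimate of d log p(y | x) / dz."""

    def __init__(self, dim: int = LATENT_DIM):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, dim))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)


@torch.no_grad()  # the gradient is *predicted* by a network, not computed by autograd
def refine_latents(delta_net: nn.Module, z: torch.Tensor,
                   num_steps: int = 4, step_size: float = 1.0) -> torch.Tensor:
    """Iteratively refine the latent sequence z of shape [batch, length, dim].

    Each step is cheap because it only involves computation in the
    low-dimensional latent space, never in the token space.
    """
    for _ in range(num_steps):
        z = z + step_size * delta_net(z)  # one approximate gradient-ascent step
    return z


# Usage: refine a batch of 2 sentences, each with 10 latent positions.
delta_net = ToyDeltaNet()
z0 = torch.randn(2, 10, LATENT_DIM)  # initial latents, e.g. drawn from a prior
z_star = refine_latents(delta_net, z0)
print(z_star.shape)  # torch.Size([2, 10, 8])
```

In the full model, the refined latents z_star would then be passed once through a non-autoregressive decoder to produce all target tokens in parallel, which is where the reported speedup over autoregressive decoding comes from.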
