Paper Title

Reinforced Curriculum Learning on Pre-trained Neural Machine Translation Models

Paper Authors

Mingjun Zhao, Haijiang Wu, Di Niu, Xiaoli Wang

Paper Abstract

The competitive performance of neural machine translation (NMT) critically relies on large amounts of training data. However, acquiring high-quality translation pairs requires expert knowledge and is costly. Therefore, how to best utilize a given dataset of samples with diverse quality and characteristics becomes an important yet understudied question in NMT. Curriculum learning methods have been introduced to NMT to optimize a model's performance by prescribing the data input order, based on heuristics such as the assessment of noise and difficulty levels. However, existing methods require training from scratch, while in practice most NMT models are pre-trained on big data already. Moreover, as heuristics, they do not generalize well. In this paper, we aim to learn a curriculum for improving a pre-trained NMT model by re-selecting influential data samples from the original training set and formulate this task as a reinforcement learning problem. Specifically, we propose a data selection framework based on Deterministic Actor-Critic, in which a critic network predicts the expected change of model performance due to a certain sample, while an actor network learns to select the best sample out of a random batch of samples presented to it. Experiments on several translation datasets show that our method can further improve the performance of NMT when original batch training reaches its ceiling, without using additional new training data, and significantly outperforms several strong baseline methods.
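The actor-critic data-selection loop described in the abstract can be pictured with a short sketch. The PyTorch code below is a minimal illustration under my own assumptions, not the authors' released implementation: the sample feature vectors, the finetune_and_eval reward callback, and the softmax relaxation used in the actor update are hypothetical stand-ins for whatever the paper actually uses.

```python
# Minimal sketch (not the authors' code) of actor-critic data selection for
# fine-tuning a pre-trained NMT model. Hypothetical pieces: the candidate
# feature vectors, the finetune_and_eval reward callback, and the softmax
# relaxation that makes the actor's argmax selection differentiable.
import torch
import torch.nn as nn


class Critic(nn.Module):
    """Predicts the expected change in dev-set performance (e.g. delta BLEU)
    if the NMT model were fine-tuned on a given candidate sample."""

    def __init__(self, feat_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.net(feats).squeeze(-1)   # one predicted reward per sample


class Actor(nn.Module):
    """Scores each candidate in a randomly drawn batch; the top-scoring
    sample is the one selected for the next fine-tuning step."""

    def __init__(self, feat_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.net(feats).squeeze(-1)   # one selection score per sample


def selection_step(actor, critic, actor_opt, critic_opt,
                   batch_feats, finetune_and_eval):
    """One round of curriculum selection over a random candidate batch.

    batch_feats: (num_candidates, feat_dim) features of candidate samples.
    finetune_and_eval(i): hypothetical callback that fine-tunes the
        pre-trained NMT model on candidate i and returns the observed
        change in dev BLEU (the reward).
    """
    # 1) Deterministic selection: pick the candidate the actor scores highest.
    with torch.no_grad():
        idx = int(actor(batch_feats).argmax().item())

    # 2) Fine-tune on that sample; the observed delta BLEU is the reward.
    reward = finetune_and_eval(idx)

    # 3) Critic update: regress the predicted reward toward the observed one.
    critic_loss = (critic(batch_feats[idx]) - reward) ** 2
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # 4) Actor update: raise the scores of candidates the critic values highly
    #    (a softmax relaxation of the hard argmax, for differentiability).
    weights = torch.softmax(actor(batch_feats), dim=0)
    actor_loss = -(weights * critic(batch_feats).detach()).sum()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```

In such a setup the candidate features could be any per-sample statistics (e.g. sentence length or the current model's loss on the pair); those choices are assumptions for illustration, not the features specified in the paper.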
