Paper Title
A Hybrid Approach for Improved Low Resource Neural Machine Translation using Monolingual Data
Paper Authors
Paper Abstract
Many language pairs are low resource, meaning the amount and/or quality of available parallel data is not sufficient to train a neural machine translation (NMT) model that can reach an acceptable standard of accuracy. Many works have explored using the readily available monolingual data in either or both of the languages to improve the standard of translation models in low, and even high, resource languages. One of the most successful of such works is back-translation, which utilizes translations of the target-language monolingual data to increase the amount of training data. The quality of the backward model, which is trained on the available parallel data, has been shown to determine the performance of the back-translation approach. Despite this, in standard back-translation only the forward model is improved using the monolingual target data. A previous study proposed an iterative back-translation approach that improves both models over several iterations; unlike traditional back-translation, however, it relies on both target and source monolingual data. This work therefore proposes a novel approach that enables both the backward and forward models to benefit from the monolingual target data through a hybrid of self-learning and back-translation, respectively. Experimental results show the superiority of the proposed approach over the traditional back-translation method on English-German low resource neural machine translation. We also propose an iterative self-learning approach that outperforms iterative back-translation while relying only on monolingual target data and requiring the training of fewer models.
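The data flow the abstract describes can be summarized as follows. Below is a minimal Python sketch of that pipeline, assuming hypothetical train_model and translate helpers that stand in for a real NMT toolkit's training and decoding routines (e.g., fairseq or OpenNMT-py); the stubs are placeholders, and only the ordering of the self-learning and back-translation steps reflects the abstract.

    from typing import Dict, List

    def train_model(src: List[str], tgt: List[str]) -> Dict:
        # Hypothetical stand-in for an NMT toolkit's training routine;
        # here it only records how many sentence pairs it saw.
        return {"pairs_seen": len(src)}

    def translate(model: Dict, sentences: List[str]) -> List[str]:
        # Hypothetical stand-in for beam-search decoding; emits placeholder
        # hypotheses so the pipeline runs end to end.
        return [f"<hyp of: {s}>" for s in sentences]

    def hybrid_pipeline(par_src: List[str], par_tgt: List[str],
                        mono_tgt: List[str]):
        # 1. Train the backward (target -> source) model on the authentic
        #    parallel data, as in standard back-translation.
        backward = train_model(par_tgt, par_src)

        # 2. Self-learning: the backward model translates the monolingual
        #    target data, then is retrained on the authentic pairs plus its
        #    own (target, synthetic source) output.
        synth_src = translate(backward, mono_tgt)
        backward = train_model(par_tgt + mono_tgt, par_src + synth_src)

        # 3. Back-translation: the improved backward model regenerates the
        #    synthetic source side, and the forward (source -> target) model
        #    is trained on the authentic plus synthetic pairs.
        synth_src = translate(backward, mono_tgt)
        forward = train_model(par_src + synth_src, par_tgt + mono_tgt)
        return backward, forward

    backward, forward = hybrid_pipeline(
        ["a house", "the cat"],           # toy source side of parallel data
        ["ein Haus", "die Katze"],        # toy target side of parallel data
        ["der Hund", "ein Buch", "gut"],  # toy monolingual target data
    )
    print(forward)  # {'pairs_seen': 5}

In this sketch both models see the same monolingual target data: the backward model through its own synthetic source output (self-learning), and the forward model through the back-translated pairs, which is the hybrid the abstract proposes.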