Paper Title

Harnessing Multilinguality in Unsupervised Machine Translation for Rare Languages

Paper Authors

Xavier Garcia, Aditya Siddhant, Orhan Firat, Ankur P. Parikh

Paper Abstract

Unsupervised translation has reached impressive performance on resource-rich language pairs such as English-French and English-German. However, early studies have shown that in more realistic settings involving low-resource, rare languages, unsupervised translation performs poorly, achieving less than 3.0 BLEU. In this work, we show that multilinguality is critical to making unsupervised systems practical for low-resource settings. In particular, we present a single model for 5 low-resource languages (Gujarati, Kazakh, Nepali, Sinhala, and Turkish) to and from English directions, which leverages monolingual and auxiliary parallel data from other high-resource language pairs via a three-stage training scheme. We outperform all current state-of-the-art unsupervised baselines for these languages, achieving gains of up to 14.4 BLEU. Additionally, we outperform a large collection of supervised WMT submissions for various language pairs as well as match the performance of the current state-of-the-art supervised model for Nepali-English. We conduct a series of ablation studies to establish the robustness of our model under different degrees of data quality, as well as to analyze the factors which led to the superior performance of the proposed approach over traditional unsupervised models.
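
The abstract does not spell out what the three training stages are. As a rough orientation only, here is a minimal Python sketch of one plausible multi-stage recipe for this setting, assuming a standard unsupervised-MT pipeline (denoising pretraining on monolingual data, joint training on the auxiliary high-resource parallel pairs mixed with online back-translation for the low-resource pairs, then back-translation-focused refinement). All names (`Model`, `train_step`, `back_translate`, etc.) are illustrative stand-ins, not the authors' implementation:

```python
import random

# Hypothetical setup: auxiliary high-resource parallel pairs plus the five
# low-resource languages from the paper, which have only monolingual data.
HIGH_RESOURCE = [("de", "en"), ("fr", "en")]
LOW_RESOURCE = ["gu", "kk", "ne", "si", "tr"]

class Model:
    """Placeholder for a single shared multilingual encoder-decoder."""
    def train_step(self, loss_name, batch):
        # A real system would compute a loss here and update parameters.
        print(f"step: {loss_name} on {batch}")

def sample_monolingual(lang):
    return f"<mono:{lang}>"

def sample_parallel(src, tgt):
    return f"<parallel:{src}-{tgt}>"

def back_translate(model, lang):
    # Translate low-resource monolingual text with the current model to
    # create a synthetic parallel pair, then train on that pair.
    return f"<synthetic:en-{lang}>"

model = Model()

# Stage 1 (assumed): denoising pretraining on monolingual data for all languages.
for lang in LOW_RESOURCE + ["en", "de", "fr"]:
    model.train_step("denoising", sample_monolingual(lang))

# Stage 2 (assumed): supervised training on the auxiliary high-resource
# pairs, interleaved with online back-translation for the low-resource pairs.
for _ in range(3):  # a few illustrative steps
    src, tgt = random.choice(HIGH_RESOURCE)
    model.train_step("supervised", sample_parallel(src, tgt))
    lang = random.choice(LOW_RESOURCE)
    model.train_step("back_translation", back_translate(model, lang))

# Stage 3 (assumed): back-translation-only refinement so the single shared
# model specializes on translating the low-resource languages to/from English.
for lang in LOW_RESOURCE:
    model.train_step("back_translation", back_translate(model, lang))
```

The key design point the abstract does commit to is that a single multilingual model serves all five low-resource pairs, so any such training loop would share one set of parameters across every language rather than training per-pair systems.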
