Paper Title

Pre-training via Leveraging Assisting Languages and Data Selection for Neural Machine Translation

Authors

Haiyue Song, Raj Dabre, Zhuoyuan Mao, Fei Cheng, Sadao Kurohashi, Eiichiro Sumita

Abstract

Sequence-to-sequence (S2S) pre-training using large monolingual data is known to improve performance for various S2S NLP tasks in low-resource settings. However, large monolingual corpora might not always be available for the languages of interest (LOI). To this end, we propose to exploit monolingual corpora of other languages to complement the scarcity of monolingual corpora for the LOI. A case study of low-resource Japanese-English neural machine translation (NMT) reveals that leveraging large Chinese and French monolingual corpora can help overcome the shortage of Japanese and English monolingual corpora, respectively, for S2S pre-training. We further show how to utilize script mapping (Chinese to Japanese) to increase the similarity between the two monolingual corpora leading to further improvements in translation quality. Additionally, we propose simple data-selection techniques to be used prior to pre-training that significantly impact the quality of S2S pre-training. An empirical comparison of our proposed methods reveals that leveraging assisting language monolingual corpora, data selection and script mapping are extremely important for NMT pre-training in low-resource scenarios.
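The script-mapping idea (Chinese to Japanese) can be illustrated with a minimal sketch: replace each character in the assisting-language (Chinese) corpus with its Japanese kanji form, so the Chinese text looks more like Japanese to the pre-training model. The tiny mapping table below is purely illustrative and is not the paper's actual mapping resource.

```python
# Hypothetical Chinese-to-Japanese character mapping (illustrative sample only;
# the paper's real mapping table would cover the full character inventory).
ZH_TO_JA = {
    "发": "発",  # Simplified Chinese -> Japanese shinjitai
    "说": "説",
    "时": "時",
    "国": "国",  # many characters are already shared between the scripts
}

def map_script(text: str) -> str:
    """Replace each character with its Japanese form when a mapping exists;
    characters without an entry pass through unchanged."""
    return "".join(ZH_TO_JA.get(ch, ch) for ch in text)

print(map_script("发说时国"))  # mapped characters now use Japanese forms
```

Applying such a mapping over the Chinese monolingual corpus increases its surface-level (character) overlap with Japanese text, which the abstract reports leads to further gains in translation quality.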
