Paper Title
Code-switching pre-training for neural machine translation
Paper Authors
Paper Abstract
This paper proposes a new pre-training method, called Code-Switching Pre-training (CSP for short), for Neural Machine Translation (NMT). Unlike traditional pre-training methods, which randomly mask some fragments of the input sentence, the proposed CSP randomly replaces some words in the source sentence with their translation words in the target language. Specifically, we first perform lexicon induction with unsupervised word embedding mapping between the source and target languages, and then randomly replace some words in the input sentence with their translation words according to the extracted translation lexicons. CSP adopts the encoder-decoder framework: its encoder takes the code-mixed sentence as input, and its decoder predicts the replaced fragment of the input sentence. In this way, CSP is able to pre-train the NMT model by explicitly making the most of the cross-lingual alignment information extracted from the source and target monolingual corpora. Additionally, we relieve the pretrain-finetune discrepancy caused by artificial symbols like [mask]. To verify the effectiveness of the proposed method, we conduct extensive experiments on unsupervised and supervised NMT. Experimental results show that CSP achieves significant improvements over baselines without pre-training or with other pre-training methods.
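To illustrate the code-switching step described in the abstract, below is a minimal Python sketch of how a code-mixed training pair might be built from an induced translation lexicon. The function name `code_switch`, the `replace_ratio` parameter, and the toy English-to-German lexicon are illustrative assumptions, not the authors' implementation; in the paper the lexicon is induced via unsupervised word embedding mapping between the two languages.

```python
import random

def code_switch(tokens, lexicon, replace_ratio=0.15, seed=None):
    """Build a CSP-style training pair (illustrative sketch).

    Replaces a random subset of source tokens with their target-language
    translations from the induced lexicon. The code-mixed sentence is fed
    to the encoder, and the decoder is trained to predict the original
    replaced fragment.
    """
    rng = random.Random(seed)
    code_mixed = list(tokens)
    positions, originals = [], []
    # Only tokens covered by the induced lexicon are candidates for switching.
    candidates = [i for i, tok in enumerate(tokens) if tok in lexicon]
    rng.shuffle(candidates)
    n_replace = max(1, int(len(tokens) * replace_ratio)) if candidates else 0
    for i in sorted(candidates[:n_replace]):
        positions.append(i)
        originals.append(tokens[i])
        code_mixed[i] = lexicon[tokens[i]]  # translation word in the target language
    return code_mixed, positions, originals


# Toy usage with a hypothetical English->German lexicon (in practice the
# lexicon would come from unsupervised embedding mapping).
lexicon = {"house": "Haus", "dog": "Hund", "green": "grün"}
src = "the green house is next to the dog".split()
mixed, pos, orig = code_switch(src, lexicon, replace_ratio=0.3, seed=0)
print(mixed)        # e.g. ['the', 'grün', 'house', 'is', 'next', 'to', 'the', 'Hund']
print(pos, orig)    # positions of the replaced words and their original forms
```

The pre-training objective then asks the decoder to reconstruct `orig` given `mixed`, so no artificial [mask] symbol is introduced into the encoder input.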