Paper Title

Bitext Mining for Low-Resource Languages via Contrastive Learning

Authors

Weiting Tan, Philipp Koehn

Abstract

Mining high-quality bitexts for low-resource languages is challenging. This paper shows that sentence representations from language models fine-tuned with multiple negatives ranking loss, a contrastive objective, help retrieve clean bitexts. Experiments show that parallel data mined with our approach substantially outperforms the previous state-of-the-art method on the low-resource languages Khmer and Pashto.
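For readers unfamiliar with the objective named in the abstract, here is a minimal sketch of a multiple negatives ranking loss in plain Python. It is not the authors' implementation. It assumes the common in-batch formulation: each source sentence is scored against every target sentence in the batch by scaled cosine similarity, and its aligned translation is treated as the correct class in a softmax. The function names, the toy embeddings, and the `scale=20.0` default are illustrative assumptions, not details taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def multiple_negatives_ranking_loss(src_embs, tgt_embs, scale=20.0):
    """Mean cross-entropy over in-batch candidates.

    src_embs[i] and tgt_embs[i] are assumed to be a translation pair;
    every other tgt_embs[j] in the batch serves as a negative for src_embs[i].
    `scale` is an assumed temperature-like multiplier on the similarities.
    """
    assert len(src_embs) == len(tgt_embs)
    total = 0.0
    for i, src in enumerate(src_embs):
        logits = [scale * cosine(src, tgt) for tgt in tgt_embs]
        # Numerically stable log-sum-exp for the softmax normaliser.
        top = max(logits)
        log_norm = top + math.log(sum(math.exp(x - top) for x in logits))
        # Negative log-probability of the aligned target (index i).
        total += log_norm - logits[i]
    return total / len(src_embs)


# Toy 2-D "sentence embeddings" (hypothetical values for illustration).
src = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
shuffled = [aligned[1], aligned[0]]

loss_aligned = multiple_negatives_ranking_loss(src, aligned)
loss_shuffled = multiple_negatives_ranking_loss(src, shuffled)
```

In this toy setup the loss is lower when each source vector sits closest to its own target than when the targets are swapped. That is the property mining relies on: after fine-tuning, a true translation should be the nearest neighbour of its source sentence.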
