Paper Title

Geographical Distance Is The New Hyperparameter: A Case Study Of Finding The Optimal Pre-trained Language For English-isiZulu Machine Translation

Authors

Muhammad Umair Nasir, Innocent Amos Mchechesi

Abstract

Because datasets and textual resources for low-resource languages such as isiZulu are scarce, there is a significant need to harness knowledge from pre-trained models to improve low-resource machine translation. Moreover, a lack of techniques for handling the complexities of morphologically rich languages has compounded the unequal development of translation models, with many widely spoken African languages being left behind. This study explores the potential benefits of transfer learning in an English-isiZulu translation framework. The results indicate the value of transfer learning from closely related languages for enhancing the performance of low-resource translation models, providing a key strategy for low-resource translation going forward. We gathered results from 8 different language corpora, including one multilingual corpus, and found that isiXhosa-isiZulu outperformed all other languages, with a BLEU score of 8.56 on the test set, 2.73 points higher than the model pre-trained on the multilingual corpus. We also derived a new coefficient, Nasir's Geographical Distance Coefficient (NGDC), which provides an easy way to select languages for the pre-trained models. NGDC also indicated that isiXhosa should be selected as the language for the pre-trained model.
