Paper Title

HausaMT v1.0: Towards English-Hausa Neural Machine Translation

Author

Akinfaderin, Adewale

Abstract

Neural Machine Translation (NMT) for low-resource languages suffers from low performance because of the lack of large amounts of parallel data and language diversity. To help ameliorate this problem, we built a baseline model for English-Hausa machine translation, which is considered a low-resource language task. Hausa is the second largest Afro-Asiatic language in the world after Arabic, and it is the third largest trading language across a large swath of West African countries, after English and French. In this paper, we curated different datasets containing Hausa-English parallel corpora for our translation task. We trained baseline models and evaluated their performance using recurrent and Transformer encoder-decoder architectures with two tokenization approaches: standard word-level tokenization and Byte Pair Encoding (BPE) subword tokenization.
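
The difference between the two tokenization approaches named in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names (word_tokenize, learn_bpe, apply_bpe), the English toy corpus, the "</w>" end-of-word marker, and the merge count are all assumptions chosen for illustration. Word-level tokenization splits on whitespace, while BPE starts from characters and repeatedly merges the most frequent adjacent symbol pair.

```python
from collections import Counter


def word_tokenize(sentence):
    """Word-level tokenization: lowercase and split on whitespace."""
    return sentence.lower().split()


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of sentences."""
    # Each word starts as a tuple of characters plus an end-of-word marker.
    vocab = Counter()
    for sentence in corpus:
        for word in word_tokenize(sentence):
            vocab[tuple(word) + ("</w>",)] += 1

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the new merge to every word in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Segment a single word into subwords by replaying the learned merges in order."""
    symbols = tuple(word.lower()) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus (an assumption for illustration, not data from the paper).
toy_corpus = ["low lower lowest", "new newer newest"]
toy_merges = learn_bpe(toy_corpus, num_merges=5)
word_tokens = word_tokenize("Lower newest")
subword_tokens = [apply_bpe(w, toy_merges) for w in word_tokens]
```

With word-level tokenization, a word unseen in training maps to an unknown token. BPE can still segment it into known subword units, which is one common motivation for using it in low-resource settings.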
