Paper Title

Kencorpus: A Kenyan Language Corpus of Swahili, Dholuo and Luhya for Natural Language Processing Tasks

Authors

Barack Wanjawa, Lilian Wanzare, Florence Indede, Owen McOnyango, Edward Ombui, Lawrence Muchemi

Abstract

Indigenous African languages are categorized as under-served in Natural Language Processing. They therefore experience poor digital inclusivity and information access. The processing challenge with such languages has been how to use machine learning and deep learning models without the requisite data. The Kencorpus project intends to bridge this gap by collecting and storing text and speech data that is good enough for data-driven solutions in applications such as machine translation, question answering and transcription in multilingual communities. The Kencorpus dataset is a text and speech corpus for three languages predominantly spoken in Kenya: Swahili, Dholuo and Luhya. Data was collected by researchers from communities, schools, media, and publishers. The Kencorpus dataset has a collection of 5,594 items: 4,442 texts (5.6M words) and 1,152 speech files (177 hours). Based on this data, part-of-speech tagging sets for Dholuo and Luhya (50,000 and 93,000 words respectively) were developed. We developed 7,537 question-answer pairs for Swahili and created a text translation set of 13,400 sentences from Dholuo and Luhya into Swahili. The datasets are useful for downstream machine learning tasks such as model training and translation. We also developed two proof-of-concept systems: a Kiswahili speech-to-text system and a machine learning system for the question answering task, with results of 18.87% word error rate and 80% Exact Match (EM) respectively. These initial results give great promise for the usability of Kencorpus to the machine learning community. Kencorpus is one of the few public domain corpora for these three low-resource languages and forms a basis for learning and sharing experiences with similar work, especially for low-resource languages.
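
For readers unfamiliar with the two metrics quoted in the abstract, the following is a minimal sketch of how word error rate and Exact Match are commonly computed. It is not the authors' evaluation code: the function names `word_error_rate` and `exact_match` are hypothetical, and the normalization used for Exact Match (lowercasing and collapsing whitespace) is an assumption, since the abstract does not state which normalization was applied.

```python
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein recurrence).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def exact_match(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one prediction/reference pair is required")

    def normalize(text: str) -> str:
        # Assumed normalization: lowercase and collapse runs of whitespace.
        return " ".join(text.lower().split())

    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)
```

Under this definition, a word error rate of 18.87% corresponds to roughly 19 word-level edits (substitutions, deletions, insertions) per 100 reference words, and an Exact Match of 80% means that four out of five predicted answers equal the reference answer after normalization.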
