Paper Title

Part of Speech Tagging (POST) of a Low-resource Language using another Language (Developing a POS-Tagged Lexicon for Kurdish (Sorani) using a Tagged Persian (Farsi) Corpus)

Author

Hassani, Hossein

Abstract

Tagged corpora play a crucial role in a wide range of Natural Language Processing (NLP) tasks. Part of Speech Tagging (POST) is essential in developing tagged corpora. Manual tagging is time-consuming, labor-intensive, and costly, so automating it could make it more affordable. The Kurdish language currently lacks publicly available tagged corpora of adequate size. Tagging the publicly available Kurdish corpora can raise the usefulness of those resources beyond what raw or segmented corpora can provide. Developing POS-tagged lexicons can assist with this task. We use a tagged corpus in Persian (Farsi), the Bijankhan corpus, to develop a POS-tagged lexicon for Kurdish (Sorani), since Persian is a closely related language. This paper presents the approach of leveraging the resources of a closely related language to enrich those of Kurdish. A partial dataset of the results is publicly available for non-commercial use under the CC BY-NC-SA 4.0 license at https://kurdishblark.github.io/. We plan to make the whole tagged corpus available after further investigation of the outcome. The dataset can help in developing POS-tagged lexicons for other Kurdish dialects and in automating the tagging of Kurdish corpora.
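The abstract does not say how tags are carried over from Persian to Kurdish, so the following is only a minimal sketch of one possible reading: build a word-to-tags lexicon from a tagged source corpus, then copy tags to target-language words whose written form also occurs in that lexicon. The exact-match rule, the function names `build_lexicon` and `transfer_tags`, the tag labels, and the placeholder tokens are all assumptions made for illustration; none of them come from the paper or from the Bijankhan corpus.

```python
from collections import Counter, defaultdict


def build_lexicon(tagged_tokens):
    """Map each word form to its observed POS tags, most frequent first."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[word][tag] += 1
    return {word: [tag for tag, _ in c.most_common()]
            for word, c in counts.items()}


def transfer_tags(source_lexicon, target_words):
    """Copy tags to target words whose written form appears in the source lexicon.

    Exact string matching is an illustrative assumption, not the paper's method.
    """
    return {word: source_lexicon[word]
            for word in target_words
            if word in source_lexicon}


# Placeholder tokens and tags (hypothetical, not real corpus data).
source_tagged = [("w1", "N"), ("w1", "N"), ("w1", "ADJ"), ("w2", "V")]
target_words = ["w1", "w3"]

lexicon = build_lexicon(source_tagged)
target_lexicon = transfer_tags(lexicon, target_words)
print(target_lexicon)
```

Words with no match in the source lexicon (here `w3`) are simply left untagged in this sketch; a real pipeline would need some way to handle them, along with cases where a shared spelling carries a different meaning or category in the two languages.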
