L3Cube-MahaNLP: Marathi Natural Language Processing Datasets, Models, and Library

Paper Title

L3Cube-MahaNLP: Marathi Natural Language Processing Datasets, Models, and Library

Paper Author

Joshi, Raviraj

Abstract

Despite being the third most popular language in India, the Marathi language lacks useful NLP resources. Moreover, popular NLP libraries do not support Marathi. With L3Cube-MahaNLP, we aim to build resources and a library for Marathi natural language processing. We present datasets and transformer models for supervised tasks such as sentiment analysis, named entity recognition, and hate speech detection. We have also published a monolingual Marathi corpus for unsupervised language modeling tasks. Overall, we present the MahaCorpus, MahaSent, MahaNER, and MahaHate datasets and the corresponding MahaBERT models fine-tuned on these datasets. We aim to go beyond benchmark datasets and prepare useful resources for Marathi. The resources are available at https://github.com/l3cube-pune/MarathiNLP.
