Paper title
L3Cube-MahaNLP: Marathi Natural Language Processing Datasets, Models, and Library
Paper authors
Paper abstract
Despite being the third most popular language in India, the Marathi language lacks useful NLP resources. Moreover, popular NLP libraries do not have support for the Marathi language. With L3Cube-MahaNLP, we aim to build resources and a library for Marathi natural language processing. We present datasets and transformer models for supervised tasks like sentiment analysis, named entity recognition, and hate speech detection. We have also published a monolingual Marathi corpus for unsupervised language modeling tasks. Overall, we present the MahaCorpus, MahaSent, MahaNER, and MahaHate datasets and the corresponding MahaBERT models fine-tuned on these datasets. We aim to move ahead of benchmark datasets and prepare useful resources for Marathi. The resources are available at https://github.com/l3cube-pune/MarathiNLP.