Paper Title
A Large and Diverse Arabic Corpus for Language Modeling
Paper Authors
Paper Abstract
Language models (LMs) have introduced a major paradigm shift in Natural Language Processing (NLP), where large pre-trained LMs have become integral to most NLP tasks. LMs are capable of learning useful and relevant representations of a language without any supervision. Consequently, these models can be fine-tuned on typical NLP tasks with significantly higher accuracy than traditional approaches. However, training such models requires a massively large corpus that is a good representation of the language. English LMs generally perform better than their counterparts in other languages due to the availability of massive English corpora. This work elaborates on the design and development of a large Arabic corpus. It consists of over 500 GB of cleaned Arabic text, targeted at improving the cross-domain knowledge and downstream generalization capability of large-scale language models. Moreover, the corpus is used to train a large Arabic LM. To evaluate the effectiveness of the LM, it is fine-tuned on a number of typical NLP tasks. The fine-tuned tasks show an improvement of 4.5% to 8.5% over the same tasks fine-tuned on multilingual BERT (mBERT). To the best of my knowledge, this is currently the largest clean and diverse Arabic corpus ever collected.
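To make the evaluation setup concrete, the sketch below fine-tunes a pre-trained checkpoint on an Arabic sequence-classification task and repeats the run with mBERT as the baseline, mirroring the comparison reported in the abstract. This is a minimal illustration using the Hugging Face transformers library, not the author's code: the checkpoint name "path/to/arabic-lm" and the dataset objects are hypothetical placeholders, while "bert-base-multilingual-cased" is the public mBERT checkpoint.

```python
# A minimal sketch (assumptions noted below) of the comparison described in
# the abstract: fine-tune a pre-trained LM on a typical Arabic NLP task
# (sequence classification) and repeat the run with mBERT as the baseline.
import numpy as np
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

def accuracy(eval_pred):
    """Simple accuracy metric for the Trainer."""
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}

def fine_tune(checkpoint, train_ds, eval_ds, num_labels):
    """Fine-tune `checkpoint` on a text-classification dataset and
    return its evaluation metrics."""
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=num_labels
    )

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, padding="max_length")

    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir="out-" + checkpoint.split("/")[-1],
            num_train_epochs=3,
        ),
        train_dataset=train_ds.map(tokenize, batched=True),
        eval_dataset=eval_ds.map(tokenize, batched=True),
        compute_metrics=accuracy,
    )
    trainer.train()
    return trainer.evaluate()

# Usage: train_ds/eval_ds are assumed to be HuggingFace `datasets` splits
# with "text" and "label" columns; "path/to/arabic-lm" is a hypothetical
# placeholder for the Arabic LM trained on the 500 GB corpus.
# arabic = fine_tune("path/to/arabic-lm", train_ds, eval_ds, num_labels=3)
# mbert = fine_tune("bert-base-multilingual-cased", train_ds, eval_ds, num_labels=3)
```

Running both calls on the same splits yields per-checkpoint metrics whose difference corresponds to the kind of 4.5% to 8.5% gap over mBERT that the abstract reports.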