Paper Title

Wojood: Nested Arabic Named Entity Corpus and Recognition using BERT

Authors

Mustafa Jarrar, Mohammed Khalilia, Sana Ghanem

Abstract

This paper presents Wojood, a corpus for Arabic nested Named Entity Recognition (NER). Nested entities occur when one entity mention is embedded inside another entity mention. Wojood consists of about 550K Modern Standard Arabic (MSA) and dialect tokens that are manually annotated with 21 entity types, including person, organization, location, event, and date. More importantly, the corpus is annotated with nested entities instead of the more common flat annotations. The data contains about 75K entities, 22.5% of which are nested. The inter-annotator evaluation of the corpus demonstrated strong agreement, with a Cohen's Kappa of 0.979 and an F1-score of 0.976. To validate our data, we used the corpus to train a nested NER model based on multi-task learning and AraBERT (Arabic BERT). The model achieved an overall micro F1-score of 0.884. Our corpus, the annotation guidelines, the source code, and the pre-trained model are publicly available.
