Paper title
The Russian Drug Reaction Corpus and Neural Models for Drug Reactions and Effectiveness Detection in User Reviews
Paper authors
Paper abstract
The Russian Drug Reaction Corpus (RuDReC) is a new, partially annotated corpus of Russian-language consumer reviews of pharmaceutical products, built for detecting health-related named entities and the effectiveness of pharmaceutical products. The corpus consists of two parts: a raw part and a labelled part. The raw part includes 1.4 million health-related user-generated texts collected from various Internet sources, including social media. The labelled part contains 500 consumer reviews about drug therapy with drug- and disease-related information. Sentence-level labels mark the presence or absence of health-related issues. Sentences that mention a health-related issue are additionally labelled at the expression level to identify fine-grained subtypes such as drug classes and drug forms, drug indications, and drug reactions. Further, we present a baseline model for named entity recognition (NER) and multi-label sentence classification tasks on this corpus. Our RuDR-BERT model achieves a macro F1 score of 74.85% on the NER task. For the sentence classification task, our model achieves a macro F1 score of 68.82%, gaining 7.47% over the score of a BERT model trained on Russian data. We make the RuDReC corpus and pretrained weights of domain-specific BERT models freely available at https://github.com/cimm-kzn/RuDReC
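Both reported results are macro F1 scores, that is, the unweighted mean of the per-class F1 scores. The following is a minimal sketch of that metric for the simple single-label case. It is not the authors' evaluation code: the function name `macro_f1` and the label strings in the usage note are hypothetical, and the abstract does not say whether the NER score is computed at the entity or token level.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over every class seen in either sequence."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)  # true positives
        fp = sum(t != label and p == label for t, p in pairs)  # false positives
        fn = sum(t == label and p != label for t, p in pairs)  # false negatives
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); treated as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)

    # Every class contributes equally, regardless of how frequent it is
    return sum(scores) / len(scores) if scores else 0.0
```

For example, `macro_f1(["ADR", "ADR", "Indication"], ["ADR", "Other", "Indication"])` averages the F1 of three classes with equal weight, so a rare class affects the result as much as a frequent one. In a multi-label setting such as the sentence classification task, the same per-class computation would be applied to each label as a separate binary decision.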