Paper title
Ontology-Driven and Weakly Supervised Rare Disease Identification from Clinical Notes
Paper authors
Paper abstract
Computational text phenotyping is the practice of identifying patients with certain disorders and traits from clinical notes. Rare diseases are challenging to identify because few cases are available for machine learning and data annotation requires domain experts. We propose a method using ontologies and weak supervision, with recent pre-trained contextual representations from Bi-directional Transformers (e.g. BERT). The ontology-based framework includes two steps: (i) Text-to-UMLS, extracting phenotypes by contextually linking mentions to concepts in the Unified Medical Language System (UMLS), with a Named Entity Recognition and Linking (NER+L) tool, SemEHR, and weak supervision with customised rules and contextual mention representations; (ii) UMLS-to-ORDO, matching UMLS concepts to rare diseases in the Orphanet Rare Disease Ontology (ORDO). The weakly supervised approach learns a phenotype confirmation model that improves Text-to-UMLS linking without annotated data from domain experts. We evaluated the approach on three annotated clinical datasets from two institutions in the US and the UK: MIMIC-III discharge summaries, MIMIC-III radiology reports, and NHS Tayside brain imaging reports. The improvements in precision were pronounced (absolute gains of over 30% to 50% for Text-to-UMLS linking), with almost no loss of recall compared to the existing NER+L tool, SemEHR. Results on radiology reports from MIMIC-III and NHS Tayside were consistent with those on the discharge summaries. The overall pipeline for processing clinical notes can extract rare disease cases, most of which are not captured in structured data (manually assigned ICD codes). We discuss the usefulness of the weak supervision approach and propose directions for future studies.
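The two-step framework described in the abstract can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the `Mention` class, the placeholder identifiers (`CUI_A`, `ORDO_1`, and so on), the mention-length rule in `weak_label`, and the `confirm` callback are hypothetical and are not taken from the paper. In the paper, confirmation is done by a learned model over contextual mention representations; the sketch substitutes a simple rule so that it runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Mention:
    text: str  # surface form found in the clinical note
    cui: str   # UMLS concept proposed by an NER+L tool (placeholder IDs here)


# Toy UMLS-to-ORDO mapping. The identifiers are placeholders,
# not real UMLS or ORDO codes.
UMLS_TO_ORDO = {"CUI_A": "ORDO_1", "CUI_B": "ORDO_2"}


def weak_label(mention: Mention, min_chars: int = 4) -> Optional[int]:
    """Assign a weak label without expert annotation.

    Hypothetical rule: very short surface forms (often ambiguous
    abbreviations) are treated as weak negatives, longer ones as weak
    positives. Mentions whose concept has no ORDO match get no label.
    """
    if mention.cui not in UMLS_TO_ORDO:
        return None
    return 1 if len(mention.text) >= min_chars else 0


def link_to_ordo(
    mentions: List[Mention],
    confirm: Callable[[Mention], bool],
) -> List[Tuple[str, str]]:
    """Step (ii): keep confirmed mentions and map their concepts to ORDO."""
    return [
        (m.text, UMLS_TO_ORDO[m.cui])
        for m in mentions
        if m.cui in UMLS_TO_ORDO and confirm(m)
    ]


# Step (i) output as it might come from an NER+L tool (made-up example).
mentions = [
    Mention("Kawasaki disease", "CUI_A"),
    Mention("HD", "CUI_B"),
    Mention("pneumonia", "CUI_C"),
]

# Stand-in for the phenotype confirmation model: reuse the weak rule.
confirmed = link_to_ordo(mentions, confirm=lambda m: weak_label(m) == 1)
print(confirmed)
```

In a fuller version, the weak labels would serve as training targets for a classifier over contextual embeddings of each mention, and that classifier would replace the rule passed as `confirm`.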