High-throughput relation extraction algorithm development associating knowledge articles and electronic health records

Paper title

High-throughput relation extraction algorithm development associating knowledge articles and electronic health records

Authors

Lin, Yucong; Lu, Keming; Chen, Yulin; Hong, Chuan; Yu, Sheng

Abstract

Objective: Medical relations are core components of the medical knowledge graphs needed for healthcare artificial intelligence. However, conventional algorithm development requires expert annotation, which creates a major bottleneck for mining new relations. In this paper, we present Hi-RES, a framework for high-throughput relation extraction algorithm development. We also show that combining knowledge articles with electronic health records (EHRs) significantly increases classification accuracy. Methods: We use relation triplets obtained from structured databases and semistructured webpages to label sentences from target corpora as positive training samples. Two methods are also provided for creating improved negative samples by combining positive samples with naïve negative samples. We propose a common model that summarizes sentence information using large-scale pretrained language models and multi-instance attention, and then joins this summary with concept embeddings trained from the EHRs for relation prediction. Results: We apply the Hi-RES framework to develop classification algorithms for disorder-disorder relations and disorder-location relations. Millions of sentences are created as training data. Using pretrained language models or EHR-based embeddings individually provides considerable accuracy increases over previous models. Combining them increases the accuracy further, to 0.947 and 0.998 for the two sets of relations, respectively, which is 10 to 17 percentage points higher than previous models. Conclusion: Hi-RES is an efficient framework for achieving high-throughput and accurate relation extraction algorithm development.
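
The positive-sample labeling step described under Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `label_sentences`, the case-insensitive substring matching rule, and the toy triplet and sentences are all assumptions made for illustration. The abstract does not specify how concept mentions are detected in a sentence.

```python
from typing import Iterable, List, NamedTuple, Tuple


class LabeledSentence(NamedTuple):
    sentence: str
    head: str
    tail: str
    relation: str


def label_sentences(
    sentences: Iterable[str],
    triplets: Iterable[Tuple[str, str, str]],
) -> List[LabeledSentence]:
    """Label a sentence as a positive sample for (head, relation, tail)
    when it mentions both the head and the tail concept.

    Assumption: a "mention" is a case-insensitive substring match.
    A real system would likely use a concept recognizer instead.
    """
    triplet_list = list(triplets)
    labeled: List[LabeledSentence] = []
    for sentence in sentences:
        lowered = sentence.lower()
        for head, relation, tail in triplet_list:
            if head.lower() in lowered and tail.lower() in lowered:
                labeled.append(LabeledSentence(sentence, head, tail, relation))
    return labeled


# Toy inputs, invented for illustration only.
triplets = [("diabetes", "may_cause", "neuropathy")]
sentences = [
    "Diabetes can lead to neuropathy over time.",
    "Neuropathy is a nerve disorder.",
]
labeled = label_sentences(sentences, triplets)
```

Only the first toy sentence mentions both concepts, so only it is kept as a positive sample. The second mentions a single concept and is skipped.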
