PhenoTagger: A Hybrid Method for Phenotype Concept Recognition using Human Phenotype Ontology

Paper Title

PhenoTagger: A Hybrid Method for Phenotype Concept Recognition using Human Phenotype Ontology

Authors

Ling Luo, Shankai Yan, Po-Ting Lai, Daniel Veltri, Andrew Oler, Sandhya Xirasagar, Rajarshi Ghosh, Morgan Similuk, Peter N. Robinson, Zhiyong Lu

Abstract

Automatic phenotype concept recognition from unstructured text remains a challenging task in biomedical text mining research. Previous works that address the task typically use dictionary-based matching methods, which can achieve high precision but suffer from lower recall. Recently, machine learning-based methods have been proposed to identify biomedical concepts, which can recognize more unseen concept synonyms by automatic feature learning. However, most methods require large corpora of manually annotated data for model training, which is difficult to obtain due to the high cost of human annotation. In this paper, we propose PhenoTagger, a hybrid method that combines both dictionary and machine learning-based methods to recognize Human Phenotype Ontology (HPO) concepts in unstructured biomedical text. We first use all concepts and synonyms in HPO to construct a dictionary, which is then used to automatically build a distantly supervised training dataset for machine learning. Next, a cutting-edge deep learning model is trained to classify each candidate phrase (n-gram from input sentence) into a corresponding concept label. Finally, the dictionary and machine learning-based prediction results are combined for improved performance. Our method is validated with two HPO corpora, and the results show that PhenoTagger compares favorably to previous methods. In addition, to demonstrate the generalizability of our method, we retrained PhenoTagger using the disease ontology MEDIC for disease concept recognition to investigate the effect of training on different ontologies. Experimental results on the NCBI disease corpus show that PhenoTagger without requiring manually annotated training data achieves competitive performance as compared with state-of-the-art supervised methods.
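The pipeline in the abstract (ontology dictionary, n-gram candidates, phrase classification, combined predictions) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the two-entry ontology is a toy stand-in for HPO, `stub_classifier` is a hypothetical placeholder for the trained deep learning model, and the overlap rule in `combine` (keep the higher-scoring span) is an assumed strategy that the abstract does not specify.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

# (start_token, end_token_exclusive, concept_id, score)
Span = Tuple[int, int, str, float]

# Toy stand-in for HPO: concept ID -> name and synonyms (illustrative only).
TOY_ONTOLOGY: Dict[str, List[str]] = {
    "HP:0001250": ["seizure", "seizures", "epileptic seizure"],
    "HP:0000252": ["microcephaly", "small head"],
}


def build_dictionary(ontology: Dict[str, List[str]]) -> Dict[str, str]:
    """Map every lower-cased name or synonym to its concept ID."""
    return {name.lower(): cid for cid, names in ontology.items() for name in names}


def distant_training_pairs(ontology: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Distant supervision: each (synonym, concept ID) pair is a positive example."""
    return [(name, cid) for cid, names in ontology.items() for name in names]


def ngrams(tokens: List[str], max_n: int) -> Iterator[Tuple[int, int, str]]:
    """Yield every candidate phrase of up to max_n tokens with its token offsets."""
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_n, len(tokens)) + 1):
            yield start, end, " ".join(tokens[start:end])


def dictionary_tag(tokens: List[str], lookup: Dict[str, str], max_n: int = 5) -> List[Span]:
    """Exact dictionary matching over all n-gram candidates."""
    return [(s, e, lookup[p.lower()], 1.0)
            for s, e, p in ngrams(tokens, max_n) if p.lower() in lookup]


def model_tag(tokens: List[str],
              classify: Callable[[str], Tuple[Optional[str], float]],
              max_n: int = 5, threshold: float = 0.95) -> List[Span]:
    """Classify each candidate phrase; keep confident, non-empty predictions."""
    spans = []
    for s, e, p in ngrams(tokens, max_n):
        cid, score = classify(p)
        if cid is not None and score >= threshold:
            spans.append((s, e, cid, score))
    return spans


def combine(dict_spans: List[Span], model_spans: List[Span]) -> List[Span]:
    """Assumed merge rule: greedily keep non-overlapping spans, best score first."""
    ranked = sorted(dict_spans + model_spans,
                    key=lambda sp: (-sp[3], -(sp[1] - sp[0]), sp[0]))
    kept: List[Span] = []
    for sp in ranked:
        if all(sp[1] <= k[0] or sp[0] >= k[1] for k in kept):
            kept.append(sp)
    return sorted(kept)


def stub_classifier(phrase: str) -> Tuple[Optional[str], float]:
    """Hypothetical stand-in for the trained model: knows one unseen synonym."""
    if phrase.lower() == "recurrent fits":
        return "HP:0001250", 0.97
    return None, 0.0


tokens = "The patient had recurrent fits and a small head".split()
lookup = build_dictionary(TOY_ONTOLOGY)
result = combine(dictionary_tag(tokens, lookup), model_tag(tokens, stub_classifier))
print(result)
```

In this toy run the dictionary finds "small head" while the stub model contributes "recurrent fits", a phrase absent from the dictionary, which mirrors the recall gain the abstract attributes to the machine learning component.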
