Paper Title
UBERT: A Novel Language Model for Synonymy Prediction at Scale in the UMLS Metathesaurus
Paper Authors
Abstract
The UMLS Metathesaurus integrates more than 200 biomedical source vocabularies. During the Metathesaurus construction process, synonymous terms are clustered into concepts by human editors, assisted by lexical similarity algorithms. This process is error-prone and time-consuming. Recently, a deep learning model (LexLM) was developed for the UMLS Vocabulary Alignment (UVA) task. This work introduces UBERT, a BERT-based language model pretrained on UMLS terms via a supervised Synonymy Prediction (SP) task that replaces the original Next Sentence Prediction (NSP) task. The effectiveness of UBERT for the UMLS Metathesaurus construction process is evaluated using the UMLS Vocabulary Alignment (UVA) task. We show that UBERT outperforms LexLM, as well as biomedical BERT-based models. Key to the performance of UBERT are the synonymy prediction task specifically developed for UBERT, the tight alignment of the training data to the UVA task, and the similarity of the models used for pretraining UBERT.
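The supervised SP task described above is a binary sentence-pair classification problem: positive examples are term pairs that share a UMLS concept (CUI), negatives pair terms from different concepts. As a minimal sketch of how such training pairs could be built from (CUI, term) records, the helper below is hypothetical (the function name, the sampling scheme, and the example CUIs/terms are illustrative assumptions, not the paper's actual data pipeline):

```python
import itertools
import random

def make_sp_pairs(atoms, seed=0):
    """Build Synonymy Prediction training pairs from (CUI, term) records.

    Positive pairs (label 1): two terms under the same CUI, i.e. synonyms.
    Negative pairs (label 0): one negative sampled per positive, pairing
    terms drawn from two different CUIs.
    """
    rng = random.Random(seed)

    # Group terms by concept identifier.
    by_cui = {}
    for cui, term in atoms:
        by_cui.setdefault(cui, []).append(term)

    # Positives: all within-concept term combinations.
    pairs = [
        (a, b, 1)
        for terms in by_cui.values()
        for a, b in itertools.combinations(terms, 2)
    ]

    # Negatives: for each positive, pick two distinct CUIs and one term each.
    cuis = list(by_cui)
    for _ in range(len(pairs)):
        cui_a, cui_b = rng.sample(cuis, 2)
        pairs.append((rng.choice(by_cui[cui_a]), rng.choice(by_cui[cui_b]), 0))

    rng.shuffle(pairs)
    return pairs

# Illustrative toy records (example CUIs/terms, not real UMLS data).
atoms = [
    ("C0004238", "atrial fibrillation"),
    ("C0004238", "auricular fibrillation"),
    ("C0004238", "AF"),
    ("C0027051", "myocardial infarction"),
    ("C0027051", "heart attack"),
]
pairs = make_sp_pairs(atoms)
```

Each resulting `(term_a, term_b, label)` triple could then be fed to a BERT-style model as a `[CLS] term_a [SEP] term_b [SEP]` input with a binary classification head, in place of the NSP objective.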