Paper Title

Deep Entity Matching with Pre-Trained Language Models

Authors

Yuliang Li, Jinfeng Li, Yoshihiko Suhara, AnHai Doan, Wang-Chiew Tan

Abstract

We present Ditto, a novel entity matching system based on pre-trained Transformer-based language models. We fine-tune and cast EM as a sequence-pair classification problem to leverage such models with a simple architecture. Our experiments show that a straightforward application of language models such as BERT, DistilBERT, or RoBERTa pre-trained on large text corpora already significantly improves the matching quality and outperforms previous state-of-the-art (SOTA), by up to 29% of F1 score on benchmark datasets. We also developed three optimization techniques to further improve Ditto's matching capability. Ditto allows domain knowledge to be injected by highlighting important pieces of input information that may be of interest when making matching decisions. Ditto also summarizes strings that are too long so that only the essential information is retained and used for EM. Finally, Ditto adapts a SOTA technique on data augmentation for text to EM to augment the training data with (difficult) examples. This way, Ditto is forced to learn "harder" to improve the model's matching capability. The optimizations we developed further boost the performance of Ditto by up to 9.8%. Perhaps more surprisingly, we establish that Ditto can achieve the previous SOTA results with at most half the number of labeled data. Finally, we demonstrate Ditto's effectiveness on a real-world large-scale EM task. On matching two company datasets consisting of 789K and 412K records, Ditto achieves a high F1 score of 96.5%.
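To make the core formulation concrete, the sketch below shows how a pair of entity records can be cast as a sequence-pair classification problem with a pre-trained Transformer, in the spirit of Ditto but not the authors' released code. It is a minimal sketch assuming the Hugging Face transformers library; the serialization markers, the distilbert-base-uncased checkpoint, and the example records are illustrative assumptions, and the classification head would still need to be fine-tuned on labeled match/no-match pairs before its output is meaningful.

```python
# Minimal sketch: entity matching as sequence-pair classification with a
# pre-trained Transformer (illustrative only; not the official Ditto code).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification


def serialize(record: dict) -> str:
    """Flatten an entity record into a single string.
    Ditto uses special column/value markers for this; the plain
    'COL ... VAL ...' tokens here are a simplified stand-in."""
    return " ".join(f"COL {attr} VAL {val}" for attr, val in record.items())


# Hypothetical pair of records to compare (assumed example data).
left = {"name": "Apple iPhone 12", "price": "699"}
right = {"name": "iPhone 12 (Apple)", "price": "699.00"}

# Any BERT-style checkpoint could be used; DistilBERT is chosen here
# purely as a small illustrative example.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2  # two classes: no-match / match
)

# Encode the two serialized records as one sequence pair:
# [CLS] serialize(left) [SEP] serialize(right) [SEP]
inputs = tokenizer(
    serialize(left),
    serialize(right),
    truncation=True,
    max_length=256,
    return_tensors="pt",
)

with torch.no_grad():
    logits = model(**inputs).logits

# With an untrained classification head this probability is not yet
# informative; fine-tuning on labeled pairs is what makes it a matcher.
prob_match = torch.softmax(logits, dim=-1)[0, 1].item()
print(f"match probability (before fine-tuning): {prob_match:.3f}")
```

The design point this illustrates is the one the abstract emphasizes: once each record is serialized to text, the matcher is nothing more than a standard sequence-pair classifier on top of a pre-trained language model, so Ditto's optimizations (span highlighting, summarization, data augmentation) operate on the serialized input rather than on a specialized EM architecture.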
