Paper Title
Novel Entity Discovery from Web Tables
Authors
Abstract
When working with any kind of knowledge base (KB), one has to make sure it is as complete and as up-to-date as possible. Both tasks are non-trivial, as they require recall-oriented efforts to determine which entities and relationships are missing from the KB. As such, they require a significant amount of labor. Tables on the Web, on the other hand, are abundant and have distinct potential to assist with these tasks. In particular, we can leverage the content of such tables to discover new entities, properties, and relationships. Because web tables typically contain only raw textual content, we first need to determine which cells refer to which known entities, a task we dub table-to-KB matching. This first task aims to infer table semantics by linking table cells and column headings to elements of a KB. The second task then builds upon these linked entities and properties, not only to identify novel ones in the same table but also to bootstrap their types and additional relationships. We refer to this process as novel entity discovery and, to the best of our knowledge, it is the first effort to mine the unlinked cells in web tables. Our method identifies not only out-of-KB ("novel") information but also novel aliases for in-KB ("known") entities. When evaluated on three purpose-built test collections, our proposed approaches obtain a marked improvement in precision over our baselines while keeping recall stable.
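To make the two-step idea in the abstract concrete, the following is a minimal illustrative sketch, not the authors' method. It assumes a toy in-memory KB (`TOY_KB`), exact surface-form lookup as a stand-in for table-to-KB matching, and plain string similarity with an arbitrary threshold as a stand-in for deciding whether an unlinked cell is a new alias of a known entity or a novel candidate. All names, data, and the threshold are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical toy KB: canonical entity name -> known lowercase surface forms.
TOY_KB = {
    "Barack Obama": {"barack obama", "obama"},
    "Angela Merkel": {"angela merkel"},
}


def link_cell(cell, kb):
    """Step 1 (stand-in): link a cell to a KB entity by exact surface-form match."""
    key = cell.strip().lower()
    for entity, forms in kb.items():
        if key in forms:
            return entity
    return None


def classify_unlinked(cell, kb, alias_threshold=0.8):
    """Step 2 (stand-in): label an unlinked cell as a possible alias or a novel candidate.

    The similarity measure and the threshold are assumptions for illustration only.
    """
    key = cell.strip().lower()
    best_entity, best_score = None, 0.0
    for entity, forms in kb.items():
        for form in forms:
            score = SequenceMatcher(None, key, form).ratio()
            if score > best_score:
                best_entity, best_score = entity, score
    if best_score >= alias_threshold:
        return ("alias", best_entity)
    return ("novel", None)


def discover(column, kb):
    """Run both steps over the cells of one table column."""
    results = {}
    for cell in column:
        entity = link_cell(cell, kb)
        results[cell] = ("known", entity) if entity else classify_unlinked(cell, kb)
    return results


if __name__ == "__main__":
    for cell, label in discover(["Barack Obama", "Angela Merkell", "Jacinda Ardern"], TOY_KB).items():
        print(cell, label)
```

In this sketch, a cell that matches a known surface form is treated as linked, a near-match is flagged as a candidate alias, and everything else is flagged as a candidate novel entity. A realistic system would presumably also use table context, such as the linked cells in the same row or column, rather than string similarity alone.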