Paper Title

What if we had no Wikipedia? Domain-independent Term Extraction from a Large News Corpus

Paper Authors

Yonatan Bilu, Shai Gretz, Edo Cohen, Noam Slonim

Paper Abstract

One of the most impressive human endeavors of the past two decades is the collection and categorization of human knowledge in the free and accessible format that is Wikipedia. In this work we ask what makes a term worthy of entering this edifice of knowledge, and having a page of its own in Wikipedia? To what extent is this a natural product of on-going human discourse and discussion rather than an idiosyncratic choice of Wikipedia editors? Specifically, we aim to identify such "wiki-worthy" terms in a massive news corpus, and see if this can be done with no, or minimal, dependency on actual Wikipedia entries. We suggest a five-step pipeline for doing so, providing baseline results for all five, and the relevant datasets for benchmarking them. Our work sheds new light on the domain-specific Automatic Term Extraction problem, with the problem at hand being a domain-independent variant of it.
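The abstract does not enumerate the five pipeline steps, so the sketch below is a minimal, hypothetical illustration of the general approach it describes: extracting candidate terms from a news corpus and scoring them by corpus statistics alone, with no Wikipedia lookup. The step breakdown, function names, and the simple frequency-based heuristic are illustrative assumptions, not the paper's actual method.

```python
# Hypothetical sketch of a Wikipedia-free term-extraction pipeline.
# The step assignments below are assumptions for illustration; the
# paper defines its own five steps, which the abstract does not list.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "for"}

def extract_candidates(sentence: str) -> list[str]:
    """Step 1 (assumed): pull capitalized multi-word spans as rough
    noun-phrase candidates, a crude stand-in for a syntactic chunker."""
    return re.findall(r"(?:[A-Z][\w-]+(?:\s+[A-Z][\w-]+)+)", sentence)

def normalize(term: str) -> str:
    """Step 2 (assumed): canonicalize surface forms."""
    return " ".join(w for w in term.lower().split() if w not in STOPWORDS)

def score_terms(corpus: list[str]) -> Counter:
    """Steps 3-4 (assumed): count candidate occurrences across the
    corpus, so that recurrence in ongoing discourse, rather than a
    Wikipedia entry, signals a term's salience."""
    counts = Counter()
    for sentence in corpus:
        for cand in extract_candidates(sentence):
            counts[normalize(cand)] += 1
    return counts

def wiki_worthy(corpus: list[str], min_count: int = 2) -> list[str]:
    """Step 5 (assumed): threshold the scores to produce the final
    list of 'wiki-worthy' term candidates."""
    return [t for t, c in score_terms(corpus).items() if c >= min_count]

if __name__ == "__main__":
    news = [
        "The European Central Bank raised rates again.",
        "Critics say the European Central Bank acted too late.",
        "A new Machine Learning model was announced.",
    ]
    print(wiki_worthy(news))  # ['european central bank']
```

In this toy run, only "European Central Bank" recurs across sentences and clears the threshold, mimicking the intuition that terms sustained by discourse, not editorial choice, are the wiki-worthy ones.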
