Query-Based Keyphrase Extraction from Long Documents

Paper title

Query-Based Keyphrase Extraction from Long Documents

Authors

Martin Docekal, Pavel Smrz

Abstract

Transformer-based architectures in natural language processing impose input size limits that can be problematic when long documents need to be processed. This paper overcomes this issue for keyphrase extraction by chunking long documents while keeping a global context as a query that defines the topic for which relevant keyphrases should be extracted. The developed system employs a pre-trained BERT model and adapts it to estimate the probability that a given text span forms a keyphrase. We experimented with various context sizes on two popular datasets, Inspec and SemEval, and on a large novel dataset. The presented results show that, on long documents, a shorter context with a query outperforms a longer one without the query.
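
The chunking idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `build_query_contexts`, the whitespace tokenization, the non-overlapping chunks, and the literal `[SEP]` marker are all assumptions made for illustration. It only shows how one query might be paired with every chunk so that each chunk fits a fixed context size while still carrying the global topic.

```python
from typing import List


def build_query_contexts(query: str, document: str, context_size: int) -> List[str]:
    """Split `document` into chunks and prepend the same `query` to each one.

    Assumptions (hypothetical, not taken from the paper): whitespace
    tokenization, non-overlapping chunks of at most `context_size` tokens,
    and a literal "[SEP]" marker between the query and the chunk.
    """
    if context_size <= 0:
        raise ValueError("context_size must be positive")
    tokens = document.split()
    contexts = []
    for start in range(0, len(tokens), context_size):
        chunk = " ".join(tokens[start:start + context_size])
        # Every chunk is paired with the same query, which acts as global context.
        contexts.append(f"{query} [SEP] {chunk}")
    return contexts


contexts = build_query_contexts(
    query="keyphrase extraction",
    document="one two three four five six seven",
    context_size=3,
)
for context in contexts:
    print(context)
```

In a full system, each resulting string would presumably be passed to a model that scores candidate text spans within the chunk; that scoring step is outside the scope of this sketch.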
