通过正则表达式的骨料比提取的复杂性

论文标题

通过正则表达式的骨料比提取的复杂性

The Complexity of Aggregates over Extractions by Regular Expressions

论文作者

Doleschal, Johannes, Kimelfeld, Benny, Martens, Wim

论文摘要

带有捕获变量的正则表达式，也称为正则形式，从文本中提取跨度的关系（由其起始索引和最终索引识别）。反过来，常规文档跨度类别是关系代数下的正则公式的关闭。我们在常规文档跨度的基础上研究了查询文本的计算复杂性，例如总和，平均值和分位数。为此，我们在常规文档跨度上正式定义了聚集功能，并分析了精确和近似计算的计算复杂性。更确切地说，我们表明，在有限的情况下，所有研究的骨料函数都可以在多项式时间内计算。但是，通常，即使精确的计算是棘手的，仍然可以使用完全多项式的随机近似方案（FPRAS）近似某些聚集体。

Regular expressions with capture variables, also known as regex-formulas, extract relations of spans (intervals identified by their start and end indices) from text. In turn, the class of regular document spanners is the closure of the regex formulas under the Relational Algebra. We investigate the computational complexity of querying text by aggregate functions, such as sum, average, and quantile, on top of regular document spanners. To this end, we formally define aggregate functions over regular document spanners and analyze the computational complexity of exact and approximate computation. More precisely, we show that in a restricted case, all studied aggregate functions can be computed in polynomial time. In general, however, even though exact computation is intractable, some aggregates can still be approximated with fully polynomial-time randomized approximation schemes (FPRAS).

下载PDF全文

下载文献需遵守相关版权规定

论文标题