Paper Title

Storyboard: Optimizing Precomputed Summaries for Aggregation

Authors

Edward Gan, Peter Bailis, Moses Charikar

Abstract

An emerging class of data systems partition their data and precompute approximate summaries (i.e., sketches and samples) for each segment to reduce query costs. They can then aggregate and combine the segment summaries to estimate results without scanning the raw data. However, given limited storage space, each summary introduces approximation errors that affect query accuracy. For instance, systems that use existing mergeable summaries cannot reduce query error below the error of an individual precomputed summary. We introduce Storyboard, a query system that optimizes item frequency and quantile summaries for accuracy when aggregating over multiple segments. Compared to conventional mergeable summaries, Storyboard leverages additional memory available for summary construction and aggregation to derive a more precise combined result. This reduces error by up to 25x over interval aggregations and 4.4x over data cube aggregations on industrial datasets compared to standard summarization methods, with provable worst-case error guarantees.
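
To make the setting in the abstract concrete, here is a minimal sketch of the baseline workflow it describes: each segment gets a small precomputed summary, and a query combines those summaries instead of scanning raw data. The truncated top-k summary and the names `build_summary` and `estimate_frequency` are illustrative assumptions chosen for brevity. They are not Storyboard's method or API.

```python
from collections import Counter

def build_summary(segment, k):
    """Hypothetical lossy summary: keep exact counts for the k most
    frequent items in one segment and drop everything else."""
    return dict(Counter(segment).most_common(k))

def estimate_frequency(summaries, item):
    """Estimate an item's total count by summing the per-segment
    summary entries; items missing from a summary count as 0."""
    return sum(s.get(item, 0) for s in summaries)

# Toy data: three segments, e.g. one per time interval.
segments = [
    ["a", "a", "a", "b", "b", "c"],
    ["a", "b", "b", "b", "c", "c"],
    ["c", "c", "c", "a", "a", "b"],
]

# Precompute one size-2 summary per segment (limited storage).
summaries = [build_summary(seg, k=2) for seg in segments]

# Compare the combined estimates against the exact counts.
true_counts = Counter(x for seg in segments for x in seg)
for item in ["a", "b", "c"]:
    est = estimate_frequency(summaries, item)
    print(item, "true:", true_counts[item], "estimate:", est)
```

In this toy run each item's true count is 6 and its estimate is 5, because every item is dropped from exactly one segment's summary. Each summary is built without regard to how it will later be combined, so per-segment losses carry into the aggregate. This is the kind of multi-segment error the abstract says Storyboard is designed to reduce.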
