Approximate Partition Selection for Big-Data Workloads using Summary Statistics

Paper Title

Approximate Partition Selection for Big-Data Workloads using Summary Statistics

Authors

Kexin Rong, Yao Lu, Peter Bailis, Srikanth Kandula, Philip Levis

Abstract

Many big-data clusters store data in large partitions that support access only at a coarse, partition-level granularity. As a result, approximate query processing via row-level sampling is inefficient, often requiring reads of many partitions. In this work, we seek to answer queries quickly and approximately by reading a subset of the data partitions and combining partial answers in a weighted manner, without modifying the data layout. We illustrate how to efficiently perform this query processing using a set of pre-computed summary statistics, which inform the choice of partitions and weights. We develop novel means of using the statistics to assess the similarity and importance of partitions. Our experiments on several datasets and data layouts demonstrate that, to achieve the same relative error as uniform partition sampling, our techniques offer a $2.7\times$ to $70\times$ reduction in the number of partitions read, and the statistics stored per partition require less than 100 KB.
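
The idea of reading only some partitions and combining their partial answers with weights can be illustrated with a minimal sketch. This is not the paper's algorithm: it uses a standard importance-sampling estimator (sampling with replacement, each partial answer scaled by the inverse of its selection probability), and the names `read_partition_sum`, `estimate_total`, and the toy `partitions` data are hypothetical. As a simplifying assumption, the "summary statistic" here is each partition's own sum, which makes the informed estimate exact; real statistics would only approximate it.

```python
import random

def read_partition_sum(partition):
    # Stand-in for the expensive step: scanning one partition
    # to compute its partial answer to a SUM query.
    return sum(partition)

def estimate_total(partitions, probs, n_reads, seed=0):
    # probs must sum to 1; partition i is drawn with probability probs[i].
    rng = random.Random(seed)
    picked = rng.choices(range(len(partitions)), weights=probs, k=n_reads)
    # Scale each partial answer by 1 / (n_reads * p_i) so the
    # weighted combination is an unbiased estimate of the full sum.
    return sum(
        read_partition_sum(partitions[i]) / (n_reads * probs[i])
        for i in picked
    )

# Toy data (assumed): three partitions of very different importance.
partitions = [[1, 2, 3], [10, 20], [100]]
exact = sum(sum(p) for p in partitions)

# Baseline: uniform partition sampling.
uniform = [1 / len(partitions)] * len(partitions)

# Hypothetical per-partition statistic, normalized into probabilities.
stats = [sum(p) for p in partitions]
informed = [s / sum(stats) for s in stats]

uniform_estimate = estimate_total(partitions, uniform, n_reads=2)
informed_estimate = estimate_total(partitions, informed, n_reads=2)
```

With probabilities proportional to the true partition sums, every draw contributes the same amount and the estimate matches the exact total after reading only two partitions, whereas the uniform estimate depends on which partitions happen to be drawn.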

Paper Title