论文标题
大规模数据上的近似分位数调查(技术报告)
A Survey of Approximate Quantile Computation on Large-scale Data (Technical Report)
论文作者
论文摘要
随着数据量的广泛增长,数据分析有助于提取大规模数据的元数据。但是,很难计算一种元数据,即顺序统计,因为它们不是可合并的或增量的。因此,时间和记忆空间的局限性不支持其在大规模数据上的计算。在本文中,我们专注于订单统计数据,并介绍对近似分位数计算研究的全面分析。确定性算法和随机算法都涵盖了计算流媒体模型或分布式模型的大约分位数的随机算法。然后,提出了多种用于提高各种情况下近似分位数算法效率和性能的多种技术,例如偏斜的数据和高速数据流。最后,我们以不同语言的现有包裹的覆盖范围,并简要讨论了该领域的未来方向。
As data volume grows extensively, data profiling helps to extract metadata of large-scale data. However, one kind of metadata, order statistics, is difficult to be computed because they are not mergeable or incremental. Thus, the limitation of time and memory space does not support their computation on large-scale data. In this paper, we focus on an order statistic, quantiles, and present a comprehensive analysis of studies on approximate quantile computation. Both deterministic algorithms and randomized algorithms that compute approximate quantiles over streaming models or distributed models are covered. Then, multiple techniques for improving the efficiency and performance of approximate quantile algorithms in various scenarios, such as skewed data and high-speed data streams, are presented. Finally, we conclude with coverage of existing packages in different languages and with a brief discussion of the future direction in this area.