论文标题
扁平:快速,轻巧和准确的基数估计方法
FLAT: Fast, Lightweight and Accurate Method for Cardinality Estimation
论文作者
论文摘要
查询优化器依靠准确的基数估计(CARDEST)来制定良好的执行计划。 CARDEST的核心问题是如何以准确而紧凑的方式对属性的丰富关节分布进行建模。尽管进行了数十年的研究,但现有的方法要么仅使用独立的分解来简化模型,这会导致估计不准确,要么通过无效的条件分解而使它们复杂化,而无需任何独立假设,从而导致概率计算缓慢。在本文中,我们提出了Flat,这是一种Cardest方法,在概率计算中同时快速,模型大小轻巧,估计质量准确。平面的关键思想是一种新颖的无监督图形模型,称为FSPN。它利用独立和条件分解来自适应地模拟不同级别的属性相关性,从而使其优势相吻合。 FLAT支持在基础FSPN模型上几乎衬里时间的有效在线概率计算,提供有效的离线模型构建,并启用增量模型更新。它可以估计单个表查询和多表连接查询的基数。广泛的实验研究表明,在众所周知的IMDB基准上,平面比现有Cardest方法的优越性:Flat达到1至5个数量级的准确性,1至3个数量级的概率计算速度和1至2个数量级的存储成本较低。我们还将公寓集成到Postgres中以执行端到端测试。它的基准工作负载将查询执行时间提高了12.9%,使用真实的基数非常接近最佳结果14.2%。
Query optimizers rely on accurate cardinality estimation (CardEst) to produce good execution plans. The core problem of CardEst is how to model the rich joint distribution of attributes in an accurate and compact manner. Despite decades of research, existing methods either over simplify the models only using independent factorization which leads to inaccurate estimates, or over complicate them by lossless conditional factorization without any independent assumption which results in slow probability computation. In this paper, we propose FLAT, a CardEst method that is simultaneously fast in probability computation, lightweight in model size and accurate in estimation quality. The key idea of FLAT is a novel unsupervised graphical model, called FSPN. It utilizes both independent and conditional factorization to adaptively model different levels of attributes correlations, and thus dovetails their advantages. FLAT supports efficient online probability computation in near liner time on the underlying FSPN model, provides effective offline model construction and enables incremental model updates. It can estimate cardinality for both single table queries and multi table join queries. Extensive experimental study demonstrates the superiority of FLAT over existing CardEst methods on well known IMDB benchmarks: FLAT achieves 1 to 5 orders of magnitude better accuracy, 1 to 3 orders of magnitude faster probability computation speed and 1 to 2 orders of magnitude lower storage cost. We also integrate FLAT into Postgres to perform an end to end test. It improves the query execution time by 12.9% on the benchmark workload, which is very close to the optimal result 14.2% using the true cardinality.