论文标题
通过递归分区进行近乎最佳的分布式带连接
Near-Optimal Distributed Band-Joins through Recursive Partitioning
论文作者
论文摘要
我们考虑对分布式系统中的频段连接的运行时间优化,例如云。为了平衡跨工人机器的负载,必须对输入进行分区,这会导致重复。我们探讨了如何在每个工人的最大负载和两个关系之间的频段加入之间解决这种张力。先前的工作遭受了高优化成本或过度限制的被考虑的分区(导致次优联绩效)。我们的主要见解是,使用适当的分级评分措施对Join-Attribute空间进行递归分区可以达到低优化成本和低接合成本。这是第一种方法不仅对一维带连接有效,而且对于连接多个属性也是有效的。实验表明,我们的方法能够找到分区的分区,该分区均在下限的10%以内的每个工人的最大负载和输入重复范围内的各种设置,从而显着改善了先前的工作。
We consider running-time optimization for band-joins in a distributed system, e.g., the cloud. To balance load across worker machines, input has to be partitioned, which causes duplication. We explore how to resolve this tension between maximum load per worker and input duplication for band-joins between two relations. Previous work suffered from high optimization cost or considered partitionings that were too restricted (resulting in suboptimal join performance). Our main insight is that recursive partitioning of the join-attribute space with the appropriate split scoring measure can achieve both low optimization cost and low join cost. It is the first approach that is not only effective for one-dimensional band-joins but also for joins on multiple attributes. Experiments indicate that our method is able to find partitionings that are within 10% of the lower bound for both maximum load per worker and input duplication for a broad range of settings, significantly improving over previous work.