Paper Title
R*-Grove: Balanced Spatial Partitioning for Large-scale Datasets
Authors
Abstract
The rapid growth of big spatial data has urged the research community to develop several big spatial data systems. Regardless of their architecture, one of the fundamental requirements of all these systems is to spatially partition the data efficiently across machines. The core challenge of big spatial partitioning is to build partitions of high spatial quality while also taking advantage of distributed processing models by keeping the partitions load balanced. Previous work on big spatial partitioning reuses existing index search trees as-is, e.g., the R-tree family, STR, Kd-tree, and Quad-tree, by building a temporary tree for a sample of the input and using its leaf nodes as partition boundaries. However, we show in this paper that none of those techniques addresses these challenges completely. This paper proposes a novel partitioning method, termed R*-Grove, which can partition very large spatial datasets into high-quality partitions with excellent load balance and block utilization. These properties allow R*-Grove to outperform existing techniques in spatial query processing. R*-Grove can be easily integrated into any big data platform, such as Apache Spark or Apache Hadoop. Our experiments show that R*-Grove outperforms the existing partitioning techniques for big spatial data systems. With all the proposed work publicly available as open source, we envision that R*-Grove will be adopted by the community to better serve big spatial data research.
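The sample-based approach that the abstract attributes to previous work can be sketched as follows. This is only a minimal illustration, assuming 2D points and a simple Kd-tree-style median split; it is not the R*-Grove algorithm, and the names `build_partitions`, `assign`, and `load_imbalance`, as well as the sample size and leaf capacity, are hypothetical choices made for the example.

```python
import random
from collections import Counter

def build_partitions(sample, box, capacity, depth=0):
    """Split `box` at the sample median, alternating axes, until a leaf
    holds at most `capacity` sample points. Returns the leaf boxes."""
    if len(sample) <= capacity:
        return [box]
    axis = depth % 2
    pts = sorted(sample, key=lambda p: p[axis])
    split = pts[len(pts) // 2][axis]
    left = [p for p in pts if p[axis] < split]
    right = [p for p in pts if p[axis] >= split]
    if not left or not right:
        # Too many points share this coordinate; stop splitting here
        # (the leaf may then exceed `capacity`).
        return [box]
    x1, y1, x2, y2 = box
    if axis == 0:
        lbox, rbox = (x1, y1, split, y2), (split, y1, x2, y2)
    else:
        lbox, rbox = (x1, y1, x2, split), (x1, split, x2, y2)
    return (build_partitions(left, lbox, capacity, depth + 1)
            + build_partitions(right, rbox, capacity, depth + 1))

def assign(point, boxes):
    """Return the index of the half-open box [x1, x2) x [y1, y2) containing `point`."""
    x, y = point
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        if x1 <= x < x2 and y1 <= y < y2:
            return i
    raise ValueError("point lies outside the partitioned space")

def load_imbalance(sizes):
    """Ratio of the largest partition to the mean partition size (1.0 is perfect)."""
    return max(sizes) / (sum(sizes) / len(sizes))

# Assumed toy input: uniform random points in the unit square.
random.seed(0)
data = [(random.random(), random.random()) for _ in range(20000)]

# Step 1: build a temporary tree on a small sample of the input.
sample = random.sample(data, 1000)
boxes = build_partitions(sample, (0.0, 0.0, 1.0, 1.0), capacity=50)

# Step 2: use the leaf boxes as partition boundaries for the full dataset.
counts = Counter(assign(p, boxes) for p in data)
sizes = [counts.get(i, 0) for i in range(len(boxes))]

print("partitions:", len(boxes))
print("imbalance (max / mean):", round(load_imbalance(sizes), 3))
```

Because the boundaries come from a sample, the sizes of the full partitions are only approximately equal; the max-to-mean ratio printed at the end is one simple way to quantify the load-balance concern the abstract raises.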