椰子：构建数据系列索引的可扩展自下而上的方法

论文标题

椰子：构建数据系列索引的可扩展自下而上的方法

Coconut: a scalable bottom-up approach for building data series indexes

论文作者

Kondylakis, Haridimos, Dayan, Niv, Zoumpatianos, Kostas, Palpanas, Themis

论文摘要

许多现代应用程序会产生大量的数据系列，需要进行分析，需要有效的相似性搜索操作。但是，用于此目的的最新数据系列索引在性能或存储成本方面对大规模数据集的扩展不是很好。我们指出了一个问题，即无法对用于索引的数据系列的现有摘要进行分类，而将相似的数据系列保持在排序顺序上。这导致了两个设计问题。首先，无法使用基于排序的传统散装算法。取而代之的是，索引构造是通过慢自上而下的插入进行的，该插入会产生一个不连续的索引，从而导致许多随机的I/OS。其次，数据序列不能根据其中位数均匀地对节点进行排序和分配。因此，大多数叶子节点在实践中几乎是空的。这进一步降低了查询速度并扩大存储成本。为了解决这些问题，我们提出椰子。椰子中的第一个创新是一个倒置的，可排序的数据系列摘要，基于Z阶曲线组织数据系列，以分类的顺序保持相似的相似系列。结果，椰子能够使用批量加载技术，这些技术依靠排序来快速使用大序列磁盘I/OS快速构建连续索引。然后，我们探索基于前缀的基于前缀和基于中位数的分裂策略，用于自下而上的体积加载，表明基于中位数的分裂表现优于最新技术，从而确保所有节点都占繁殖。总体而言，我们从分析和经验上表明，椰子在建筑速度，查询速度和存储成本方面占主导地位的最先进的数据系列索引。

Many modern applications produce massive amounts of data series that need to be analyzed, requiring efficient similarity search operations. However, the state-of-the-art data series indexes that are used for this purpose do not scale well for massive datasets in terms of performance, or storage costs. We pinpoint the problem to the fact that existing summarizations of data series used for indexing cannot be sorted while keeping similar data series close to each other in the sorted order. This leads to two design problems. First, traditional bulk-loading algorithms based on sorting cannot be used. Instead, index construction takes place through slow top-down insertions, which create a non-contiguous index that results in many random I/Os. Second, data series cannot be sorted and split across nodes evenly based on their median value; thus, most leaf nodes are in practice nearly empty. This further slows down query speed and amplifies storage costs. To address these problems, we present Coconut. The first innovation in Coconut is an inverted, sortable data series summarization that organizes data series based on a z-order curve, keeping similar series close to each other in the sorted order. As a result, Coconut is able to use bulk-loading techniques that rely on sorting to quickly build a contiguous index using large sequential disk I/Os. We then explore prefix-based and median-based splitting policies for bottom-up bulk-loading, showing that median-based splitting outperforms the state of the art, ensuring that all nodes are densely populated. Overall, we show analytically and empirically that Coconut dominates the state-of-the-art data series indexes in terms of construction speed, query speed, and storage costs.

下载PDF全文

下载文献需遵守相关版权规定

论文标题