Paper title
Coconut: sortable summarizations for scalable indexes over static and streaming data series
Paper authors
Paper abstract
Many modern applications produce massive streams of data series that need to be analyzed, requiring efficient similarity search operations. However, the state-of-the-art data series indexes that are used for this purpose do not scale well for massive datasets in terms of performance or storage costs. We pinpoint the problem to the fact that existing summarizations of data series used for indexing cannot be sorted while keeping similar data series close to each other in the sorted order. To address this problem, we present Coconut, the first data series index based on sortable summarizations and the first efficient solution for indexing and querying streaming series. The first innovation in Coconut is an inverted, sortable data series summarization that organizes data series based on a z-order curve, keeping similar series close to each other in the sorted order. As a result, Coconut is able to use bulk loading and updating techniques that rely on sorting to quickly build and maintain a contiguous index using large sequential disk I/Os. We then explore prefix-based and median-based splitting policies for bottom-up bulk loading, showing that median-based splitting outperforms the state of the art, ensuring that all nodes are densely populated. Finally, we explore the impact of sortable summarizations on variable-sized window queries, showing that they can be supported in the presence of updates through efficient merging of temporal partitions. Overall, we show analytically and empirically that Coconut dominates the state-of-the-art data series indexes in terms of construction speed, query speed, and storage costs.
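The idea of a sortable summarization built on a z-order curve can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`paa`, `quantize`, `z_order_key`, `bulk_load`), the uniform quantization bins, and the fixed-size leaf packing are all assumptions made for illustration. Each series is reduced to a few segment means, each mean is quantized to a small integer symbol, and the symbol bits are interleaved so that the most significant bit of every segment comes first. Sorting by the resulting key tends to place series with similar coarse shapes next to each other, which is what makes sort-based bulk loading possible.

```python
from typing import List, Sequence, Tuple


def paa(series: Sequence[float], segments: int) -> List[float]:
    """Summarize a series by the mean of each equal-length segment."""
    if len(series) % segments != 0:
        raise ValueError("series length must be a multiple of segments")
    step = len(series) // segments
    return [sum(series[i * step:(i + 1) * step]) / step for i in range(segments)]


def quantize(value: float, bits: int, lo: float = -2.0, hi: float = 2.0) -> int:
    """Map a value to an integer symbol in [0, 2**bits - 1].

    Uniform bins over [lo, hi] are an assumption of this sketch.
    """
    levels = 1 << bits
    clipped = min(max(value, lo), hi)
    symbol = int((clipped - lo) / (hi - lo) * levels)
    return min(symbol, levels - 1)


def z_order_key(symbols: Sequence[int], bits: int) -> int:
    """Interleave symbol bits, most significant bit of every segment first."""
    key = 0
    for b in range(bits - 1, -1, -1):
        for s in symbols:
            key = (key << 1) | ((s >> b) & 1)
    return key


def bulk_load(
    series_list: Sequence[Sequence[float]],
    segments: int = 4,
    bits: int = 3,
    leaf_size: int = 2,
) -> List[List[Tuple[int, int]]]:
    """Sort (key, series_id) pairs and pack them into leaves of leaf_size.

    Fixed-size packing is a simplified stand-in for a splitting policy;
    every leaf except possibly the last one is completely filled.
    """
    keyed = sorted(
        (z_order_key([quantize(v, bits) for v in paa(s, segments)], bits), i)
        for i, s in enumerate(series_list)
    )
    return [keyed[i:i + leaf_size] for i in range(0, len(keyed), leaf_size)]
```

Because the sort produces one contiguous run of keys, the leaves can be written out sequentially, which corresponds to the large sequential disk I/Os mentioned in the abstract. A real index would also store the summaries and pointers to the raw series, which this sketch omits.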