Paper title
Ranking and benchmarking framework for sampling algorithms on synthetic data streams
Paper authors
Paper abstract
In big data, AI, and stream processing, we work with large amounts of data from multiple sources. Because of memory and network limitations, we process data streams on distributed systems to reduce computational and network load. When data streams with non-uniform distributions are processed, simple hash partitioning often leaves some partitions overloaded. To tackle this imbalance, we can use dynamic partitioning algorithms, which rely on a sampling algorithm to precisely estimate the underlying distribution of the data stream. There is no standardized way to test these sampling algorithms. We offer an extensible ranking framework with benchmarking and hyperparameter optimization capabilities, and we supply the framework with a data generator that can handle concept drifts. Our work also includes a generator for dynamic micro-bursts that can be applied to any data stream. We provide algorithms that react to concept drifts and use our framework to compare them against state-of-the-art algorithms.
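The imbalance described in the abstract can be illustrated with a minimal sketch. It is not the paper's framework, generator, or any of its algorithms. All names here (`zipf_stream`, `hash_partition_load`, `reservoir_sample`) are hypothetical, the Zipf-like key distribution is an assumed stand-in for a non-uniform stream, and the classic reservoir sampling method (Algorithm R) stands in for the sampling algorithms the paper evaluates.

```python
import random
from collections import Counter


def zipf_stream(n_items, n_keys, exponent, seed):
    """Hypothetical synthetic stream: keys 0..n_keys-1, where the key of
    rank r is drawn with probability proportional to 1 / r**exponent."""
    rng = random.Random(seed)
    keys = list(range(n_keys))
    weights = [1.0 / (rank ** exponent) for rank in range(1, n_keys + 1)]
    return rng.choices(keys, weights=weights, k=n_items)


def hash_partition_load(stream, n_partitions):
    """Count the records each partition receives under a simple static
    partitioner (key modulo partition count, standing in for a hash)."""
    load = [0] * n_partitions
    for key in stream:
        load[key % n_partitions] += 1
    return load


def reservoir_sample(stream, k, seed):
    """Algorithm R: a uniform random sample of k items in a single pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


stream = zipf_stream(n_items=100_000, n_keys=1_000, exponent=1.2, seed=0)
load = hash_partition_load(stream, n_partitions=8)
sample = reservoir_sample(stream, k=2_000, seed=1)

# Estimated relative frequency of the heaviest keys, based on the sample only.
estimate = {key: count / len(sample) for key, count in Counter(sample).most_common(3)}
print("records per partition:", load)
print("estimated top-key frequencies:", estimate)
```

With a skewed stream like this one, the partition that receives the most frequent key carries a noticeably larger share of the records. A dynamic partitioner could use frequency estimates such as `estimate` to redistribute heavy keys. A plain reservoir sample weights the whole stream history equally, so it adapts slowly when the distribution shifts. That is the kind of behavior a benchmark with concept drifts is meant to expose.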