Paper Title

Data Stream Clustering: A Review

Authors

Zubaroğlu, Alaettin; Atalay, Volkan

Abstract

The number of connected devices is steadily increasing, and these devices continuously generate data streams. Real-time processing of data streams is attracting interest despite many challenges. Clustering is one of the most suitable methods for real-time data stream processing because it can be applied with less prior information about the data and does not need labeled instances. However, data stream clustering differs from traditional clustering in many aspects and poses several challenging issues. Here, we provide information regarding the concepts and common characteristics of data streams, such as concept drift, data structures for data streams, time window models and outlier detection. We comprehensively review recent data stream clustering algorithms and analyze them in terms of the base clustering technique, computational complexity and clustering accuracy. A comparison of these algorithms is given along with still-open problems. We point to popular data stream repositories and datasets, and to stream processing tools and platforms. Open problems in data stream clustering are also discussed.
