Paper title
Extracting the main trend in a dataset: the Sequencer algorithm
Paper authors
Paper abstract
Scientists aim to extract simplicity from observations of the complex world. An important component of this process is the exploration of data in search of trends. In practice, however, this tends to be more of an art than a science. Among all trends existing in the natural world, one-dimensional trends, often called sequences, are of particular interest as they provide insights into simple phenomena. However, some are challenging to detect as they may be expressed in complex ways. We present the Sequencer, an algorithm designed to generically identify the main trend in a dataset. It does so by constructing graphs describing the similarities between pairs of observations, computed with a set of metrics and scales. Using the fact that continuous trends lead to more elongated graphs, the algorithm can identify which aspects of the data are relevant in establishing a global sequence. Such an approach can be used beyond the proposed algorithm and can optimize the parameters of any dimensionality reduction technique. We demonstrate the power of the Sequencer using real-world data from astronomy and geology, as well as images from the natural world. We show that, in a number of cases, it outperforms the popular t-SNE and UMAP dimensionality reduction techniques. This approach to exploratory data analysis, which relies on neither training nor the tuning of any parameter, has the potential to enable discoveries in a wide range of scientific domains. The source code is available on GitHub and we provide an online interface at \url{http://sequencer.org}.
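The idea that a continuous trend yields an elongated graph can be illustrated with a small sketch. The abstract does not say which graph is built or how elongation is measured, so everything below is an assumption made for illustration: a single Euclidean metric at a single scale, a minimum spanning tree as the similarity graph, and an elongation score defined as the tree diameter (in edges) divided by n - 1. The function names are hypothetical and are not taken from the Sequencer source code.

```python
import math
from collections import deque


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def minimum_spanning_tree(points, dist=euclidean):
    """Prim's algorithm on the complete graph; returns an adjacency list."""
    n = len(points)
    adj = {i: [] for i in range(n)}
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known edge from the tree to each node
    parent = [-1] * n
    if n:
        best[0] = 0.0
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        if parent[u] >= 0:
            adj[u].append(parent[u])
            adj[parent[u]].append(u)
        for v in range(n):
            if not in_tree[v]:
                d = dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return adj


def bfs_depths(adj, start):
    """Number of edges from `start` to every node of the tree."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in depth:
                depth[v] = depth[u] + 1
                queue.append(v)
    return depth


def depths_from_end(adj):
    """Depths measured from one end of the longest path (double BFS)."""
    first = bfs_depths(adj, 0)
    end = max(first, key=first.get)
    return bfs_depths(adj, end)


def elongation(adj):
    """Assumed score: tree diameter / (n - 1); 1.0 for a perfect chain."""
    n = len(adj)
    if n < 2:
        return 0.0
    return max(depths_from_end(adj).values()) / (n - 1)


def order_along_tree(adj):
    """Order nodes by depth from one end of the longest path."""
    depths = depths_from_end(adj)
    return sorted(depths, key=depths.get)


# Shuffled points on a straight line: the tree is a chain.
line = [(float(i), 2.0 * i) for i in (3, 0, 4, 1, 2)]
line_tree = minimum_spanning_tree(line)
print(elongation(line_tree))                               # 1.0
print([line[i][0] for i in order_along_tree(line_tree)])   # [0.0, 1.0, 2.0, 3.0, 4.0]

# A cross-shaped cloud with no one-dimensional trend: the tree is a star.
cross = [(0.0, 0.0), (1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
print(elongation(minimum_spanning_tree(cross)))            # 0.5
```

Under these assumptions, one could compute the score for several candidate metrics or scales and prefer those giving the most elongated tree, which is the selection principle the abstract describes in general terms.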