CLARITY - Comparing heterogeneous data using dissimilarity

Paper title

CLARITY -- Comparing heterogeneous data using dissimiLARITY

Authors

Lawson, Daniel J., Solanki, Vinesh, Yanovich, Igor, Dellert, Johannes, Ruck, Damian, Endicott, Phillip

Abstract

Integrating datasets from different disciplines is hard because the data are often qualitatively different in meaning, scale, and reliability. When two datasets describe the same entities, many scientific questions can be phrased around whether the (dis)similarities between entities are conserved across such different data. Our method, CLARITY, quantifies consistency across datasets, identifies where inconsistencies arise, and aids in their interpretation. We illustrate this using three diverse comparisons: gene methylation vs expression, evolution of language sounds vs word use, and country-level economic metrics vs cultural beliefs. The non-parametric approach is robust to noise and differences in scaling, and makes only weak assumptions about how the data were generated. It operates by decomposing similarities into two components: a 'structural' component analogous to a clustering, and an underlying 'relationship' between those structures. This allows a 'structural comparison' between two similarity matrices using their predictability from 'structure'. Significance is assessed with the help of re-sampling appropriate for each dataset. The software, CLARITY, is available as an R package from https://github.com/danjlawson/CLARITY.
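
The decomposition described in the abstract can be illustrated with a minimal Python sketch. It is not the CLARITY R package or its API: the function names are hypothetical, and a truncated eigendecomposition is assumed here as a simple stand-in for learning the 'structure'. The idea shown is to learn a structure matrix A from a reference similarity matrix, refit only the 'relationship' matrix X on a second matrix so that Y is approximated by A X A^T, and inspect the residuals that the shared structure cannot explain. The re-sampling step used to assess significance is omitted.

```python
import numpy as np


def structure_from_similarity(Y, k):
    """Assumed stand-in for structure learning: top-k eigenvectors of symmetric Y."""
    vals, vecs = np.linalg.eigh((Y + Y.T) / 2.0)
    order = np.argsort(np.abs(vals))[::-1][:k]  # largest-magnitude eigenvalues first
    return vecs[:, order]                        # n x k structure matrix A


def fit_relationship(A, Y):
    """Least-squares X minimising ||Y - A X A^T||_F for a fixed structure A."""
    A_pinv = np.linalg.pinv(A)
    return A_pinv @ Y @ A_pinv.T


def structural_residuals(Y_ref, Y_target, k):
    """Residuals of Y_target after predicting it from the structure of Y_ref."""
    A = structure_from_similarity(Y_ref, k)
    X = fit_relationship(A, Y_target)
    return Y_target - A @ X @ A.T


def one_hot(labels, k):
    """Hard cluster memberships as an n x k indicator matrix (toy data helper)."""
    M = np.zeros((len(labels), k))
    M[np.arange(len(labels)), labels] = 1.0
    return M


# Toy data: six entities, two clusters.
A_same = one_hot(np.array([0, 0, 0, 1, 1, 1]), 2)
A_diff = one_hot(np.array([0, 1, 0, 1, 0, 1]), 2)
X_ref = np.array([[1.0, 0.2], [0.2, 1.0]])
X_new = np.array([[5.0, 3.0], [3.0, 4.0]])

Y_ref = A_same @ X_ref @ A_same.T      # reference similarities
Y_same = A_same @ X_new @ A_same.T     # same structure, different relationship and scale
Y_diff = A_diff @ X_new @ A_diff.T     # different structure

res_same = np.linalg.norm(structural_residuals(Y_ref, Y_same, k=2))
res_diff = np.linalg.norm(structural_residuals(Y_ref, Y_diff, k=2))
print(res_same, res_diff)
```

In this toy setup the second matrix with the same clustering is predicted almost exactly despite its different scale, while the matrix built on a different clustering leaves clearly non-zero residuals, which is the kind of inconsistency the abstract says the method localises.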
