Paper Title
Genomics Data Analysis via Spectral Shape and Topology
Authors
Abstract
Mapper, a topological algorithm, is frequently used as an exploratory tool to build a graphical representation of data. This representation can help reveal the intrinsic shape of high-dimensional genomic data and retain information that standard dimension-reduction algorithms may lose. We propose a novel workflow that integrates Mapper with differential gene expression analysis to process and analyze RNA-seq data from tumor and healthy subjects. Specifically, we show that a Gaussian mixture approximation method can be used to produce graphical structures that successfully separate tumor subjects from healthy subjects and that further divide the tumor subjects into two subgroups. A further analysis using DESeq2, a popular tool for detecting differentially expressed genes, shows that these two tumor subgroups exhibit two distinct gene regulation patterns, suggesting two discrete pathways to lung cancer that other popular clustering methods, including t-SNE, could not highlight. Although Mapper shows promise in analyzing high-dimensional data, tools for the statistical analysis of Mapper graphical structures are limited in the existing literature. In this paper, we develop a scoring method based on heat kernel signatures that provides an empirical setting for statistical inference, such as hypothesis testing, sensitivity analysis, and correlation analysis.
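To make the heat kernel signature concrete, the sketch below computes it for the nodes of a small graph using the standard definition HKS(v, t) = sum_i exp(-lambda_i t) phi_i(v)^2, where lambda_i and phi_i are the eigenvalues and eigenvectors of the graph Laplacian. This is only an illustration under stated assumptions: the abstract does not specify how the scoring method is built from these signatures, so the function name `heat_kernel_signature`, the choice of the combinatorial Laplacian, the diffusion times, and the toy path graph are all hypothetical and are not taken from the paper.

```python
import numpy as np


def heat_kernel_signature(adjacency, times):
    """Return an (n_nodes, n_times) array of heat kernel signature values.

    Assumes an undirected graph given by a symmetric adjacency matrix and
    uses the combinatorial Laplacian L = D - A (an assumption of this sketch).
    """
    A = np.asarray(adjacency, dtype=float)
    laplacian = np.diag(A.sum(axis=1)) - A
    # The Laplacian is symmetric, so eigh returns real eigenvalues and
    # orthonormal eigenvectors (stored as columns).
    eigvals, eigvecs = np.linalg.eigh(laplacian)
    times = np.asarray(times, dtype=float)
    # HKS(v, t) = sum_i exp(-lambda_i * t) * phi_i(v)^2
    return (eigvecs ** 2) @ np.exp(-np.outer(eigvals, times))


# Toy stand-in for a Mapper graph: a path on four nodes, 0 - 1 - 2 - 3.
adjacency = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
hks = heat_kernel_signature(adjacency, times=[0.1, 1.0, 10.0])
print(hks)
```

Each row describes one node across diffusion times: small times reflect local connectivity, while large times reflect the global structure of the graph. Per-node signatures of this kind could then be aggregated into a graph-level score, although how the paper does so is not stated in the abstract.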