Bayesian Combinatorial Multi-Study Factor Analysis

Title

Bayesian Combinatorial Multi-Study Factor Analysis

Authors

Grabski, Isabella N., De Vito, Roberta, Trippa, Lorenzo, Parmigiani, Giovanni

Abstract

Analyzing multiple studies allows leveraging data from a range of sources and populations, but until recently, there have been limited methodologies to approach the joint unsupervised analysis of multiple high-dimensional studies. A recent method, Bayesian Multi-Study Factor Analysis (BMSFA), identifies latent factors common to all studies, as well as latent factors specific to individual studies. However, BMSFA does not allow for partially shared factors, i.e., latent factors shared by more than one but fewer than all studies. We extend BMSFA by introducing a new method, Tetris, for Bayesian combinatorial multi-study factor analysis, which identifies latent factors that can be shared by any combination of studies. We model the subsets of studies that share latent factors with an Indian Buffet Process. We test our method with an extensive range of simulations, and showcase its utility not only in dimension reduction but also in covariance estimation. Finally, we apply Tetris to high-dimensional gene expression datasets to identify patterns in breast cancer gene expression, both within and across known classes defined by germline mutations.
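
To make the role of the Indian Buffet Process more concrete, the following is a minimal sketch of how a prior draw over sharing patterns could look. It is not the authors' implementation: it assumes the standard one-parameter IBP with studies playing the role of customers and latent factors the role of dishes, and the names sample_ibp, describe_sharing, and alpha are hypothetical, chosen only for illustration. Each column of the resulting binary matrix marks the subset of studies that load on one factor, so a column of all ones is a common factor, a column with a single one is study-specific, and anything in between is partially shared.

```python
import numpy as np


def sample_ibp(num_studies, alpha, rng):
    """Draw a binary study-by-factor matrix from a one-parameter IBP.

    Assumption for illustration: rows are studies, columns are latent
    factors, and Z[s, k] = 1 means study s shares factor k.
    """
    Z = np.zeros((num_studies, 0), dtype=int)
    for s in range(num_studies):
        if Z.shape[1] > 0:
            # Study s adopts each existing factor with probability
            # (number of earlier studies using it) / (s + 1).
            counts = Z[:s].sum(axis=0)
            Z[s] = rng.random(Z.shape[1]) < counts / (s + 1)
        # Study s then introduces Poisson(alpha / (s + 1)) new factors.
        n_new = rng.poisson(alpha / (s + 1))
        new_cols = np.zeros((num_studies, n_new), dtype=int)
        new_cols[s] = 1
        Z = np.hstack([Z, new_cols])
    return Z


def describe_sharing(Z):
    """Count common, partially shared, and study-specific factors."""
    num_studies = Z.shape[0]
    sizes = Z.sum(axis=0)  # number of studies loading on each factor
    return {
        "common": int((sizes == num_studies).sum()),
        "partial": int(((sizes > 1) & (sizes < num_studies)).sum()),
        "specific": int((sizes == 1).sum()),
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Z = sample_ibp(num_studies=4, alpha=2.0, rng=rng)
    print(Z)
    print(describe_sharing(Z))
```

In this sketch a model restricted to common and study-specific factors corresponds to allowing only the all-ones and single-one columns, whereas the combinatorial setting described in the abstract places no such restriction on which subsets may occur.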
