Paper Title

Robust Covariance Estimation for High-dimensional Compositional Data with Application to Microbial Communities Analysis

Authors

Yong He, Pengfei Liu, Xinsheng Zhang, Wang Zhou

Abstract

Microbial community analysis is drawing growing attention owing to the rapid development of high-throughput sequencing techniques. The observed data have the following typical characteristics: they are high-dimensional, compositional (lying in a simplex), and may even be leptokurtic and highly skewed due to the existence of overly abundant taxa, which makes conventional correlation analysis infeasible for studying the co-occurrence and co-exclusion relationships between microbial taxa. In this article, we address the challenges of covariance estimation for this kind of data. Assuming that the basis covariance matrix lies in a well-recognized class of sparse covariance matrices, we adopt a proxy matrix known in the literature as the centered log-ratio covariance matrix, which is approximately indistinguishable from the true basis covariance matrix as the dimensionality tends to infinity. We construct a Median-of-Means (MOM) estimator for the centered log-ratio covariance matrix and propose a thresholding procedure that is adaptive to the variability of individual entries. By imposing a finite fourth moment condition, which is much weaker than the sub-Gaussianity condition in the literature, we derive the optimal rate of convergence under the spectral norm. In addition, we provide theoretical guarantees on support recovery. The adaptive thresholding procedure of the MOM estimator is easy to implement and gains robustness when outliers or heavy-tailedness exist. Thorough simulation studies are conducted to show the advantages of the proposed procedure over some state-of-the-art methods. Finally, we apply the proposed method to analyze a microbiome dataset from the human gut. The R script for implementing the method is available at https://github.com/heyongstat/RCEC.
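To make the pipeline described in the abstract concrete, the following is a minimal Python sketch of its three ingredients: the centered log-ratio transform, an entry-wise median of block-wise sample covariances, and hard thresholding of the off-diagonal entries. It is not the authors' implementation (their R script is at the URL above). The function names, the block count, the simulated data, and the single scalar threshold `lam` are all assumptions made for illustration; the paper's procedure uses entry-adaptive thresholds, and its MOM construction may differ in detail from the block-covariance median used here.

```python
import numpy as np

def clr(X):
    """Centered log-ratio transform of an (n, p) array of strictly positive compositions."""
    logX = np.log(X)
    return logX - logX.mean(axis=1, keepdims=True)

def mom_covariance(Z, n_blocks):
    """Entry-wise median of the sample covariance matrices computed on disjoint row blocks.

    Assumption: this is one simple median-of-means style construction, not
    necessarily the exact estimator defined in the paper.
    """
    blocks = np.array_split(Z, n_blocks, axis=0)
    covs = np.stack([np.cov(b, rowvar=False) for b in blocks])
    return np.median(covs, axis=0)

def hard_threshold(S, lam):
    """Zero out off-diagonal entries with |s_ij| <= lam; the diagonal is kept as-is.

    lam may be a scalar or a (p, p) array, so entry-specific thresholds can be
    passed in; the scalar used below is a simplification.
    """
    T = np.where(np.abs(S) > lam, S, 0.0)
    np.fill_diagonal(T, np.diag(S))
    return T

# Simulated heavy-tailed compositional data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
n, p = 200, 10
W = np.exp(rng.standard_t(df=5, size=(n, p)))   # positive basis with heavy tails
X = W / W.sum(axis=1, keepdims=True)            # each row lies in the simplex

Z = clr(X)
S = mom_covariance(Z, n_blocks=10)
lam = 2.0 * np.sqrt(np.log(p) / n)              # illustrative universal threshold
S_thr = hard_threshold(S, lam)
```

Because `hard_threshold` broadcasts `lam`, a `(p, p)` matrix of entry-specific thresholds can be supplied in place of the scalar once an estimate of each entry's variability is available.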
