Paper title
Statistical Depth based Normalization and Outlier Detection of Gene Expression Data
Authors
Abstract
Normalization and outlier detection belong to the preprocessing of gene expression data. We propose a natural normalization procedure based on statistical data depth, which normalizes to the distribution of gene expressions of the most representative gene expression of the group. This differs from the standard method of quantile normalization, which is based on the coordinate-wise median array, an array that lacks the well-known properties of the one-dimensional median. Statistical data depth maintains those good properties. Gene expression data are known to contain outliers. Although detecting outlier genes in a given gene expression dataset has been broadly studied, these methodologies do not apply to detecting outlier samples, given the difficulties posed by the high-dimensional, low-sample-size structure of the data. The standard procedures for detecting outlier samples are visual and based on dimension reduction techniques; examples are multidimensional scaling and spectral map plots. For detecting outlier genes in a given gene expression dataset, we propose an analytical procedure based on Tukey's concept of outlier and the notion of statistical depth, as previous methodologies lead to inconclusive and erroneous outliers. We reveal the outliers of four datasets as a necessary step for further research.
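The two ideas in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which depth function is used, so the spatial (L1) depth below is a stand-in assumption, and the function names `spatial_depth`, `depth_normalize`, and `tukey_low_outliers` are hypothetical. Rows are assumed to be samples (arrays) and columns genes. The sketch takes the sorted expression profile of the deepest sample as the reference distribution, where standard quantile normalization would use a coordinate-wise summary of the sorted profiles, and then applies a Tukey-style lower fence to the depth values, which is likewise an assumed reading of "Tukey's concept of outlier".

```python
import numpy as np

def spatial_depth(X):
    """Spatial (L1) depth of each row of X with respect to all rows of X.
    ASSUMPTION: the source does not name a depth function; this is a stand-in."""
    n = X.shape[0]
    depths = np.empty(n)
    for i in range(n):
        diff = X[i] - X                          # vectors from every row to row i
        norms = np.linalg.norm(diff, axis=1)
        mask = norms > 0                         # skip zero vectors (row i itself, exact ties)
        units = diff[mask] / norms[mask, None]   # unit direction vectors
        depths[i] = 1.0 - np.linalg.norm(units.sum(axis=0) / n)
    return depths

def depth_normalize(X):
    """Map every sample onto the sorted profile of the deepest sample.
    ASSUMPTION: 'most representative' is read as the row of maximal depth."""
    X = np.asarray(X, dtype=float)
    order = np.argsort(X, axis=1)                      # per-sample gene ranks
    sorted_X = np.take_along_axis(X, order, axis=1)    # per-sample quantile vectors
    ref = sorted_X[np.argmax(spatial_depth(sorted_X))] # deepest quantile vector
    out = np.empty_like(X)
    # Write the reference quantiles back in each sample's original gene order.
    np.put_along_axis(out, order, np.broadcast_to(ref, X.shape), axis=1)
    return out

def tukey_low_outliers(depths, k=1.5):
    """Flag values below Q1 - k * IQR (hypothetical Tukey-style rule on depths)."""
    q1, q3 = np.percentile(depths, [25, 75])
    return depths < q1 - k * (q3 - q1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(8, 200))            # 8 samples, 200 genes (synthetic data)
    X[7] += 3.0                              # shift one sample away from the rest
    print(tukey_low_outliers(spatial_depth(X)))
    print(depth_normalize(X).shape)
```

After `depth_normalize`, every sample shares the same empirical distribution while each sample keeps its own within-sample gene ranking, which is the property quantile normalization is designed to provide. The reference here is an observed profile, not a synthetic coordinate-wise array.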