Paper Title
A Scale-free Approach for False Discovery Rate Control in Generalized Linear Models
Paper Authors
Paper Abstract
Generalized linear models (GLMs) are widely used in practice to model non-Gaussian response variables. When the number of explanatory features is relatively large, researchers are often interested in performing controlled feature selection to simplify the downstream analysis. This paper introduces a new framework for feature selection in GLMs that can achieve false discovery rate (FDR) control in two asymptotic regimes. The key step is to construct, for each feature, a mirror statistic that measures its importance; the statistic is based on two (asymptotically) independent estimates of the corresponding true coefficient, obtained via either the data-splitting method or the Gaussian mirror method. FDR control is achieved by exploiting the property that, for any null feature, the sampling distribution of the mirror statistic is (asymptotically) symmetric about 0. In the moderate-dimensional setting, in which the ratio between the dimension (number of features) p and the sample size n converges to a fixed value, we construct the mirror statistic from the maximum likelihood estimator. In the high-dimensional setting, where p is much larger than n, we use the debiased Lasso to build the mirror statistic. Compared to the Benjamini-Hochberg procedure, which crucially relies on the asymptotic normality of the Z statistic, the proposed methodology is scale-free because it hinges only on the symmetry property, and is thus expected to be more robust in finite-sample cases. Both simulation results and a real data application show that the proposed methods are capable of controlling the FDR and are often more powerful than existing methods, including the Benjamini-Hochberg procedure and the knockoff filter.
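The selection step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact form of the mirror statistic or the cutoff rule, so the choices below are assumptions borrowed from common practice in the mirror-statistic literature: the statistic is sign(b1 * b2) * (|b1| + |b2|) for two independent coefficient estimates b1 and b2, and the cutoff is the smallest t at which the estimated false discovery proportion #{M_j < -t} / max(#{M_j > t}, 1) falls to q or below. The function names and the toy estimates are hypothetical; in practice b1 and b2 would come from, for example, GLM fits on two halves of the data.

```python
import numpy as np


def mirror_statistics(beta1, beta2):
    """Combine two independent coefficient estimates into mirror statistics.

    Assumed form (not stated in the abstract): sign(b1 * b2) * (|b1| + |b2|).
    A null feature yields a statistic roughly symmetric about 0, while a
    relevant feature tends to yield a large positive value.
    """
    beta1 = np.asarray(beta1, dtype=float)
    beta2 = np.asarray(beta2, dtype=float)
    return np.sign(beta1 * beta2) * (np.abs(beta1) + np.abs(beta2))


def select_features(mirror, q):
    """Return (selected indices, cutoff) for a nominal FDR level q.

    The left tail {M_j < -t} serves as a stand-in for the unknown number of
    false positives in the right tail {M_j > t}, relying only on symmetry.
    """
    mirror = np.asarray(mirror, dtype=float)
    # The FDP estimate is a step function that changes only at |M_j|,
    # so checking 0 and the sorted |M_j| values covers every distinct case.
    candidates = np.concatenate(([0.0], np.sort(np.abs(mirror))))
    for t in candidates:
        fdp_hat = np.sum(mirror < -t) / max(np.sum(mirror > t), 1)
        if fdp_hat <= q:
            return np.flatnonzero(mirror > t), float(t)
    # Safety fallback: the largest candidate always gives an estimate of 0.
    return np.array([], dtype=int), float("inf")


# Toy estimates (hypothetical): features 0, 1 and 5 carry signal.
beta_half_1 = [2.0, 1.5, 0.1, -0.2, 0.3, -1.8]
beta_half_2 = [1.8, 1.7, -0.2, -0.1, -0.1, -2.1]

mirror = mirror_statistics(beta_half_1, beta_half_2)
selected, cutoff = select_features(mirror, q=0.1)
print(selected.tolist())  # [0, 1, 5]
```

Nothing in the cutoff rule uses the variance or the normality of the estimates, only the signs and magnitudes of the mirror statistics, which is the sense in which such a procedure is scale-free.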