Paper Title
Computationally efficient univariate filtering for massive data
Paper Authors
Paper Abstract
The vast availability of large-scale, massive, and big data has increased the computational cost of data analysis. One such case is the computational cost of univariate filtering, which typically involves fitting many univariate regression models and is essential for numerous variable selection algorithms that reduce the number of predictor variables. This paper demonstrates how to dramatically reduce that computational cost by employing the score test or the simple Pearson correlation (or the t-test for binary responses). Extensive Monte Carlo simulation studies demonstrate their advantages and disadvantages compared to the log-likelihood ratio test, and examples with real data illustrate the performance of the score test and the log-likelihood ratio test under realistic scenarios. Depending on the regression model used, the score test is 30 to 60,000 times faster than the log-likelihood ratio test and produces nearly the same results. Hence, this paper strongly recommends substituting the log-likelihood ratio test with the score test when coping with large-scale data, massive data, big data, or even data whose sample size is on the order of a few tens of thousands or higher.
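The speed advantage comes from the fact that, for a continuous response, the score test for each predictor in a univariate linear regression reduces to the Pearson correlation test, which can be computed for all predictors at once with matrix operations instead of fitting p separate models. The following is a minimal sketch of that idea, not the authors' code; the function name pearson_filter, the significance level alpha, and the use of NumPy/SciPy are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def pearson_filter(X, y, alpha=0.05):
    """Illustrative univariate filtering via Pearson correlation.

    X: (n, p) matrix of predictors; y: (n,) continuous response.
    Returns the indices of predictors whose univariate association
    with y is significant at level alpha.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)          # center each predictor column
    yc = y - y.mean()                # center the response
    # Correlation of every column of X with y in a single matrix pass
    r = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    # Equivalent t statistic with n - 2 degrees of freedom
    t = r * np.sqrt(n - 2) / np.sqrt(1.0 - r ** 2)
    pvals = 2.0 * stats.t.sf(np.abs(t), df=n - 2)
    return np.where(pvals < alpha)[0]
```

Because this computes all p test statistics without a single iterative model fit, it scales to the sample sizes the abstract targets; for binary responses the abstract's suggested t-test admits an analogous vectorized form.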