Paper Title
Scalable Estimation and Inference with Large-scale or Online Survival Data
Paper Authors
Paper Abstract
With the rapid development of data collection and aggregation technologies in many scientific disciplines, it is becoming increasingly common to conduct large-scale or online regression to analyze real-world data and unveil real-world evidence. In such applications, it is often numerically challenging, and sometimes infeasible, to store the entire dataset in memory. Consequently, classical batch-based estimation methods that involve the entire dataset are less attractive or no longer applicable. Instead, recursive estimation methods such as stochastic gradient descent, which process data points sequentially, are more appealing because they offer both numerical convenience and memory efficiency. In this paper, for scalable estimation with large-scale or online survival data, we propose a stochastic gradient descent method that recursively updates the estimates in an online manner as data points arrive sequentially in streams. Theoretical results such as asymptotic normality and estimation efficiency are established to justify its validity. Furthermore, to quantify the uncertainty associated with the proposed stochastic gradient descent estimator and facilitate statistical inference, we develop a scalable resampling strategy that specifically caters to the large-scale or online setting. Simulation studies and a real data application are also provided to assess its performance and illustrate its practical utility.
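To make the idea of a recursive, one-pass estimator concrete, the following is a minimal illustrative sketch and not the paper's method. The abstract does not state which survival model, step-size schedule, or resampling scheme the authors use, so everything here is an assumption: a right-censored exponential hazards model with rate exp(x'beta), a polynomially decaying step size with iterate averaging, and random Exp(1) reweighting of each update as a stand-in for a scalable resampling strategy. The names `simulate_stream` and `online_sgd` and all default parameters are hypothetical.

```python
import math
import random


def simulate_stream(n, beta, censor_rate=0.5, seed=0):
    """Yield (time, event, covariates) one at a time (hypothetical data generator).

    Assumed model: event time ~ Exponential(rate = exp(x'beta)),
    with independent exponential right censoring.
    """
    rng = random.Random(seed)
    for _ in range(n):
        x = [1.0, rng.uniform(-1.0, 1.0)]  # intercept plus one covariate
        rate = math.exp(sum(b * xi for b, xi in zip(beta, x)))
        t_event = rng.expovariate(rate)
        t_cens = rng.expovariate(censor_rate)
        event = 1.0 if t_event <= t_cens else 0.0
        yield min(t_event, t_cens), event, x


def online_sgd(stream, dim, n_replicates=50, lr0=0.5, decay=0.6, seed=1):
    """One pass over the stream; memory use does not grow with the stream length.

    Path 0 is the plain SGD path. Paths 1..n_replicates multiply each update
    by an independent Exp(1) weight (an assumed resampling scheme), and the
    spread of their averaged iterates serves as a standard-error estimate.
    """
    rng = random.Random(seed)
    paths = [[0.0] * dim for _ in range(n_replicates + 1)]
    averages = [[0.0] * dim for _ in range(n_replicates + 1)]
    for k, (t, event, x) in enumerate(stream, start=1):
        step = lr0 * k ** (-decay)  # decaying step size
        for r, theta in enumerate(paths):
            weight = 1.0 if r == 0 else rng.expovariate(1.0)
            eta = sum(b * xi for b, xi in zip(theta, x))
            # Score of the censored exponential log-likelihood:
            # (event - t * exp(eta)) * x
            resid = event - t * math.exp(eta)
            for j in range(dim):
                theta[j] += step * weight * resid * x[j]
                # Running mean of the iterates (iterate averaging)
                averages[r][j] += (theta[j] - averages[r][j]) / k
    estimate = averages[0]
    std_errs = []
    for j in range(dim):
        vals = [a[j] for a in averages[1:]]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / (len(vals) - 1)
        std_errs.append(math.sqrt(var))
    return estimate, std_errs


if __name__ == "__main__":
    stream = simulate_stream(5000, beta=[0.0, 0.5])
    estimate, std_errs = online_sgd(stream, dim=2, n_replicates=20)
    print("estimate:", estimate)
    print("std errs:", std_errs)
```

Each observation is used once and then discarded, so the full dataset never needs to be held in memory. The perturbed paths are updated alongside the main path within the same pass, which is what makes this style of resampling compatible with streaming data under the assumptions stated above.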