Linear predictor on linearly-generated data with missing values: non consistency and solutions

Paper title

Linear predictor on linearly-generated data with missing values: non consistency and solutions

Authors

Marine Le Morvan, Nicolas Prost, Julie Josse, Erwan Scornet, Gaël Varoquaux

Abstract

We consider building predictors when the data have missing values. We study the seemingly-simple case where the target to predict is a linear function of the fully-observed data and we show that, in the presence of missing values, the optimal predictor may not be linear. In the particular Gaussian case, it can be written as a linear function of multiway interactions between the observed data and the various missing-value indicators. Due to its intrinsic complexity, we study a simple approximation and prove generalization bounds with finite samples, highlighting regimes for which each method performs best. We then show that multilayer perceptrons with ReLU activation functions can be consistent, and can explore good trade-offs between the true model and approximations. Our study highlights the interesting family of models that are beneficial to fit with missing values depending on the amount of data available.
