Missing Data Imputation using Optimal Transport

Paper title

Missing Data Imputation using Optimal Transport

Authors

Boris Muzellec, Julie Josse, Claire Boyer, Marco Cuturi

Abstract

Missing data is a crucial issue when applying machine learning algorithms to real-world datasets. Starting from the simple assumption that two batches extracted randomly from the same dataset should share the same distribution, we leverage optimal transport distances to quantify that criterion and turn it into a loss function to impute missing data values. We propose practical methods to minimize these losses using end-to-end learning, that can exploit or not parametric assumptions on the underlying distributions of values. We evaluate our methods on datasets from the UCI repository, in MCAR, MAR and MNAR settings. These experiments show that OT-based methods match or out-perform state-of-the-art imputation methods, even for high percentages of missing values.
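
The criterion in the abstract (two random batches from the same dataset should share a distribution) can be illustrated with a small numerical sketch. The abstract says only "optimal transport distances"; the entropy-regularized (Sinkhorn) divergence, the squared Euclidean cost, the regularization strength `eps`, and the function names below are assumptions chosen for illustration, not details taken from the paper. The sketch computes the loss only; minimizing it over imputed values would additionally require gradients, for example from an autodiff library.

```python
import numpy as np


def _logsumexp(m, axis):
    # Numerically stable log(sum(exp(m))) along one axis.
    mx = m.max(axis=axis, keepdims=True)
    return (mx + np.log(np.exp(m - mx).sum(axis=axis, keepdims=True))).squeeze(axis)


def entropic_ot(x, y, eps=1.0, n_iter=200):
    """Dual estimate of entropy-regularized OT between two batches.

    Both batches are treated as uniform empirical measures; the cost is
    half the squared Euclidean distance (an assumption of this sketch).
    """
    n, m = len(x), len(y)
    cost = 0.5 * ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    log_a = np.full(n, -np.log(n))
    log_b = np.full(m, -np.log(m))
    f = np.zeros(n)
    g = np.zeros(m)
    for _ in range(n_iter):
        # Log-domain Sinkhorn updates of the two dual potentials.
        f = -eps * _logsumexp((g[None, :] - cost) / eps + log_b[None, :], axis=1)
        g = -eps * _logsumexp((f[:, None] - cost) / eps + log_a[:, None], axis=0)
    return f.mean() + g.mean()


def sinkhorn_divergence(x, y, eps=1.0, n_iter=200):
    # Debiased divergence: subtract the two self-transport terms.
    return (entropic_ot(x, y, eps, n_iter)
            - 0.5 * entropic_ot(x, x, eps, n_iter)
            - 0.5 * entropic_ot(y, y, eps, n_iter))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch_1 = rng.normal(size=(64, 2))
    batch_2 = rng.normal(size=(64, 2))   # same distribution as batch_1
    shifted = batch_2 + 3.0              # a deliberately mismatched batch
    print("same distribution:", sinkhorn_divergence(batch_1, batch_2))
    print("shifted batch:    ", sinkhorn_divergence(batch_1, shifted))
```

In an imputation setting, the missing entries of each batch would be free variables, and a loss of this kind would be driven down so that imputed batches become hard to tell apart in distribution. A poorly imputed batch plays the role of `shifted` above.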

Paper title