Semi-supervised learning and the question of true versus estimated propensity scores

Paper title

Semi-supervised learning and the question of true versus estimated propensity scores

Authors

Andrew Herren, P. Richard Hahn

Abstract

A straightforward application of semi-supervised machine learning to the problem of treatment effect estimation would be to consider data as "unlabeled" if treatment assignment and covariates are observed but outcomes are unobserved. According to this formulation, large unlabeled data sets could be used to estimate a high dimensional propensity function, and causal inference using a much smaller labeled data set could proceed via weighted estimators using the learned propensity scores. In the limiting case of infinite unlabeled data, one may estimate the high dimensional propensity function exactly. However, longstanding advice in the causal inference community suggests that estimated propensity scores (from labeled data alone) are actually preferable to true propensity scores, implying that the unlabeled data is actually useless in this context. In this paper we examine this paradox and propose a simple procedure that reconciles the strong intuition that a known propensity function should be useful for estimating treatment effects with the previous literature suggesting otherwise. Further, simulation studies suggest that direct regression may be preferable to inverse-propensity weight estimators in many circumstances.
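The paradox described in the abstract can be illustrated with a small simulation. The sketch below is not the procedure proposed in the paper; it is a minimal toy setup under assumptions chosen here for illustration: a single binary covariate, a propensity of 0.7 or 0.3 depending on that covariate, a linear outcome with a true average treatment effect of 1, and hypothetical function names (`simulate`, `ipw_ate`, `estimated_propensity`, `regression_ate`, `compare`). It compares an inverse-propensity weighted (Horvitz-Thompson style) estimator using the true propensity, the same estimator using propensities estimated from the labeled sample, and a direct regression adjustment.

```python
import numpy as np


def simulate(n, rng):
    """Draw one labeled data set (toy design, assumed for illustration)."""
    x = rng.integers(0, 2, size=n)               # binary covariate
    p = np.where(x == 1, 0.7, 0.3)               # true propensity P(Z=1 | X)
    z = rng.binomial(1, p)                       # treatment assignment
    y = 1.0 * z + 2.0 * x + rng.normal(size=n)   # outcome, true ATE = 1
    return x, z, y, p


def ipw_ate(y, z, p):
    """Inverse-propensity weighted (Horvitz-Thompson style) ATE estimate."""
    return np.mean(z * y / p - (1 - z) * y / (1 - p))


def estimated_propensity(x, z):
    """Estimate the propensity by the treated fraction within each level of x."""
    p_hat = np.empty(len(x), dtype=float)
    for value in (0, 1):
        mask = x == value
        if mask.any():
            p_hat[mask] = z[mask].mean()
    # Clip to keep the weights finite in degenerate samples.
    return np.clip(p_hat, 0.01, 0.99)


def regression_ate(x, z, y):
    """Direct regression: coefficient on z in a least-squares fit of y on (1, z, x)."""
    design = np.column_stack([np.ones(len(x)), z, x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]


def compare(n=200, reps=2000, seed=0):
    """Return {estimator name: (mean, variance)} over repeated simulated samples."""
    rng = np.random.default_rng(seed)
    draws = {"ipw_true": [], "ipw_estimated": [], "regression": []}
    for _ in range(reps):
        x, z, y, p = simulate(n, rng)
        draws["ipw_true"].append(ipw_ate(y, z, p))
        draws["ipw_estimated"].append(ipw_ate(y, z, estimated_propensity(x, z)))
        draws["regression"].append(regression_ate(x, z, y))
    return {k: (float(np.mean(v)), float(np.var(v))) for k, v in draws.items()}


if __name__ == "__main__":
    for name, (mean, var) in compare().items():
        print(f"{name:>14}: mean={mean:.3f}  variance={var:.4f}")
```

In this toy design, all three estimators should center near the true effect, while the weighted estimator with the true propensity should show noticeably larger sampling variance than the version with estimated propensities. That is the pattern behind the longstanding advice mentioned in the abstract. The numbers depend entirely on the assumed design and say nothing about the high dimensional settings the paper studies.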

