Paper title
Causal query in observational data with hidden variables
Paper authors
Paper abstract
This paper addresses the problem of causal query in observational data with hidden variables: estimating the change in an outcome when a variable is "manipulated", given a set of plausible confounding variables that affect both the manipulated variable and the outcome. Such an "experiment on data" to estimate the causal effect of the manipulated variable is useful for validating an experiment design using historical data or for exploring confounders when studying a new relationship. However, existing data-driven methods for causal effect estimation face several major challenges, including poor scalability with high-dimensional data, low estimation accuracy caused by the heuristics used in global causal structure learning algorithms, and reliance on the assumption of causal sufficiency even though hidden variables are inevitable in data. In this paper, we develop a theorem for using local search to find a superset of the adjustment (or confounding) variables for causal effect estimation from observational data under a realistic pretreatment assumption. The theorem ensures that the unbiased estimate of the causal effect is included in the set of causal effects estimated using the superset of adjustment variables. Based on this theorem, we propose a data-driven algorithm for causal query. Experiments show that the proposed algorithm is faster and produces better causal effect estimates than an existing data-driven causal effect estimation method that handles hidden variables. The causal effects estimated by the proposed algorithm are as accurate as those obtained by state-of-the-art methods that use domain knowledge.
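The claim that an unbiased estimate lies within a set of estimates can be illustrated with a small simulation. The sketch below is not the paper's algorithm: it assumes a linear model, assumes the set of effects is obtained by adjusting for every subset of a candidate superset (the abstract does not spell this out), and skips the local search step by supplying the candidates by hand. The variable names (`c`, `m`, `u1`, `u2`), the true effect of 2.0, and the helper functions are all hypothetical. Here `c` is a genuine confounder, while `m` is a pretreatment variable that is associated with both treatment and outcome only through hidden variables, so adjusting for it introduces bias.

```python
from itertools import combinations

import numpy as np


def adjusted_effect(w, y, covariates):
    """OLS coefficient of treatment w when regressing y on w plus covariates."""
    n = len(y)
    design = np.column_stack([np.ones(n), w] + list(covariates))
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]  # column 0 is the intercept, column 1 is the treatment


def effects_over_subsets(w, y, candidates):
    """Return one effect estimate per subset of the candidate superset."""
    names = sorted(candidates)
    results = {}
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            covs = [candidates[name] for name in subset]
            results[subset] = adjusted_effect(w, y, covs)
    return results


def simulate(n=50000, seed=0, true_effect=2.0):
    """Hypothetical linear data with one confounder and two hidden variables."""
    rng = np.random.default_rng(seed)
    c = rng.normal(size=n)   # observed confounder: affects w and y
    u1 = rng.normal(size=n)  # hidden: affects w and m
    u2 = rng.normal(size=n)  # hidden: affects y and m
    m = u1 + u2 + rng.normal(size=n)  # observed pretreatment variable
    w = c + u1 + rng.normal(size=n)   # treatment ("manipulated" variable)
    y = true_effect * w + c + u2 + rng.normal(size=n)  # outcome
    return w, y, {"c": c, "m": m}


if __name__ == "__main__":
    w, y, candidates = simulate()
    for subset, estimate in effects_over_subsets(w, y, candidates).items():
        print(subset, round(float(estimate), 3))
```

In this construction only the subset containing `c` alone recovers the simulated effect; adjusting for nothing, for `m` alone, or for both `c` and `m` gives biased estimates. This is one way to see why a superset of adjustment variables yields a set of candidate effects rather than a single number.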