Paper title
FACT: High-Dimensional Random Forests Inference
Paper authors
Paper abstract
Quantifying the usefulness of individual features in random forests learning can greatly enhance its interpretability. Existing studies have shown that some popularly used feature importance measures for random forests suffer from bias. In addition, comprehensive size and power analyses are lacking for most of these existing methods. In this paper, we approach the problem via hypothesis testing and suggest a framework, the self-normalized feature-residual correlation test (FACT), for evaluating the significance of a given feature in the random forests model with a bias-resistance property, where our null hypothesis concerns whether the feature is conditionally independent of the response given all other features. This endeavor on random forests inference is empowered by some recent developments on high-dimensional random forests consistency. Under a fairly general high-dimensional nonparametric model setting with dependent features, we formally establish that FACT provides a theoretically justified feature importance test with controlled type I error and enjoys appealing power properties. The theoretical results and finite-sample advantages of the newly suggested method are illustrated with several simulation examples and an economic forecasting application.
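To make the idea of a self-normalized feature-residual correlation concrete, here is a minimal sketch. It is not the paper's FACT procedure, whose exact statistic and sample-splitting scheme are not given in the abstract. It assumes two hypothetical inputs computed elsewhere: `response_resid`, the response minus an out-of-sample prediction from a model (for example, a random forest) fit on all other features, and `feature_resid`, the tested feature after its dependence on the other features has been removed in the same way. The function name and the normal approximation used for the p-value are likewise illustrative assumptions.

```python
import math
from typing import Sequence, Tuple


def self_normalized_correlation_test(
    feature_resid: Sequence[float],
    response_resid: Sequence[float],
) -> Tuple[float, float]:
    """Generic self-normalized test of zero correlation between two residual series.

    Illustrative only: the inputs are assumed to be (approximately) mean-zero
    residuals computed on held-out data. Returns (statistic, two-sided p-value).
    """
    if len(feature_resid) != len(response_resid):
        raise ValueError("residual series must have the same length")

    # Pointwise products; their mean is near zero when the residuals are uncorrelated.
    products = [u * v for u, v in zip(feature_resid, response_resid)]

    # Self-normalization: scale the sum by the root of the sum of squared products,
    # so no separate variance estimate is needed.
    scale = math.sqrt(sum(p * p for p in products))
    if scale == 0.0:
        raise ValueError("all residual products are zero; statistic is undefined")

    stat = sum(products) / scale

    # Two-sided p-value under a standard normal approximation (an assumption here).
    p_value = math.erfc(abs(stat) / math.sqrt(2.0))
    return stat, p_value
```

A large absolute statistic (small p-value) would suggest that the feature carries information about the response beyond what the other features explain, which is evidence against the conditional-independence null described in the abstract.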