Paper title
Geometry- and Accuracy-Preserving Random Forest Proximities
Paper authors
Abstract
Random forests are considered one of the best out-of-the-box classification and regression algorithms due to their high level of predictive performance with relatively little tuning. Pairwise proximities can be computed from a trained random forest and measure the similarity between data points relative to the supervised task. Random forest proximities have been used in many applications including the identification of variable importance, data imputation, outlier detection, and data visualization. However, existing definitions of random forest proximities do not accurately reflect the data geometry learned by the random forest. In this paper, we introduce a novel definition of random forest proximities called Random Forest-Geometry- and Accuracy-Preserving proximities (RF-GAP). We prove that the proximity-weighted sum (regression) or majority vote (classification) using RF-GAP exactly matches the out-of-bag random forest prediction, thus capturing the data geometry learned by the random forest. We empirically show that this improved geometric representation outperforms traditional random forest proximities in tasks such as data imputation and provides outlier detection and visualization results consistent with the learned data geometry.
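The claim that the proximity-weighted sum reproduces the out-of-bag prediction can be checked numerically on a toy problem. The abstract does not state the proximity formula, so the sketch below assumes the form p(i, j) = (1/|S_i|) * sum over t in S_i of c_j(t) * 1[j shares i's leaf in t] / |M_i(t)|, where S_i is the set of trees in which point i is out-of-bag, c_j(t) is the in-bag multiplicity of point j in tree t, and |M_i(t)| is the in-bag size (counting repeats) of the leaf that i falls into. The "trees" here are hypothetical one-split stumps on a single feature, not a real random forest implementation, and the function name `rf_gap_demo` is illustrative only.

```python
import random


def rf_gap_demo(n=12, n_trees=50, seed=0):
    """Toy regression check: RF-GAP-weighted sum vs. out-of-bag prediction."""
    rng = random.Random(seed)
    x = [rng.random() for _ in range(n)]
    y = [3.0 * xi + rng.gauss(0.0, 0.1) for xi in x]

    prox = [[0.0] * n for _ in range(n)]  # accumulates unnormalized p(i, j)
    oob_sum = [0.0] * n                   # sum of per-tree OOB predictions
    oob_count = [0] * n                   # |S_i|: trees where i is out-of-bag

    for _tree in range(n_trees):
        # Bootstrap sample: counts[j] is the in-bag multiplicity c_j(t).
        counts = [0] * n
        for _draw in range(n):
            counts[rng.randrange(n)] += 1

        # Assumed toy tree: a single split placed at an in-bag x value, so
        # both leaves are guaranteed to contain at least one in-bag point.
        in_bag_x = sorted({x[j] for j in range(n) if counts[j] > 0})
        if len(in_bag_x) >= 2:
            threshold = in_bag_x[rng.randrange(len(in_bag_x) - 1)]
            leaf = [0 if xi <= threshold else 1 for xi in x]
        else:
            leaf = [0] * n  # degenerate bootstrap: one leaf holds everything

        for i in range(n):
            if counts[i] > 0:
                continue  # only out-of-bag points receive a prediction
            members = [j for j in range(n)
                       if counts[j] > 0 and leaf[j] == leaf[i]]
            leaf_size = sum(counts[j] for j in members)  # |M_i(t)|
            # The leaf prediction is the in-bag mean, counting repeats.
            oob_sum[i] += sum(counts[j] * y[j] for j in members) / leaf_size
            oob_count[i] += 1
            for j in members:
                prox[i][j] += counts[j] / leaf_size

    oob_pred = [None] * n
    weighted = [None] * n
    for i in range(n):
        if oob_count[i] == 0:
            continue  # point was never out-of-bag; no prediction defined
        prox[i] = [p / oob_count[i] for p in prox[i]]
        oob_pred[i] = oob_sum[i] / oob_count[i]
        weighted[i] = sum(prox[i][j] * y[j] for j in range(n))
    return prox, oob_pred, weighted


if __name__ == "__main__":
    prox, oob_pred, weighted = rf_gap_demo()
    for i, (a, b) in enumerate(zip(oob_pred, weighted)):
        if a is not None:
            print(f"point {i}: oob={a:.6f}  proximity-weighted={b:.6f}")
```

Under these assumptions each proximity row sums to one and the self-proximity is zero, so the weighted sum is an average over other training points only. In the classification case the same weights would be applied to class votes rather than to target values.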