Paper Title
Interpretable Meta-Measure for Model Performance
Paper Authors
Paper Abstract
Benchmarks for the evaluation of model performance play an important role in machine learning. However, there is no established way to describe and create new benchmarks. Moreover, the most common benchmarks rely on performance measures that share several limitations. For example, the difference in performance between two models has no probabilistic interpretation, there is no reference point to indicate whether it represents a significant improvement, and it is meaningless to compare such differences across data sets. We introduce a new meta-score assessment named Elo-based Predictive Power (EPP) that is built on top of other performance measures and allows for interpretable comparisons of models. Differences in EPP scores have a probabilistic interpretation and can be compared directly across data sets; furthermore, the logistic regression-based design allows for an assessment of ranking fitness based on a deviance statistic. We prove the mathematical properties of EPP and support them with the empirical results of a large-scale benchmark on 30 classification data sets and a real-world benchmark for visual data. Additionally, we propose a Unified Benchmark Ontology that is used to give a uniform description of benchmarks.
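To make the logistic regression-based design mentioned in the abstract more concrete, below is a minimal sketch of an Elo-style fit, assuming pairwise "matches" between models: model i beats model j on a given cross-validation split if it achieves a better value of the underlying performance measure (e.g. AUC). The function name epp_scores and the input format are illustrative assumptions, not the authors' reference implementation; the property being illustrated is that the fitted coefficients satisfy P(model i beats model j) = sigmoid(EPP_i - EPP_j).

```python
# Sketch only: EPP-style scores from pairwise win/loss records via logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression

def epp_scores(wins):
    """wins: list of (winner_index, loser_index) pairs over n models."""
    n = 1 + max(max(w, l) for w, l in wins)
    X, y = [], []
    for w, l in wins:
        row = np.zeros(n)
        row[w], row[l] = 1.0, -1.0      # design row: +1 for the winner, -1 for the loser
        X.append(row); y.append(1)
        X.append(-row); y.append(0)     # mirrored observation from the loser's perspective
    model = LogisticRegression(fit_intercept=False, C=1e6)  # very weak penalty, near-MLE fit
    model.fit(np.array(X), np.array(y))
    return model.coef_.ravel()          # one score per model, identified up to a common shift

# Example: model 0 beats model 1 in three of four splits.
scores = epp_scores([(0, 1), (0, 1), (0, 1), (1, 0)])
diff = scores[0] - scores[1]
print("P(model 0 beats model 1) =", 1 / (1 + np.exp(-diff)))
```

In this sketch only differences of coefficients enter the win probability, so the raw scores are identified only up to a common shift; this matches the abstract's emphasis on differences in EPP scores, rather than their absolute values, as the quantity with a probabilistic interpretation.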