Paper Title
Better Classifier Calibration for Small Data Sets
Paper Authors
Abstract
Classifier calibration does not always go hand in hand with a classifier's ability to separate the classes. There are applications where good classifier calibration, i.e., the ability to produce accurate probability estimates, is more important than class separation. When the amount of training data is limited, the traditional approach to improving calibration starts to crumble. In this article, we show how generating more data for calibration can improve the performance of calibration algorithms in many cases where a classifier does not naturally produce well-calibrated outputs and the traditional approach fails. The proposed approach adds computational cost, but considering that the main use case is small data sets, this extra cost remains insignificant, and prediction time is comparable to that of other methods. Of the tested classifiers, the largest improvements were observed with the random forest and naive Bayes classifiers. The proposed approach can therefore be recommended, at least for those classifiers, when the amount of data available for training is limited and good calibration is essential.
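For context, below is a minimal sketch of the traditional calibration setup the abstract refers to: a classifier's scores are calibrated on held-out folds, which becomes unreliable when the data set is small. This is not the paper's data-generation method, only the conventional baseline it aims to improve on; the toy data set, split sizes, and the choice of Platt (sigmoid) scaling are illustrative assumptions.

```python
# Sketch of the traditional held-out calibration approach (not the paper's
# proposed method). With a small data set, each calibration fold contains
# very few samples, which is the regime where this approach degrades.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import brier_score_loss

# Toy "small data" setting: only 100 samples in total (an assumption for illustration).
X, y = make_classification(n_samples=100, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Platt (sigmoid) scaling fitted on internal cross-validation folds.
clf = CalibratedClassifierCV(
    RandomForestClassifier(random_state=0), method="sigmoid", cv=3
)
clf.fit(X_train, y_train)

# Brier score measures the accuracy of the predicted probabilities.
probs = clf.predict_proba(X_test)[:, 1]
print("Brier score:", brier_score_loss(y_test, probs))
```

With roughly two dozen samples per calibration fold, the calibration map is estimated from very little data; this is the failure mode that, per the abstract, motivates generating additional calibration data instead.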