Paper Title
Posterior Re-calibration for Imbalanced Datasets
Paper Authors
Paper Abstract
Neural networks can perform poorly when the training label distribution is heavily imbalanced, as well as when the testing data differs from the training distribution. To deal with shifts in the test label distribution caused by imbalance, we motivate the problem from the perspective of an optimal Bayes classifier and derive a post-training prior rebalancing technique that can be solved through a KL-divergence based optimization. This method allows a flexible post-training hyper-parameter to be efficiently tuned on a validation set and effectively modifies the classifier margin to deal with this imbalance. We further combine this method with existing likelihood shift methods, re-interpreting them from the same Bayesian perspective, and demonstrate that our method can deal with both problems in a unified way. The resulting algorithm can be conveniently applied to probabilistic classification problems, agnostic to the underlying architecture. Our results on six different datasets and five different architectures show state-of-the-art accuracy, including on large-scale imbalanced datasets such as iNaturalist for classification and Synthia for semantic segmentation. Please see https://github.com/GT-RIPL/UNO-IC.git for the implementation.
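The core idea of post-training prior rebalancing can be illustrated with a minimal sketch. This is not the paper's exact algorithm (the actual method is derived via a KL-divergence optimization; see the linked repository); it only shows the general pattern the abstract describes: re-weight the trained model's posterior by the training class prior raised to a tunable power `tau`, then pick `tau` on a validation set. The function names and the grid of `tau` values are illustrative assumptions.

```python
import numpy as np

def rebalance_posterior(probs, train_prior, tau):
    """Re-weight class posteriors against the training prior.

    probs: (N, C) softmax outputs of a model trained on imbalanced data.
    train_prior: (C,) empirical class frequencies of the training set.
    tau: post-training hyper-parameter; tau=0 leaves the posterior
         unchanged, tau=1 fully discounts the training prior (moving
         toward a uniform test prior). Intermediate values interpolate.
    """
    weights = train_prior ** (-tau)          # down-weight frequent classes
    adjusted = probs * weights
    return adjusted / adjusted.sum(axis=1, keepdims=True)

def tune_tau(probs_val, labels_val, train_prior,
             taus=np.linspace(0.0, 2.0, 21)):
    """Grid-search tau for best accuracy on a held-out validation set."""
    best_tau, best_acc = 0.0, -1.0
    for tau in taus:
        preds = rebalance_posterior(probs_val, train_prior, tau).argmax(axis=1)
        acc = float((preds == labels_val).mean())
        if acc > best_acc:
            best_tau, best_acc = tau, acc
    return best_tau
```

Because the adjustment is applied purely to the output probabilities, it works with any base architecture that produces a posterior, which matches the abstract's architecture-agnostic claim.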