Identifying Mislabeled Data using the Area Under the Margin Ranking

Paper title

Identifying Mislabeled Data using the Area Under the Margin Ranking

Paper authors

Geoff Pleiss, Tianyi Zhang, Ethan R. Elenberg, Kilian Q. Weinberger

Abstract

Not all data in a typical training set help with generalization; some samples can be overly ambiguous or outright mislabeled. This paper introduces a new method to identify such samples and mitigate their impact when training neural networks. At the heart of our algorithm is the Area Under the Margin (AUM) statistic, which exploits differences in the training dynamics of clean and mislabeled samples. A simple procedure, adding an extra class populated with purposefully mislabeled threshold samples, learns an AUM upper bound that isolates mislabeled data. This approach consistently improves upon prior work on synthetic and real-world datasets. On the WebVision50 classification task, our method removes 17% of training data, yielding a 1.6% (absolute) improvement in test error. On CIFAR100, removing 13% of the data leads to a 1.2% drop in error.
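
The abstract names the AUM statistic and the threshold-sample procedure without defining either, so the following is only a minimal sketch under stated assumptions. It assumes the margin of a sample at one epoch is its assigned-class logit minus the largest logit among the other classes, that the AUM is the average of those margins over the recorded epochs, and that the upper bound is taken as a high percentile (99 here, an assumed value) of the AUMs of the threshold samples. All function names are hypothetical and chosen for illustration.

```python
import math
from typing import List, Sequence


def margin(logits: Sequence[float], label: int) -> float:
    """Assigned-class logit minus the largest logit among the other classes (assumed definition)."""
    largest_other = max(z for i, z in enumerate(logits) if i != label)
    return logits[label] - largest_other


def area_under_margin(logit_history: Sequence[Sequence[float]], label: int) -> float:
    """Average the margin of one sample over all recorded epochs."""
    margins = [margin(logits, label) for logits in logit_history]
    return sum(margins) / len(margins)


def percentile_cutoff(threshold_aums: Sequence[float], percentile: float = 99.0) -> float:
    """Nearest-rank percentile of the threshold samples' AUMs (the percentile value is an assumption)."""
    ordered = sorted(threshold_aums)
    rank = max(1, math.ceil(percentile / 100.0 * len(ordered)))
    return ordered[rank - 1]


def flag_mislabeled(aums: Sequence[float], cutoff: float) -> List[int]:
    """Indices of samples whose AUM does not exceed the learned upper bound."""
    return [i for i, aum in enumerate(aums) if aum <= cutoff]


# Toy logits over three epochs for two samples that both carry label 0.
# The first sample's assigned logit grows; the second is pulled toward class 1.
clean_history = [[0.1, 0.0, 0.0], [1.0, 0.2, 0.1], [2.5, 0.3, 0.1]]
suspect_history = [[0.0, 0.1, 0.0], [0.2, 1.0, 0.1], [0.3, 2.5, 0.1]]
aums = [area_under_margin(clean_history, 0), area_under_margin(suspect_history, 0)]

# Made-up AUMs for samples deliberately assigned to the extra class.
threshold_aums = [-0.9, -0.5, -0.7, -0.2]
cutoff = percentile_cutoff(threshold_aums)
flagged = flag_mislabeled(aums, cutoff)
```

In this toy setup the second sample has a negative AUM below the cutoff and is the only one flagged. In practice the logits would be recorded from a real training run rather than written by hand.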
