Paper Title
Identifying Wrongly Predicted Samples: A Method for Active Learning
Paper Authors
Paper Abstract
State-of-the-art machine learning models require access to a significant amount of annotated data in order to achieve the desired level of performance. While unlabelled data can be largely available and even abundant, the annotation process can be quite expensive and limiting. Under the assumption that some samples are more important for a given task than others, active learning targets the problem of identifying the most informative samples for which one should acquire annotations. Instead of the conventional reliance on model uncertainty as a proxy to leverage new unknown labels, in this work we propose a simple sample selection criterion that moves beyond uncertainty. By first accepting the model prediction and then judging its effect on the generalization error, we can better identify wrongly predicted samples. We further present an approximation to our criterion that is very efficient and provides a similarity-based interpretation. In addition to evaluating our method on the standard benchmarks of active learning, we consider the challenging yet realistic scenario of imbalanced data where categories are not equally represented. We show state-of-the-art results and better rates at identifying wrongly predicted samples. Our method is simple, model-agnostic, and relies on the current model status without the need for re-training from scratch.
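The abstract does not spell out the criterion, so the following is only a minimal sketch of one possible reading of it, not the authors' implementation. It assumes a binary logistic-regression model, a small labelled held-out set standing in for the generalization error, and a single gradient step on the pseudo-labelled sample; the names `pseudo_label_scores`, `X_pool`, `X_held`, `y_held`, and `lr` are all hypothetical. The "approximation" shown is a plain first-order Taylor expansion, which reduces to a dot product (a similarity) between two gradients.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def loss_and_grad(w, X, y):
    """Mean logistic loss over (X, y) and its gradient with respect to w."""
    n = len(X)
    loss = 0.0
    grad = [0.0] * len(w)
    for x, t in zip(X, y):
        p = sigmoid(dot(w, x))
        loss -= (t * math.log(p) + (1 - t) * math.log(1.0 - p)) / n
        for j, xj in enumerate(x):
            grad[j] += (p - t) * xj / n
    return loss, grad

def pseudo_label_scores(w, X_pool, X_held, y_held, lr=0.01):
    """Score each unlabelled pool sample (assumed procedure, see lead-in).

    Returns a list of (exact, approx) pairs. A larger value means that
    trusting the model's own prediction on that sample would raise the
    held-out loss, which hints that the prediction may be wrong.
    """
    base_loss, g_held = loss_and_grad(w, X_held, y_held)
    scores = []
    for x in X_pool:
        # Step 1: accept the current prediction as a pseudo-label.
        y_hat = 1 if sigmoid(dot(w, x)) >= 0.5 else 0
        # Step 2: take one gradient step on that pseudo-labelled sample.
        _, g = loss_and_grad(w, [x], [y_hat])
        w_new = [wj - lr * gj for wj, gj in zip(w, g)]
        # Step 3: measure the change in held-out loss after the step.
        new_loss, _ = loss_and_grad(w_new, X_held, y_held)
        exact = new_loss - base_loss
        # First-order approximation: no update needed, only a gradient
        # dot product between the sample and the held-out set.
        approx = -lr * dot(g, g_held)
        scores.append((exact, approx))
    return scores

# Toy usage: the second feature is a constant bias input.
w = [1.0, 0.0]
X_held = [[2.0, 1.0], [0.5, 1.0], [-1.0, 1.0], [1.5, 1.0]]
y_held = [1, 0, 0, 1]
X_pool = [[0.5, 1.0], [3.0, 1.0]]
scores = pseudo_label_scores(w, X_pool, X_held, y_held)
# Pool indices ordered from most to least suspicious prediction.
ranking = sorted(range(len(X_pool)), key=lambda i: scores[i][1], reverse=True)
```

In this sketch the samples at the front of `ranking` would be the ones sent for annotation. The approximate score avoids updating and re-evaluating the model once per candidate, which is where the efficiency gain would come from under these assumptions.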