Paper title
An Exploration of How Training Set Composition Bias in Machine Learning Affects Identifying Rare Objects
Paper authors
Abstract
When training a machine learning classifier on data where one of the classes is intrinsically rare, the classifier will often assign too few sources to the rare class. To address this, it is common to up-weight the examples of the rare class to ensure it isn't ignored. For the same reason, it is also a frequent practice to train on restricted data where the balance of source types is closer to equal. Here we show that these practices can bias the model toward over-assigning sources to the rare class. We also explore how to detect when training data bias has had a statistically significant impact on the trained model's predictions, and how to reduce the bias's impact. While the magnitude of the impact of the techniques developed here will vary with the details of the application, for most cases it should be modest. The techniques are, however, universally applicable whenever a machine learning classification model is used, making them analogous to Bessel's correction to the sample variance.
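To make the over-assignment effect concrete, the sketch below applies the standard Bayes prior-shift adjustment to a classifier's output probabilities. It is not the method developed in the paper, only an illustration of why a model trained on a rebalanced set tends to over-assign sources to the rare class. The function name `correct_for_prior_shift`, the example scores, and the rare-class fractions (50% in training, 1% in the real population) are all assumptions chosen for illustration, and the sketch assumes the classifier's scores are calibrated probabilities under the training balance.

```python
def correct_for_prior_shift(p_train, train_rare_fraction, true_rare_fraction):
    """Re-express a rare-class probability under the true class balance.

    p_train: probability of the rare class from a model trained on data
        whose rare-class fraction is train_rare_fraction (assumed calibrated).
    true_rare_fraction: assumed rare-class fraction in the real population.
    Bayes' rule: the likelihood ratio stays fixed, only the prior changes.
    """
    # Weight each class by how much its prior changes between the two settings.
    w_rare = true_rare_fraction / train_rare_fraction
    w_common = (1.0 - true_rare_fraction) / (1.0 - train_rare_fraction)
    numerator = p_train * w_rare
    return numerator / (numerator + (1.0 - p_train) * w_common)


def count_rare(probabilities, threshold=0.5):
    """Count how many sources would be assigned to the rare class."""
    return sum(1 for p in probabilities if p > threshold)


# Hypothetical scores from a model trained on a balanced (50/50) sample.
scores = [0.2, 0.6, 0.9, 0.995]
corrected = [correct_for_prior_shift(p, 0.5, 0.01) for p in scores]

n_uncorrected = count_rare(scores)
n_corrected = count_rare(corrected)
print(n_uncorrected, n_corrected)
```

With these illustrative numbers, three of the four sources pass the 0.5 threshold when the balanced-training scores are used directly, but only one does after adjusting to a 1% rare-class fraction. The adjustment requires an estimate of the true class balance, which in practice may itself be uncertain.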