Title
When does dough become a bagel? Analyzing the remaining mistakes on ImageNet
Authors
Abstract
Image classification accuracy on the ImageNet dataset has been a barometer for progress in computer vision over the last decade. Several recent papers have questioned the degree to which the benchmark remains useful to the community, yet innovations continue to contribute gains to performance, with today's largest models achieving 90%+ top-1 accuracy. To help contextualize progress on ImageNet and provide a more meaningful evaluation for today's state-of-the-art models, we manually review and categorize every remaining mistake that a few top models make in order to provide insight into the long-tail of errors on one of the most benchmarked datasets in computer vision. We focus on the multi-label subset evaluation of ImageNet, where today's best models achieve upwards of 97% top-1 accuracy. Our analysis reveals that nearly half of the supposed mistakes are not mistakes at all, and we uncover new valid multi-labels, demonstrating that, without careful review, we are significantly underestimating the performance of these models. On the other hand, we also find that today's best models still make a significant number of mistakes (40%) that are obviously wrong to human reviewers. To calibrate future progress on ImageNet, we provide an updated multi-label evaluation set, and we curate ImageNet-Major: a 68-example "major error" slice of the obvious mistakes made by today's top models -- a slice where models should achieve near perfection, but today are far from doing so.
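The multi-label evaluation the abstract refers to can be sketched as follows: a model's top-1 prediction counts as correct if it matches *any* of the valid labels for an image, rather than only the single original ImageNet label. This is a minimal illustrative sketch, not the paper's implementation; the function name and the example labels are hypothetical.

```python
# Sketch of multi-label top-1 accuracy: a prediction is correct if it
# appears in the image's set of valid labels (labels below are
# illustrative, not drawn from the paper's evaluation set).

def multilabel_top1_accuracy(predictions, valid_labels):
    """predictions: one top-1 predicted class name per image;
    valid_labels: one set of acceptable class names per image."""
    correct = sum(pred in labels
                  for pred, labels in zip(predictions, valid_labels))
    return correct / len(predictions)

preds = ["bagel", "dough", "laptop"]
labels = [
    {"bagel", "dough"},     # ambiguous image: either label is accepted
    {"pretzel"},            # prediction misses the only valid label
    {"laptop", "notebook"}, # prediction matches one of two valid labels
]
print(multilabel_top1_accuracy(preds, labels))  # 2 of 3 correct
```

Under this protocol, uncovering additional valid labels (as the authors' manual review does) can only raise measured accuracy, which is why the abstract argues single-label scoring underestimates today's models.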