A Benchmark for Compositional Visual Reasoning

Paper Title

A Benchmark for Compositional Visual Reasoning

Authors

Aimen Zerroug, Mohit Vaishnav, Julien Colin, Sebastian Musslick, Thomas Serre

Abstract

A fundamental component of human vision is our ability to parse complex visual scenes and judge the relations between their constituent objects. AI benchmarks for visual reasoning have driven rapid progress in recent years, with state-of-the-art systems now reaching human accuracy on some of these benchmarks. Yet a major gap remains in the sample efficiency with which humans and AI systems learn new visual reasoning tasks. Humans' remarkable efficiency at learning has been at least partially attributed to their ability to harness compositionality, which lets them take advantage of previously gained knowledge when learning new tasks. Here, we introduce a novel visual reasoning benchmark, Compositional Visual Relations (CVR), to drive progress towards the development of more data-efficient learning algorithms. We take inspiration from fluid intelligence and non-verbal reasoning tests and describe a novel method for creating compositions of abstract rules and associated image datasets at scale. Our proposed benchmark includes measures of sample efficiency, generalization and transfer across task rules, as well as the ability to leverage compositionality. We systematically evaluate modern neural architectures and find that, surprisingly, convolutional architectures surpass transformer-based architectures across all performance measures in most data regimes. However, all computational models are much less data-efficient than humans, even after learning informative visual representations using self-supervision. Overall, we hope that our challenge will spur interest in the development of neural architectures that can learn to harness compositionality toward more efficient learning.
