Why is Winoground Hard? Investigating Failures in Visuolinguistic Compositionality

Paper title

Why is Winoground Hard? Investigating Failures in Visuolinguistic Compositionality

Authors

Anuj Diwan, Layne Berry, Eunsol Choi, David Harwath, Kyle Mahowald

Abstract

Recent visuolinguistic pre-trained models show promising progress on various end tasks such as image retrieval and video captioning. Yet, they fail miserably on the recently proposed Winoground dataset, which challenges models to match paired images and English captions, with items constructed to overlap lexically but differ in meaning (e.g., "there is a mug in some grass" vs. "there is some grass in a mug"). By annotating the dataset using new fine-grained tags, we show that solving the Winoground task requires not just compositional language understanding, but a host of other abilities like commonsense reasoning or locating small, out-of-focus objects in low-resolution images. In this paper, we identify the dataset's main challenges through a suite of experiments on related tasks (probing task, image retrieval task), data augmentation, and manual inspection of the dataset. Our analysis suggests that a main challenge in visuolinguistic models may lie in fusing visual and textual representations, rather than in compositional language understanding. We release our annotation and code at https://github.com/ajd12342/why-winoground-hard .
