Paper Title
Roses Are Red, Violets Are Blue... but Should VQA Expect Them To?
Paper Authors
Paper Abstract
Models for Visual Question Answering (VQA) are notorious for their tendency to rely on dataset biases, as the large and unbalanced diversity of questions and concepts involved tends to prevent models from learning to reason, leading them to perform educated guesses instead. In this paper, we claim that the standard evaluation metric, which consists in measuring the overall in-domain accuracy, is misleading. Since questions and concepts are unbalanced, this metric tends to favor models which exploit subtle training set statistics. Alternatively, naively introducing artificial distribution shifts between train and test splits is also not completely satisfying. First, the shifts do not reflect real-world tendencies, resulting in unsuitable models; second, since the shifts are handcrafted, trained models are specifically designed for this particular setting and do not generalize to other configurations. We propose the GQA-OOD benchmark, designed to overcome these concerns: we measure and compare accuracy over both rare and frequent question-answer pairs, and argue that the former is better suited to the evaluation of reasoning abilities, which we experimentally validate with models trained to more or less exploit biases. In a large-scale study involving 7 VQA models and 3 bias reduction techniques, we also experimentally demonstrate that these models fail to address questions involving infrequent concepts, and we provide recommendations for future directions of research.
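To make the evaluation idea concrete, here is a minimal sketch in the spirit of the rare-vs-frequent (acc-tail / acc-head) split described above. It is not the paper's exact protocol: the abstract does not specify how rarity is determined, so the grouping here (a global answer-frequency quantile) and the function and parameter names (`split_accuracy`, `tail_quantile`) are illustrative assumptions.

```python
from collections import Counter

def split_accuracy(samples, tail_quantile=0.2):
    """Compare accuracy on rare vs. frequent ground-truth answers.

    Illustrative sketch only: GQA-OOD's actual grouping is more
    fine-grained than this global answer-frequency quantile.
    `samples` is a list of (ground_truth_answer, predicted_answer).
    """
    counts = Counter(gt for gt, _ in samples)
    # Answers whose frequency falls at or below the tail_quantile
    # of observed answer counts are treated as "rare" (the tail).
    threshold = sorted(counts.values())[int(tail_quantile * (len(counts) - 1))]
    head_hits = head_total = tail_hits = tail_total = 0
    for gt, pred in samples:
        if counts[gt] <= threshold:
            tail_total += 1
            tail_hits += gt == pred
        else:
            head_total += 1
            head_hits += gt == pred
    acc_tail = tail_hits / tail_total if tail_total else float("nan")
    acc_head = head_hits / head_total if head_total else float("nan")
    return acc_tail, acc_head

# Example: a model that always guesses the most frequent answer looks
# strong on the head but collapses on the tail, which is exactly the
# bias-exploiting behavior that overall in-domain accuracy hides.
preds = [("red", "red")] * 8 + [("blue", "red")] * 2
print(split_accuracy(preds))  # -> (0.0, 1.0): zero acc-tail, perfect acc-head
```

The toy example shows why reporting the two numbers separately matters: overall accuracy here is 80%, yet every question with a rare answer is wrong.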