Paper Title
Visual Grounding Methods for VQA are Working for the Wrong Reasons!
Paper Authors
Paper Abstract
Existing Visual Question Answering (VQA) methods tend to exploit dataset biases and spurious statistical correlations, instead of producing right answers for the right reasons. To address this issue, recent bias mitigation methods for VQA propose to incorporate visual cues (e.g., human attention maps) to better ground the VQA models, showcasing impressive gains. However, we show that the performance improvements are not a result of improved visual grounding, but a regularization effect which prevents over-fitting to linguistic priors. For instance, we find that it is not actually necessary to provide proper, human-based cues; random, insensible cues also result in similar improvements. Based on this observation, we propose a simpler regularization scheme that does not require any external annotations and yet achieves near state-of-the-art performance on VQA-CPv2.
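The abstract's key finding, that random cues regularize as effectively as human attention maps, can be illustrated with a minimal sketch. The function and variable names below are hypothetical, and the actual loss used in the paper may differ; this only shows the shape of an auxiliary attention-supervision penalty, which is agnostic to whether the cue map is meaningful:

```python
import numpy as np

rng = np.random.default_rng(0)

def attention_regularizer(model_attention, cue):
    # Hypothetical auxiliary loss: mean-squared distance between the
    # model's attention distribution and a target cue map.
    return float(np.mean((model_attention - cue) ** 2))

n_regions = 36  # number of image regions attended over

# Model's current attention over regions (normalized to sum to 1)
model_attention = rng.random(n_regions)
model_attention /= model_attention.sum()

# A "proper" human-annotated cue vs. a random, insensible cue,
# both normalized the same way (both are synthetic here).
human_cue = rng.random(n_regions)
human_cue /= human_cue.sum()
random_cue = rng.random(n_regions)
random_cue /= random_cue.sum()

# Either cue produces a penalty of the same form and magnitude: the
# regularization pressure on the model does not depend on the cue
# being semantically meaningful, consistent with the paper's finding.
print(attention_regularizer(model_attention, human_cue))
print(attention_regularizer(model_attention, random_cue))
```

In training, such a term would be added to the VQA answer loss with a weighting coefficient; the point of the sketch is that nothing in the term itself distinguishes a human cue from a random one.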