Paper Title
All You May Need for VQA are Image Captions
Paper Authors
Paper Abstract
Visual Question Answering (VQA) has benefited from increasingly sophisticated models, but has not enjoyed the same level of engagement in terms of data creation. In this paper, we propose a method that automatically derives VQA examples at volume, by leveraging the abundance of existing image-caption annotations combined with neural models for textual question generation. We show that the resulting data is of high quality. VQA models trained on our data improve state-of-the-art zero-shot accuracy by double digits and achieve a level of robustness that is lacking in the same model trained on human-annotated VQA data.
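To make the core idea concrete, the sketch below shows one way a caption could be turned into a (question, answer) training pair with an off-the-shelf seq2seq model. This is only an illustration of caption-based question generation, not the authors' actual pipeline; the checkpoint name, the prompt format, and the `caption_to_qa` helper are all assumptions for the example (a T5 model fine-tuned for question generation is assumed, the plain `t5-base` checkpoint used here is just a placeholder).

```python
# Illustrative sketch (not the paper's pipeline): generate a question from an
# image caption and a chosen answer span, producing a VQA-style training pair.
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Placeholder checkpoint; a question-generation fine-tuned T5 is assumed.
MODEL_NAME = "t5-base"

tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

def caption_to_qa(caption: str, answer_span: str) -> str:
    """Generate a question whose answer is `answer_span`, conditioned on a caption.

    The "generate question: ... answer: ..." prompt is one common convention
    for QG-fine-tuned T5 checkpoints; other models may expect a different format.
    """
    prompt = f"generate question: {caption} answer: {answer_span}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=32, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# One caption plus a candidate answer span yields one synthetic VQA example.
caption = "A brown dog is catching a red frisbee in the park."
question = caption_to_qa(caption, "a red frisbee")
print(question, "->", "a red frisbee")
```

Run over a large caption corpus, this kind of generation step is what would produce VQA examples "at volume" without any human question annotation.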