Paper Title

Learning from Lexical Perturbations for Consistent Visual Question Answering

Authors

Spencer Whitehead, Hui Wu, Yi Ren Fung, Heng Ji, Rogerio Feris, Kate Saenko

Abstract

Existing Visual Question Answering (VQA) models are often fragile and sensitive to input variations. In this paper, we propose a novel approach to address this issue based on modular networks, which creates two questions related by linguistic perturbations and regularizes the visual reasoning process between them to be consistent during training. We show that our framework markedly improves consistency and generalization ability, demonstrating the value of controlled linguistic perturbations as a useful and currently underutilized training and regularization tool for VQA models. We also present VQA Perturbed Pairings (VQA P2), a new, low-cost benchmark and augmentation pipeline to create controllable linguistic variations of VQA questions. Our benchmark uniquely draws from large-scale linguistic resources, avoiding human annotation effort while maintaining data quality compared to generative approaches. We benchmark existing VQA models using VQA P2 and provide a robustness analysis for each type of linguistic variation.
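The core training idea described in the abstract pairs each question with a lexically perturbed variant and regularizes the model to answer the pair consistently. The snippet below is a minimal sketch of that idea, not the authors' implementation: the paper regularizes the visual reasoning process inside modular networks, while this sketch applies a consistency penalty at the answer-distribution level of a generic `vqa_model`. The names `vqa_model`, `lambda_consistency`, and the symmetric-KL penalty are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of consistency regularization
# between a question and its lexically perturbed variant.
import torch
import torch.nn.functional as F

def consistency_training_step(vqa_model, image, question, perturbed_question,
                              answer_target, lambda_consistency=1.0):
    """One training step that pairs a question with its perturbed variant
    and penalizes divergence between the two answer distributions."""
    logits_orig = vqa_model(image, question)           # [batch, num_answers]
    logits_pert = vqa_model(image, perturbed_question)

    # Supervised classification loss on both variants (simplified here;
    # VQA losses are often soft/multi-label in practice).
    task_loss = (F.cross_entropy(logits_orig, answer_target) +
                 F.cross_entropy(logits_pert, answer_target))

    # Symmetric KL divergence encourages the two predictions to agree.
    p = F.log_softmax(logits_orig, dim=-1)
    q = F.log_softmax(logits_pert, dim=-1)
    consistency_loss = 0.5 * (
        F.kl_div(p, q.exp(), reduction="batchmean") +
        F.kl_div(q, p.exp(), reduction="batchmean"))

    return task_loss + lambda_consistency * consistency_loss
```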

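The abstract states that the VQA P2 pipeline draws on large-scale linguistic resources to create controllable linguistic variations without human annotation. As a rough illustration only, the sketch below generates one plausible variation type, synonym substitution, using WordNet via NLTK; the choice of resource, the `perturb_question` helper, and the `substitution_prob` parameter are assumptions, since the abstract does not specify the pipeline's details.

```python
# Illustrative sketch of one lexical perturbation type (synonym
# substitution), assuming WordNet via NLTK as the linguistic resource.
import random
from nltk.corpus import wordnet  # requires: nltk.download("wordnet")

def perturb_question(question, substitution_prob=0.3, seed=0):
    """Return a lexically perturbed copy of `question` by swapping
    individual words for WordNet synonyms."""
    rng = random.Random(seed)
    out = []
    for tok in question.split():
        # Collect candidate synonyms for this token, excluding the token itself.
        lemmas = {lemma.name().replace("_", " ")
                  for synset in wordnet.synsets(tok)
                  for lemma in synset.lemmas()} - {tok}
        if lemmas and rng.random() < substitution_prob:
            out.append(rng.choice(sorted(lemmas)))
        else:
            out.append(tok)
    return " ".join(out)

# Example: perturb_question("What color is the large couch?")
# might yield something like "What colour is the big sofa?"
```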