Paper Title
Guiding Visual Question Answering with Attention Priors
Paper Authors
Paper Abstract
The current success of modern visual reasoning systems is arguably attributable to cross-modality attention mechanisms. However, in deliberative reasoning such as VQA, attention is unconstrained at each step, and thus may serve as a statistical pooling mechanism rather than a semantic operation intended to select information relevant to inference. This is because, at training time, attention is guided only by a very sparse signal (i.e., the answer label) at the end of the inference chain. As a result, the cross-modality attention weights deviate from the desired visual-language bindings. To rectify this deviation, we propose to guide the attention mechanism using explicit linguistic-visual grounding. This grounding is derived by connecting structured linguistic concepts in the query to their referents among the visual objects. Here we learn the grounding from the pairing of questions and images alone, without answer annotations or external grounding supervision. The grounding guides the attention mechanism inside VQA models through a dual mechanism: pre-training the attention weight calculation, and directly guiding the weights at inference time on a case-by-case basis. The resulting algorithm is capable of probing attention-based reasoning models, injecting relevant associative knowledge, and regulating the core reasoning process. This scalable enhancement improves the performance of VQA models, strengthens their robustness when access to supervised data is limited, and increases interpretability.
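To make the "guiding attention with a grounding prior" idea concrete, below is a minimal sketch of how an external linguistic-visual grounding distribution could be blended into a cross-modality attention step at inference time. This is not the authors' implementation; the function name `grounded_cross_attention`, the tensor shapes, and the mixing coefficient `alpha` are all hypothetical, chosen only to illustrate a convex combination of learned attention weights and a grounding prior over visual objects.

```python
import torch
import torch.nn.functional as F


def grounded_cross_attention(query, keys, values, grounding_prior, alpha=0.5):
    """Illustrative cross-modality attention nudged toward a grounding prior.

    query:           (B, D)    language-side query vectors (hypothetical shapes)
    keys, values:    (B, N, D) visual object features
    grounding_prior: (B, N)    probability of each object being the referent
    alpha:           mixing weight between learned attention and the prior
    """
    # Standard scaled dot-product attention logits over the N visual objects.
    logits = torch.einsum('bd,bnd->bn', query, keys) / keys.size(-1) ** 0.5
    attn = F.softmax(logits, dim=-1)

    # Inject the grounding prior: convex combination of the learned attention
    # distribution and the prior, renormalised so it remains a distribution.
    guided = (1.0 - alpha) * attn + alpha * grounding_prior
    guided = guided / guided.sum(dim=-1, keepdim=True)

    # Attend to the visual objects with the guided weights.
    attended = torch.einsum('bn,bnd->bd', guided, values)
    return attended, guided
```

Under this reading, setting `alpha = 0` recovers unguided attention, while larger values push the weights toward the visual objects the grounding module believes the question refers to; the pre-training variant mentioned in the abstract would instead use the prior as a supervision target for the attention weights during training.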