Paper Title
MGA-VQA: Multi-Granularity Alignment for Visual Question Answering
Paper Authors
Paper Abstract
Learning to answer visual questions is a challenging task since the multi-modal inputs lie in two different feature spaces. Moreover, reasoning in visual question answering requires the model to understand both the image and the question and align them in a shared space, rather than simply memorize statistics about question-answer pairs. It is therefore essential to find component connections both across modalities and within each modality to achieve better attention. Previous works learned attention weights directly on the features. However, the improvement is limited because the two modalities live in different domains: image features are highly diverse and lack the structure and grammatical rules of language, while natural language features are more likely to miss detailed information. To better learn the attention between vision and text, we focus on how to stratify the inputs and embed structural information so as to improve the alignment between components at different levels. We propose a Multi-Granularity Alignment architecture for the Visual Question Answering task (MGA-VQA), which learns intra- and inter-modality correlations through multi-granularity alignment and outputs the final result through a decision fusion module. In contrast to previous works, our model splits alignment into different levels to learn better correlations without needing additional data or annotations. Experiments on the VQA-v2 and GQA datasets demonstrate that our model significantly outperforms non-pretrained state-of-the-art methods on both datasets without extra pretraining data or annotations. Moreover, it even achieves better results than the pre-trained methods on GQA.
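The abstract does not give implementation details, but the overall structure it describes (per-level intra-/inter-modality alignment followed by decision fusion) can be sketched roughly as below. All module names, dimensions, and the specific attention and fusion choices are illustrative assumptions, not the authors' actual architecture.

```python
# Minimal sketch of a multi-granularity alignment model (hypothetical design).
import torch
import torch.nn as nn


class CrossModalAlignment(nn.Module):
    """One granularity level: intra-modality self-attention followed by
    inter-modality cross-attention between visual and textual features."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.img_self = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.txt_self = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.txt_to_img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.img_to_txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, img_feats, txt_feats):
        # Intra-modality correlations within each modality.
        img_feats, _ = self.img_self(img_feats, img_feats, img_feats)
        txt_feats, _ = self.txt_self(txt_feats, txt_feats, txt_feats)
        # Inter-modality correlations: each modality attends to the other.
        img_aligned, _ = self.txt_to_img(img_feats, txt_feats, txt_feats)
        txt_aligned, _ = self.img_to_txt(txt_feats, img_feats, img_feats)
        # Pool and concatenate into a per-level joint representation.
        return torch.cat([img_aligned.mean(dim=1), txt_aligned.mean(dim=1)], dim=-1)


class MGAVQASketch(nn.Module):
    """Toy model: one alignment module per granularity level, plus a
    decision fusion module that weights the per-level answer scores."""

    def __init__(self, dim: int = 512, num_levels: int = 3, num_answers: int = 3129):
        super().__init__()
        self.levels = nn.ModuleList(CrossModalAlignment(dim) for _ in range(num_levels))
        self.classifiers = nn.ModuleList(
            nn.Linear(2 * dim, num_answers) for _ in range(num_levels)
        )
        # Decision fusion: learned softmax weights over the level-wise predictions.
        self.fusion_weights = nn.Parameter(torch.ones(num_levels) / num_levels)

    def forward(self, img_levels, txt_levels):
        # img_levels / txt_levels: lists of [batch, tokens, dim] tensors,
        # one per granularity (e.g. object/region/image, word/phrase/question).
        scores = [
            clf(align(v, t))
            for align, clf, v, t in zip(self.levels, self.classifiers, img_levels, txt_levels)
        ]
        weights = torch.softmax(self.fusion_weights, dim=0)
        return sum(w * s for w, s in zip(weights, scores))


if __name__ == "__main__":
    model = MGAVQASketch()
    imgs = [torch.randn(2, 36, 512) for _ in range(3)]
    txts = [torch.randn(2, 14, 512) for _ in range(3)]
    print(model(imgs, txts).shape)  # torch.Size([2, 3129])
```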