Paper Title

Component Analysis for Visual Question Answering Architectures

Authors

Camila Kolling, Jônatas Wehrmann, Rodrigo C. Barros

Abstract

Recent research advances in Computer Vision and Natural Language Processing have introduced novel tasks that are paving the way for solving AI-complete problems. One of those tasks is called Visual Question Answering (VQA). A VQA system must take an image and a free-form, open-ended natural language question about that image, and produce a natural language answer as output. Such a task has drawn great attention from the scientific community, which has generated a plethora of approaches aiming to improve VQA predictive accuracy. Most of them comprise three major components: (i) independent representation learning of images and questions; (ii) feature fusion, so the model can use information from both sources to answer visual questions; and (iii) the generation of the correct answer in natural language. With so many approaches introduced recently, the real contribution of each component to the ultimate performance of the model has become unclear. The main goal of this paper is to provide a comprehensive analysis of the impact of each component in VQA models. Our extensive set of experiments covers both visual and textual elements, as well as the combination of these representations in the form of fusion and attention mechanisms. Our major contribution is to identify the core components for training VQA models so as to maximize their predictive performance.
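The three-component pipeline described in the abstract can be sketched as a toy illustration in plain Python. All names and feature extractors below are stand-ins invented for the example (a real VQA model would use a CNN for images and an RNN or Transformer for questions), not the paper's implementation:

```python
# Toy sketch of the three VQA components: (i) independent image and
# question representations, (ii) feature fusion, (iii) answer generation.

def encode_image(image_pixels):
    # (i) image representation: a fixed-size feature vector.
    # Stand-in extractor: mean brightness repeated across 4 dimensions.
    mean = sum(image_pixels) / len(image_pixels)
    return [mean] * 4

def encode_question(question, vocab):
    # (i) question representation: bag-of-words over a tiny vocabulary.
    words = question.lower().rstrip("?").split()
    return [1.0 if w in words else 0.0 for w in vocab]

def fuse(img_vec, q_vec):
    # (ii) feature fusion: element-wise product, one common fusion choice.
    return [a * b for a, b in zip(img_vec, q_vec)]

def answer(fused_vec, answer_weights):
    # (iii) answer generation, framed as classification over a fixed
    # answer set -- the usual simplification in VQA models.
    scores = {ans: sum(f * w for f, w in zip(fused_vec, weights))
              for ans, weights in answer_weights.items()}
    return max(scores, key=scores.get)

# Hypothetical usage with a 4-word vocabulary and two candidate answers.
vocab = ["what", "color", "is", "sky"]
weights = {"blue": [1.0, 1.0, 1.0, 1.0], "red": [0.0, 0.0, 0.0, 0.0]}
fused = fuse(encode_image([0.2, 0.4, 0.6]),
             encode_question("What color is sky?", vocab))
print(answer(fused, weights))  # -> blue
```

The element-wise product used for fusion and the fixed answer vocabulary are two of the design choices whose individual impact the paper's component analysis sets out to measure.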
