Paper Title
From Shallow to Deep: Compositional Reasoning over Graphs for Visual Question Answering
Authors
Abstract
In order to achieve a general visual question answering (VQA) system, it is essential to learn to answer deeper questions that require compositional reasoning over the image and external knowledge. Meanwhile, the reasoning process should be explicit and explainable so that the working mechanism of the model can be understood. This is effortless for humans but challenging for machines. In this paper, we propose a Hierarchical Graph Neural Module Network (HGNMN) that reasons over multi-layer graphs with neural modules to address the above issues. Specifically, we first encode the image as multi-layer graphs from the visual, semantic, and commonsense views, since the clues that support the answer may exist in different modalities. Our model consists of several well-designed neural modules that perform specific functions over graphs, which can be composed to conduct multi-step reasoning within and across different graphs. Compared to existing modular networks, we extend visual reasoning from a single graph to multiple graphs. We can explicitly trace the reasoning process according to module weights and graph attention. Experiments show that our model not only achieves state-of-the-art performance on the CRIC dataset but also produces explicit and explainable reasoning procedures.
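The core mechanism the abstract describes, attending over a graph's nodes so that the attention weights expose a traceable reasoning step, can be sketched in a toy form. This is a minimal illustration, not the paper's actual architecture: the graph sizes, feature dimensions, and single dot-product attention module below are assumptions made purely for demonstration.

```python
import math
import random

def softmax(scores):
    """Normalize raw scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(node_feats, query):
    """Toy neural module: soft attention over a graph's node features.

    Returns the attention weights (an inspectable 'trace' over nodes)
    and the attention-weighted graph representation.
    """
    scores = [dot(f, query) for f in node_feats]
    attn = softmax(scores)
    dim = len(query)
    rep = [sum(a * f[i] for a, f in zip(attn, node_feats)) for i in range(dim)]
    return attn, rep

random.seed(0)
dim = 4
# Three toy "graphs" standing in for the visual / semantic / commonsense layers,
# each with 5 nodes of random features (illustrative data only).
graphs = {
    name: [[random.gauss(0, 1) for _ in range(dim)] for _ in range(5)]
    for name in ("visual", "semantic", "commonsense")
}
query = [random.gauss(0, 1) for _ in range(dim)]

# Applying the same module to each graph layer yields per-layer attention
# traces, analogous to how reasoning can be followed across graphs.
traces = {name: attend(feats, query) for name, feats in graphs.items()}
```

The point of the sketch is that the attention weights are explicit by construction: inspecting them per graph layer is what makes a multi-step reasoning chain traceable, in contrast to a monolithic fused representation.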