Paper Title
Graph Reasoning Transformer for Image Parsing
Paper Authors
Paper Abstract
Capturing long-range dependencies has empirically proven to be effective on a wide range of computer vision tasks. Progressive advances on this topic have been made through the employment of the transformer framework with the help of the multi-head attention mechanism. However, attention-based image patch interaction potentially suffers from redundant interactions among intra-class patches and unoriented interactions among inter-class patches. In this paper, we propose a novel Graph Reasoning Transformer (GReaT) for image parsing that enables image patches to interact following a relation reasoning pattern. Specifically, the linearly embedded image patches are first projected into a graph space, where each node represents the implicit visual center of a cluster of image patches and each edge reflects the relation weight between two adjacent nodes. After that, global relation reasoning is performed on this graph. Finally, all nodes, now carrying the relation information, are mapped back into the original space for subsequent processing. Compared to the conventional transformer, GReaT has higher interaction efficiency and a more purposeful interaction pattern. Experiments are carried out on the challenging Cityscapes and ADE20K datasets. Results show that GReaT achieves consistent performance gains with slight computational overhead over state-of-the-art transformer baselines.
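The abstract describes a three-step pattern: project patch embeddings into a graph space of implicit visual centers, perform relation reasoning over the graph edges, and map the nodes back to patch space. The paper does not give the implementation, so the following is only a minimal numpy sketch under assumptions: the projection is a softmax soft-assignment of patches to nodes, the reasoning step is a single graph convolution with a learnable adjacency, and the names (`graph_reasoning`, `W_proj`, `A`, `W_node`) are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def graph_reasoning(X, W_proj, A, W_node):
    """Hypothetical sketch of one graph-reasoning step.

    X:      (P, C) linearly embedded image patches.
    W_proj: (C, N) projection scoring patches against N node centers (assumed).
    A:      (N, N) edge weights between graph nodes (assumed learnable).
    W_node: (C, C) node feature transform (assumed).
    """
    # 1. Project patches into graph space: soft-assign each patch to a node,
    #    so each node aggregates a cluster of patches (its "visual center").
    B = softmax(X @ W_proj, axis=1)        # (P, N) soft assignment
    V = B.T @ X                            # (N, C) node features
    # 2. Global relation reasoning on the graph: one graph-conv step
    #    mixing node features along the relation weights in A.
    V = np.maximum(A @ V @ W_node, 0.0)    # (N, C), ReLU nonlinearity
    # 3. Map relation-aware nodes back to patch space, fused residually.
    return X + B @ V                       # (P, C)

rng = np.random.default_rng(0)
P, C, N = 16, 8, 4                         # patches, channels, graph nodes
X = rng.standard_normal((P, C))
out = graph_reasoning(X,
                      rng.standard_normal((C, N)),
                      rng.standard_normal((N, N)) * 0.1,
                      rng.standard_normal((C, C)) * 0.1)
print(out.shape)  # (16, 8)
```

Note the efficiency argument from the abstract: with N nodes and P patches (N ≪ P), reasoning costs O(N²·C) instead of the O(P²·C) of full patch-to-patch attention.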