Paper Title

Relation Transformer Network

Paper Authors

Rajat Koner, Suprosanna Shit, Volker Tresp

Paper Abstract

The extraction of a scene graph, with objects as nodes and mutual relationships as edges, is the basis for a deep understanding of image content. Despite recent advances such as message passing and joint classification, the detection of visual relationships remains a challenging task due to sub-optimal exploration of the mutual interactions among visual objects. In this work, we propose a novel transformer formulation for scene graph generation and relation prediction. We leverage the encoder-decoder architecture of the transformer for rich feature embedding of nodes and edges. Specifically, we model the node-to-node interaction with the self-attention of the transformer encoder, and the edge-to-node interaction with the cross-attention of the transformer decoder. Further, we introduce a novel positional embedding suitable for handling edges in the decoder. Finally, our relation prediction module classifies the directed relation from the learned node and edge embeddings. We name this architecture the Relation Transformer Network (RTN). On the Visual Genome and GQA datasets, we achieve overall mean improvements of 4.85% and 3.1%, respectively, in comparison with state-of-the-art methods. Our experiments show that the Relation Transformer can efficiently model context across various datasets with small-, medium-, and large-scale relation classification.
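To make the encoder-decoder formulation concrete, below is a minimal PyTorch sketch of the idea described in the abstract: encoder self-attention models node-to-node interaction, decoder cross-attention lets edge queries attend to the node embeddings, and a relation head classifies the directed relation from the learned node and edge embeddings. All hyperparameters, the construction of edge queries from subject-object pairs (standing in for the paper's edge positional embedding), and the concatenation-based classifier head are illustrative assumptions, not the authors' exact design.

```python
# Minimal sketch of an RTN-style model, assuming:
# - pre-extracted per-object visual features as node inputs,
# - edge queries formed by projecting (subject, object) feature pairs
#   (a stand-in for the paper's edge positional embedding),
# - a linear head over concatenated subject/edge/object embeddings.
import torch
import torch.nn as nn


class RelationTransformerSketch(nn.Module):
    def __init__(self, d_model=256, nhead=8, num_layers=4, num_relations=51):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        # Assumed edge query construction: project concatenated
        # (subject, object) node features into one query token per edge.
        self.edge_proj = nn.Linear(2 * d_model, d_model)
        # Directed relation classifier over subject, edge, object embeddings.
        self.rel_head = nn.Linear(3 * d_model, num_relations)

    def forward(self, node_feats, pairs):
        # node_feats: (B, N, d_model) visual object features
        # pairs: (B, E, 2) long tensor of (subject, object) node indices
        nodes = self.encoder(node_feats)  # node-to-node self-attention
        d = nodes.size(-1)
        subj = torch.gather(nodes, 1, pairs[..., 0:1].expand(-1, -1, d))
        obj = torch.gather(nodes, 1, pairs[..., 1:2].expand(-1, -1, d))
        edge_queries = self.edge_proj(torch.cat([subj, obj], dim=-1))
        edges = self.decoder(edge_queries, nodes)  # edge-to-node cross-attention
        return self.rel_head(torch.cat([subj, edges, obj], dim=-1))  # (B, E, R)


# Usage example with random inputs:
model = RelationTransformerSketch()
feats = torch.randn(2, 10, 256)           # 2 images, 10 detected objects each
pairs = torch.randint(0, 10, (2, 20, 2))  # 20 candidate subject-object pairs
logits = model(feats, pairs)              # (2, 20, 51) directed-relation logits
```

Note that ordering the head's input as (subject, edge, object) keeps the prediction directed: swapping subject and object produces a different input and hence a potentially different relation, matching the directed edges of a scene graph.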
