Paper Title
Fine-Grained Predicates Learning for Scene Graph Generation
Paper Authors
Paper Abstract
The performance of current Scene Graph Generation (SGG) models is severely hampered by some hard-to-distinguish predicates, e.g., "woman-on/standing on/walking on-beach" or "woman-near/looking at/in front of-child". While general SGG models are prone to predicting head predicates and existing re-balancing strategies prefer tail categories, none of them can appropriately handle these hard-to-distinguish predicates. To tackle this issue, inspired by fine-grained image classification, which focuses on differentiating among hard-to-distinguish object classes, we propose a method named Fine-Grained Predicates Learning (FGPL), which aims at differentiating among hard-to-distinguish predicates for the Scene Graph Generation task. Specifically, we first introduce a Predicate Lattice that helps SGG models figure out fine-grained predicate pairs. Then, utilizing the Predicate Lattice, we propose a Category Discriminating Loss and an Entity Discriminating Loss, both of which contribute to distinguishing fine-grained predicates while maintaining learned discriminatory power over recognizable ones. The proposed model-agnostic strategy significantly boosts the performance of three benchmark models (Transformer, VCTree, and Motif) by 22.8%, 24.1%, and 21.7% of Mean Recall (mR@100) on the Predicate Classification sub-task, respectively. Our model also outperforms state-of-the-art methods by a large margin (i.e., 6.1%, 4.6%, and 3.2% of Mean Recall (mR@100)) on the Visual Genome dataset.
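To illustrate the general idea behind a correlation-aware discriminating loss described in the abstract, below is a minimal PyTorch sketch: negative predicates that are highly correlated with the ground-truth predicate (i.e., hard to distinguish, as could be read off a predicate lattice or correlation matrix) keep their gradient, while weakly correlated negatives are down-weighted. The correlation matrix, the weighting rule, the alpha hyperparameter, and the function name category_discriminating_loss are illustrative assumptions for this sketch, not the paper's exact formulation.

# Minimal sketch (assumptions noted above); not the official FGPL implementation.
import torch
import torch.nn.functional as F


def category_discriminating_loss(logits, target, correlation, alpha=2.0):
    """Re-weighted cross-entropy over predicate logits.

    logits:      (B, C) predicate scores from an SGG model
    target:      (B,)   ground-truth predicate indices
    correlation: (C, C) correlation[i, j] in [0, 1], how easily predicate j is
                 confused with predicate i (assumed given, e.g. from a lattice)
    alpha:       sharpening factor for the re-weighting (assumed hyperparameter)
    """
    # Weight each negative class by its correlation with the ground truth:
    # highly correlated (fine-grained) negatives stay close to weight 1,
    # weakly correlated ones are suppressed.
    weights = correlation[target] ** (1.0 / alpha)            # (B, C)
    weights.scatter_(1, target.unsqueeze(1), 1.0)             # ground truth keeps weight 1

    # Fold the weights into the softmax by shifting each class's logit by
    # log(weight), which multiplies its exp-score by the weight.
    shifted = logits + torch.log(weights.clamp_min(1e-8))
    return F.cross_entropy(shifted, target)


if __name__ == "__main__":
    torch.manual_seed(0)
    num_predicates = 5
    logits = torch.randn(3, num_predicates, requires_grad=True)
    target = torch.tensor([0, 2, 4])
    # Toy correlation matrix: predicates 0-2 are mutually confusable.
    correlation = torch.eye(num_predicates)
    correlation[:3, :3] = 0.9
    correlation.fill_diagonal_(1.0)
    loss = category_discriminating_loss(logits, target, correlation)
    loss.backward()
    print("loss:", loss.item())

In this toy run, gradients flow almost unchanged toward the confusable predicates 0-2 when one of them is the label, while unrelated negatives contribute little, which mirrors the abstract's goal of concentrating learning on hard-to-distinguish predicate pairs.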