Paper Title
Exploring Long Tail Visual Relationship Recognition with Large Vocabulary
Paper Authors
Paper Abstract
Several approaches have been proposed in recent literature to alleviate the long-tail problem, mainly in object classification tasks. In this paper, we present the first large-scale study of the task of Long-Tail Visual Relationship Recognition (LTVRR). LTVRR aims at improving the learning of structured visual relationships that come from the long tail (e.g., "rabbit grazing on grass"). In this setup, the subject, relation, and object classes each follow a long-tail distribution. To begin our study and establish a future benchmark for the community, we introduce two LTVRR-related benchmarks, dubbed VG8K-LT and GQA-LT, built upon the widely used Visual Genome and GQA datasets. We use these benchmarks to study the performance of several state-of-the-art long-tail models on the LTVRR setup. Lastly, we propose a visiolinguistic hubless (VilHub) loss and a Mixup augmentation technique adapted to the LTVRR setup, dubbed RelMix. Both VilHub and RelMix can be easily integrated on top of existing models, and despite being simple, our results show that they can remarkably improve performance, especially on tail classes. Benchmarks, code, and models have been made available at: https://github.com/Vision-CAIR/LTVRR.
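To make the two named components concrete, below is a minimal, hypothetical PyTorch sketch: generic Mixup (mixing pairs of examples and their labels) applied to fused subject-object relation features, plus a generic hubness-reducing regularizer that pushes the batch-average prediction toward uniform over classes. The function names, feature shapes, and exact formulations are illustrative assumptions; the abstract does not specify the paper's actual RelMix or VilHub definitions, for which see the linked repository.

```python
# Hypothetical sketch only: generic Mixup over relation features and a
# generic hubness-reducing penalty. Not the paper's exact RelMix/VilHub.
import torch
import torch.nn.functional as F

def mixup_relation_batch(feats, rel_labels, num_classes, alpha=0.2):
    """Mix random pairs of relation features and their one-hot labels.

    feats:      (B, D) fused subject-object features (assumed input format)
    rel_labels: (B,) relation class indices
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(feats.size(0))
    mixed_feats = lam * feats + (1.0 - lam) * feats[perm]
    y = F.one_hot(rel_labels, num_classes).float()
    mixed_labels = lam * y + (1.0 - lam) * y[perm]
    return mixed_feats, mixed_labels

def mixup_loss(logits, mixed_labels):
    # Cross-entropy against the soft, mixed label distribution.
    return -(mixed_labels * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

def hubless_penalty(logits):
    # Push the batch-average predicted distribution toward uniform,
    # discouraging "hub" (head) classes from dominating predictions.
    p_mean = F.softmax(logits, dim=-1).mean(dim=0)            # (C,)
    uniform = torch.full_like(p_mean, 1.0 / p_mean.numel())
    return ((p_mean - uniform) ** 2).sum()
```

In a training loop, such a penalty would typically be added to the classification loss with a small weight (e.g., `total = mixup_loss(logits, mixed_labels) + lam_hub * hubless_penalty(logits)`), the usual pattern for plugging a regularizer on top of an existing model.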