Paper Title
Classifying All Interacting Pairs in a Single Shot
Paper Authors
Paper Abstract
In this paper, we introduce a novel human interaction detection approach, based on CALIPSO (Classifying ALl Interacting Pairs in a Single shOt), a classifier of human-object interactions. This new single-shot interaction classifier estimates interactions simultaneously for all human-object pairs, regardless of their number and class. State-of-the-art approaches adopt a multi-shot strategy based on a pairwise estimate of interactions for a set of human-object candidate pairs, which leads to a complexity depending, at least, on the number of interactions or, at most, on the number of candidate pairs. In contrast, the proposed method estimates the interactions on the whole image. Indeed, it simultaneously estimates all interactions between all human subjects and object targets by performing a single forward pass over the image. Consequently, it leads to constant complexity and computation time, independent of the number of subjects, objects or interactions in the image. In detail, interaction classification is achieved on a dense grid of anchors thanks to a joint multi-task network that learns three complementary tasks simultaneously: (i) prediction of the types of interaction, (ii) estimation of the presence of a target and (iii) learning of an embedding which maps an interacting subject and target to the same representation, by using a metric learning strategy. In addition, we introduce an object-centric passive-voice verb estimation which significantly improves results. Evaluations on the two well-known Human-Object Interaction image datasets, V-COCO and HICO-DET, demonstrate the competitiveness of the proposed method (2nd place) compared to the state-of-the-art, while having constant computation time regardless of the number of objects and interactions in the image.
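The third task above, mapping interacting subject and target anchors to the same representation via metric learning, can be illustrated with a simple contrastive-style loss. This is only a minimal NumPy sketch of the general idea: the function name, the margin-based formulation, and all parameters are assumptions for illustration, not the paper's exact loss.

```python
import numpy as np

def pairwise_embedding_loss(subj_emb, targ_emb, interacts, margin=1.0):
    """Illustrative contrastive-style metric-learning loss (hypothetical):
    pull embeddings of interacting subject/target anchor pairs together,
    push non-interacting pairs at least `margin` apart.

    subj_emb, targ_emb : (N, D) embedding arrays for N candidate pairs
    interacts          : (N,) boolean array, True if the pair interacts
    """
    dists = np.linalg.norm(subj_emb - targ_emb, axis=1)
    pos = dists ** 2                            # interacting: minimize distance
    neg = np.maximum(0.0, margin - dists) ** 2  # non-interacting: enforce margin
    return float(np.mean(np.where(interacts, pos, neg)))

# Toy usage: two interacting pairs with identical embeddings give zero loss,
# since subject and target already share the same representation.
s = np.array([[0.0, 0.0], [1.0, 1.0]])
t = np.array([[0.0, 0.0], [1.0, 1.0]])
print(pairwise_embedding_loss(s, t, np.array([True, True])))  # 0.0
```

At inference, such an embedding lets every anchor be scored against every other in one pass, which is what makes the single-shot, constant-complexity pairing possible.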