Hierarchical Self-supervised Representation Learning for Movie Understanding

Paper Title

Hierarchical Self-supervised Representation Learning for Movie Understanding

Authors

Fanyi Xiao, Kaustav Kundu, Joseph Tighe, Davide Modolo

Abstract

Most self-supervised video representation learning approaches focus on action recognition. In contrast, in this paper we focus on self-supervised video learning for movie understanding and propose a novel hierarchical self-supervised pretraining strategy that separately pretrains each level of our hierarchical movie understanding model (based on [37]). Specifically, we propose to pretrain the low-level video backbone using a contrastive learning objective, while pretraining the higher-level video contextualizer using an event mask prediction task. Because the two levels are pretrained separately, different data sources can be used for different levels of the hierarchy. We first show that our self-supervised pretraining strategies are effective and lead to improved performance on all tasks and metrics on the VidSitu benchmark [37] (e.g., improving semantic role prediction from 47% to 61% CIDEr). We further demonstrate the effectiveness of our contextualized event features on LVU tasks [54], both when used alone and when combined with instance features, showing their complementarity.
