Paper Title

Hierarchical Self-supervised Representation Learning for Movie Understanding

Authors

Fanyi Xiao, Kaustav Kundu, Joseph Tighe, Davide Modolo

Abstract

Most self-supervised video representation learning approaches focus on action recognition. In contrast, this paper focuses on self-supervised video learning for movie understanding and proposes a novel hierarchical self-supervised pretraining strategy that separately pretrains each level of our hierarchical movie understanding model (based on [37]). Specifically, we pretrain the low-level video backbone with a contrastive learning objective and the higher-level video contextualizer with an event mask prediction task, which allows different data sources to be used for pretraining different levels of the hierarchy. We first show that our self-supervised pretraining strategies are effective and improve performance on all tasks and metrics of the VidSitu benchmark [37] (e.g., raising the CIDEr score for semantic role prediction from 47% to 61%). We further demonstrate the effectiveness of our contextualized event features on LVU tasks [54], both when used alone and when combined with instance features, which shows that the two kinds of features are complementary.
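The two pretraining objectives named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact losses, so everything below is an assumption: an InfoNCE-style contrastive loss stands in for the backbone objective, and zeroing out randomly chosen event features stands in for the input side of event mask prediction. The names `info_nce_loss` and `mask_events`, the temperature, and the mask ratio are hypothetical and are not taken from the paper.

```python
import math
import random


def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style loss (an assumed stand-in for the contrastive objective).

    anchors[i] and positives[i] are two views of the same clip; every other
    positives[j] in the batch serves as a negative for anchors[i].
    """
    def normalize(vector):
        norm = math.sqrt(sum(x * x for x in vector)) or 1.0
        return [x / norm for x in vector]

    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, anchor in enumerate(a):
        # Cosine similarity to every candidate, scaled by the temperature.
        logits = [sum(x * y for x, y in zip(anchor, cand)) / temperature
                  for cand in p]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)


def mask_events(event_features, mask_ratio=0.2, rng=None):
    """Zero out a random subset of event features (assumed masking scheme).

    Returns the masked copy and the sorted indices that a contextualizer
    would then be trained to predict from the surrounding events.
    """
    rng = rng or random.Random(0)
    count = len(event_features)
    num_masked = max(1, round(count * mask_ratio))
    masked_indices = sorted(rng.sample(range(count), num_masked))
    masked = [list(feature) for feature in event_features]
    for index in masked_indices:
        masked[index] = [0.0] * len(event_features[index])
    return masked, masked_indices
```

In this sketch the two functions operate on different inputs (clip-level views versus sequences of event features), which mirrors the abstract's point that each level of the hierarchy can be pretrained on its own data source.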
