Paper Title

Video Playback Rate Perception for Self-supervised Spatio-Temporal Representation Learning

Authors

Yuan Yao, Chang Liu, Dezhao Luo, Yu Zhou, Qixiang Ye

Abstract

In self-supervised spatio-temporal representation learning, temporal resolution and long-short term characteristics have not yet been fully explored, which limits the representation capability of learned models. In this paper, we propose a novel self-supervised method, referred to as video Playback Rate Perception (PRP), to learn spatio-temporal representations in a simple yet effective way. PRP is rooted in a dilated sampling strategy, which produces self-supervision signals about video playback rates for representation model learning. PRP is implemented with a feature encoder, a classification module, and a reconstruction decoder, to achieve spatio-temporal semantic retention in a collaborative discrimination-generation manner. The discriminative perception model follows the feature encoder and prefers perceiving low temporal resolution and long-term representation by classifying fast-forward rates. The generative perception model acts as a feature decoder and focuses on comprehending high temporal resolution and short-term representation by introducing a motion-attention mechanism. PRP is applied to typical video target tasks, including action recognition and video retrieval. Experiments show that PRP outperforms state-of-the-art self-supervised models by significant margins. Code is available at github.com/yuanyao366/PRP
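The dilated sampling strategy described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the rate set {1, 2, 4, 8}, the 16-frame clip length, and the names `dilated_sample` and `video` are all assumptions made for the example. The idea is that taking every r-th frame simulates playing the video at r-times speed, and the index of the rate serves as the free self-supervision label for the classification module.

```python
import numpy as np

def dilated_sample(video_frames, rate, clip_len=16, start=0):
    """Sample a clip at a given dilation (fast-forward) rate.

    Keeping every `rate`-th frame from `start` emulates playback at
    `rate`x speed; the index of `rate` in the rate set acts as the
    self-supervision label for playback-rate classification.
    """
    indices = start + rate * np.arange(clip_len)
    return video_frames[indices]

# Hypothetical usage: a toy "video" of 128 frame IDs, rate set {1, 2, 4, 8}.
video = np.arange(128)
rates = [1, 2, 4, 8]
clips_and_labels = [(dilated_sample(video, r), label)
                    for label, r in enumerate(rates)]
```

Each `(clip, label)` pair can then be fed to the encoder, with the classification head predicting the label and the reconstruction decoder attempting to recover the discarded intermediate frames.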
