Paper Title

Spatio-Temporal Crop Aggregation for Video Representation Learning

Authors

Sepehr Sameni, Simon Jenni, Paolo Favaro

Abstract

We propose Spatio-temporal Crop Aggregation for video representation LEarning (SCALE), a novel method that enjoys high scalability at both training and inference time. Our model builds long-range video features by learning from sets of video clip-level features extracted with a pre-trained backbone. To train the model, we propose a self-supervised objective consisting of masked clip feature prediction. We apply sparsity to both the input, by extracting a random set of video clips, and to the loss function, by only reconstructing the sparse inputs. Moreover, we use dimensionality reduction by working in the latent space of a pre-trained backbone applied to single video clips. These techniques make our method not only extremely efficient to train but also highly effective in transfer learning. We demonstrate that our video representation yields state-of-the-art performance with linear, non-linear, and KNN probing on common action classification and video understanding datasets.
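The self-supervised objective above (masked clip feature prediction with a sparse reconstruction loss) can be illustrated with a minimal numpy sketch. This is not the paper's implementation: the feature dimensions, the zero-vector mask token, and the single linear layer standing in for the trained aggregator are all illustrative assumptions; only the overall structure (mask a random subset of pre-extracted clip features, reconstruct only the masked positions) follows the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions, not the paper's actual configuration):
# T = number of sampled video clips, D = clip feature dimension.
T, D = 8, 16

# Clip-level features, as if extracted by a frozen pre-trained backbone.
clip_feats = rng.normal(size=(T, D))

# Randomly mask a subset of clip positions.
mask_ratio = 0.5
num_masked = int(T * mask_ratio)
masked_idx = rng.choice(T, size=num_masked, replace=False)

# Replace masked clips with a mask token (a zero vector in this sketch;
# in practice this would be a learnable embedding).
inputs = clip_feats.copy()
inputs[masked_idx] = np.zeros(D)

# A toy linear "aggregator" standing in for the trained set model.
W = rng.normal(scale=0.1, size=(D, D))
pred = inputs @ W

# Sparse loss: reconstruct only the masked clip features,
# mirroring the sparsity applied to the loss function.
loss = np.mean((pred[masked_idx] - clip_feats[masked_idx]) ** 2)
print(loss >= 0.0)
```

Because only the randomly sampled clips are encoded and only the masked positions contribute to the loss, both the forward pass and the objective scale with the number of sampled clips rather than the full video length, which is the source of the training efficiency the abstract claims.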
