Paper Title

In Defense of Image Pre-Training for Spatiotemporal Recognition

Paper Authors

Xianhang Li, Huiyu Wang, Chen Wei, Jieru Mei, Alan Yuille, Yuyin Zhou, Cihang Xie

Paper Abstract

Image pre-training, the current de-facto paradigm for a wide range of visual tasks, is generally less favored in the field of video recognition. By contrast, a common strategy is to train spatiotemporal convolutional neural networks (CNNs) directly from scratch. Nonetheless, interestingly, by taking a closer look at these from-scratch learned CNNs, we note that certain 3D kernels exhibit much stronger appearance modeling ability than others, arguably suggesting that appearance information is already well disentangled during learning. Inspired by this observation, we hypothesize that the key to effectively leveraging image pre-training lies in decomposing the learning of spatial and temporal features, and in revisiting image pre-training as the appearance prior for initializing 3D kernels. In addition, we propose Spatial-Temporal Separable (STS) convolution, which explicitly splits the feature channels into spatial and temporal groups, to further enable a more thorough decomposition of spatiotemporal features when fine-tuning 3D CNNs. Our experiments show that simply replacing 3D convolution with STS notably improves a wide range of 3D CNNs on both Kinetics-400 and Something-Something V2, without increasing parameters or computation. Moreover, this new training pipeline consistently achieves better results on video recognition with a significant speedup. For instance, we achieve a +0.6% top-1 accuracy gain for SlowFast on Kinetics-400 over the strong 256-epoch, 128-GPU baseline while fine-tuning for only 50 epochs on 4 GPUs. The code and models are available at https://github.com/UCSC-VLAA/Image-Pretraining-for-Video.
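
The abstract only states that STS convolution splits the feature channels into a spatial group and a temporal group; the concrete design lives in the linked repository. The following is a minimal PyTorch sketch of that idea under assumptions of our own: a 50/50 channel split, spatial-only 1x3x3 kernels for one group, temporal-only 3x1x1 kernels for the other, and fusion by channel concatenation. The module name STSConv3d and these choices are illustrative, not the authors' implementation.

```python
# Hypothetical sketch of a Spatial-Temporal Separable (STS) convolution block.
# Assumptions (not from the paper): equal channel split, 1x3x3 / 3x1x1 kernels,
# and concatenation as the fusion step.
import torch
import torch.nn as nn


class STSConv3d(nn.Module):
    def __init__(self, in_channels, out_channels, spatial_ratio=0.5):
        super().__init__()
        # Split input/output channels into a spatial group and a temporal group.
        self.spatial_ch = int(in_channels * spatial_ratio)
        self.temporal_ch = in_channels - self.spatial_ch
        out_spatial = int(out_channels * spatial_ratio)
        out_temporal = out_channels - out_spatial
        # Spatial group: 1x3x3 kernels model appearance only.
        self.spatial_conv = nn.Conv3d(
            self.spatial_ch, out_spatial,
            kernel_size=(1, 3, 3), padding=(0, 1, 1), bias=False)
        # Temporal group: 3x1x1 kernels model motion only.
        self.temporal_conv = nn.Conv3d(
            self.temporal_ch, out_temporal,
            kernel_size=(3, 1, 1), padding=(1, 0, 0), bias=False)

    def forward(self, x):
        # x: (batch, channels, time, height, width)
        xs, xt = torch.split(x, [self.spatial_ch, self.temporal_ch], dim=1)
        return torch.cat([self.spatial_conv(xs), self.temporal_conv(xt)], dim=1)


if __name__ == "__main__":
    block = STSConv3d(64, 64)
    clip = torch.randn(2, 64, 8, 56, 56)  # two 8-frame clips, 56x56 resolution
    print(block(clip).shape)  # torch.Size([2, 64, 8, 56, 56])
```

Because the spatial branch uses purely 2D (1x3x3) kernels, its weights could in principle be initialized directly from image pre-trained 2D filters, which is consistent with the paper's view of image pre-training as an appearance prior.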
