Paper Title

Video Transformers: A Survey

Authors

Selva, Javier, Johansen, Anders S., Escalera, Sergio, Nasrollahi, Kamal, Moeslund, Thomas B., Clapés, Albert

Abstract

Transformer models have shown great success handling long-range interactions, making them a promising tool for modeling video. However, they lack inductive biases and scale quadratically with input length. These limitations are further exacerbated when dealing with the high dimensionality introduced by the temporal dimension. While there are surveys analyzing the advances of Transformers for vision, none focus on an in-depth analysis of video-specific designs. In this survey, we analyze the main contributions and trends of works leveraging Transformers to model video. Specifically, we delve into how videos are handled at the input level first. Then, we study the architectural changes made to deal with video more efficiently, reduce redundancy, re-introduce useful inductive biases, and capture long-term temporal dynamics. In addition, we provide an overview of different training regimes and explore effective self-supervised learning strategies for video. Finally, we conduct a performance comparison on the most common benchmark for Video Transformers (i.e., action classification), finding them to outperform 3D ConvNets even with less computational complexity.
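The quadratic scaling mentioned in the abstract comes from the self-attention mechanism itself: every token attends to every other token, so the attention matrix has one entry per pair of tokens. The following is a minimal NumPy sketch (single head, no learned projections, illustrative names) that makes this cost visible; for video, the token count n grows with both spatial patches and frames, which is why the surveyed works target this bottleneck.

```python
import numpy as np

def self_attention(x):
    """Naive scaled dot-product self-attention (single head, no projections).

    The attention matrix is (n, n) for n input tokens, so memory and
    compute grow quadratically with sequence length -- the bottleneck
    the survey highlights for video, where n is large.
    """
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)                    # (n, n): quadratic in n
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ x, weights

# Doubling the token count quadruples the attention matrix.
_, w8 = self_attention(np.random.randn(8, 4))
_, w16 = self_attention(np.random.randn(16, 4))
print(w8.shape, w16.shape)  # (8, 8) (16, 16)
```

Doubling the frame count (and hence the token count) quadruples the attention matrix, which motivates the efficiency-oriented architectural changes the survey reviews.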
