Paper Title
Comparison of Spatiotemporal Networks for Learning Video Related Tasks
Paper Authors
Paper Abstract
Many methods for learning from video sequences involve temporally processing 2D CNN features from the individual frames or directly utilizing 3D convolutions within high-performing 2D CNN architectures. The focus typically remains on how to incorporate the temporal processing within an already stable spatial architecture. This work constructs an MNIST-based video dataset with parameters controlling relevant facets of common video-related tasks: classification, ordering, and speed estimation. Models trained on this dataset are shown to differ in key ways depending on the task and their use of 2D convolutions, 3D convolutions, or convolutional LSTMs. An empirical analysis indicates a complex, interdependent relationship between the spatial and temporal dimensions, with design choices having a large impact on a network's ability to learn the appropriate spatiotemporal features.
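To make the dataset construction described in the abstract concrete, the sketch below shows one way such clips could be generated. This is a hypothetical illustration, not the authors' released code: the function name `make_digit_video`, the 64-pixel canvas, and the bouncing-motion parameterization are assumptions made for this example. A single MNIST digit is slid across a blank canvas so that the digit class, the temporal order of the frames, and the per-frame displacement each provide a supervision signal for the classification, ordering, and speed-estimation tasks mentioned in the abstract.

```python
import numpy as np


def make_digit_video(digit_img, num_frames=16, speed=2.0, canvas=64,
                     reverse=False, rng=None):
    """Slide one 28x28 digit across a square canvas.

    digit_img  : (28, 28) array holding a single MNIST digit.
    num_frames : temporal length of the clip.
    speed      : pixels travelled per frame (speed-estimation target).
    canvas     : side length of each output frame.
    reverse    : if True, return the frames in reversed temporal order
                 (a simple target for the ordering task).
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = digit_img.shape
    frames = np.zeros((num_frames, canvas, canvas), dtype=digit_img.dtype)

    # Random starting position and a random direction of motion.
    x = float(rng.integers(0, canvas - w))
    y = float(rng.integers(0, canvas - h))
    angle = rng.uniform(0.0, 2.0 * np.pi)
    dx, dy = np.cos(angle), np.sin(angle)

    for t in range(num_frames):
        xi, yi = int(round(x)), int(round(y))
        frames[t, yi:yi + h, xi:xi + w] = digit_img
        x += speed * dx
        y += speed * dy
        # Bounce off the borders so the digit never leaves the frame.
        if x < 0 or x > canvas - w:
            dx, x = -dx, float(np.clip(x, 0, canvas - w))
        if y < 0 or y > canvas - h:
            dy, y = -dy, float(np.clip(y, 0, canvas - h))

    return frames[::-1] if reverse else frames


# Example usage: a blank placeholder stands in for a real MNIST digit.
clip = make_digit_video(np.ones((28, 28), dtype=np.float32), speed=3.0)
print(clip.shape)  # (16, 64, 64)
```

Sweeping `speed`, pairing forward and reversed clips, and varying the digit identity would then yield labels for the three tasks, and such clips could be fed to 2D-convolutional, 3D-convolutional, or convolutional-LSTM models for the comparison the abstract describes.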