Paper Title
Knowing What, Where and When to Look: Efficient Video Action Modeling with Attention
Paper Authors
Paper Abstract
Attentive video modeling is essential for action recognition in unconstrained videos due to their rich yet redundant information over space and time. However, introducing attention in a deep neural network for action recognition is challenging for two reasons. First, an effective attention module needs to learn what (objects and their local motion patterns), where (spatially), and when (temporally) to focus on. Second, a video attention module must be efficient because existing action recognition models already suffer from high computational cost. To address both challenges, a novel What-Where-When (W3) video attention module is proposed. Departing from existing alternatives, our W3 module models all three facets of video attention jointly. Crucially, it is extremely efficient by factorizing the high-dimensional video feature data into low-dimensional meaningful spaces (a 1D channel vector for 'what' and 2D spatial tensors for 'where'), followed by lightweight temporal attention reasoning. Extensive experiments show that our attention model brings significant improvements to existing action recognition models, achieving new state-of-the-art performance on a number of benchmarks.
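
The factorized design described in the abstract lends itself to a compact implementation. Below is a minimal sketch, assuming PyTorch; the module name W3Attention, the reduction ratio, and the kernel sizes are illustrative assumptions, not the authors' exact architecture.

```python
# Sketch of a factorized "what-where-when" video attention block (assumed
# design): channel ("what") and spatial ("where") attention are computed on
# low-dimensional pooled descriptors, with lightweight temporal ("when")
# reasoning over per-frame channel vectors.
import torch
import torch.nn as nn

class W3Attention(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # "What": channel attention from a spatially pooled 1D descriptor.
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # "Where": spatial attention from channel-pooled 2D maps.
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)
        # "When": lightweight temporal convolution over frame descriptors.
        self.temporal_conv = nn.Conv1d(channels, channels,
                                       kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C, H, W) video feature tensor.
        b, t, c, h, w = x.shape
        frames = x.reshape(b * t, c, h, w)

        # What: 1D channel vector per frame via global average pooling.
        chan = frames.mean(dim=(2, 3))                     # (B*T, C)
        chan_att = torch.sigmoid(self.channel_mlp(chan))   # (B*T, C)

        # When: temporal attention over the sequence of channel descriptors.
        seq = chan.reshape(b, t, c).transpose(1, 2)        # (B, C, T)
        temp_att = torch.sigmoid(self.temporal_conv(seq))  # (B, C, T)
        temp_att = temp_att.transpose(1, 2).reshape(b * t, c)

        # Where: 2D map from mean- and max-pooled channel statistics.
        pooled = torch.cat([frames.mean(1, keepdim=True),
                            frames.amax(1, keepdim=True)], dim=1)
        spat_att = torch.sigmoid(self.spatial_conv(pooled))  # (B*T, 1, H, W)

        # Reweight the input features by all three attention maps.
        out = frames * (chan_att * temp_att).view(b * t, c, 1, 1) * spat_att
        return out.reshape(b, t, c, h, w)

# Usage: the block preserves the input shape, so it can be inserted between
# stages of an existing action recognition backbone.
x = torch.randn(2, 8, 64, 14, 14)   # 2 clips, 8 frames, 64 channels
y = W3Attention(64)(x)              # (2, 8, 64, 14, 14)
```

Because the attention operates only on pooled 1D and 2D descriptors rather than the full 5D feature tensor, the added computation stays small relative to the backbone, which matches the efficiency motivation stated in the abstract.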