Paper Title

MixTConv: Mixed Temporal Convolutional Kernels for Efficient Action Recognition

Paper Authors

Kaiyu Shan, Yongtao Wang, Zhuoying Wang, Tingting Liang, Zhi Tang, Ying Chen, Yangyan Li

Paper Abstract

To efficiently extract spatiotemporal features of video for action recognition, most state-of-the-art methods integrate 1D temporal convolution into a conventional 2D CNN backbone. However, they all exploit 1D temporal convolution of a fixed kernel size (i.e., 3) in the network building block, and thus have suboptimal temporal modeling capability to handle both long-term and short-term actions. To address this problem, we first investigate the impact of different kernel sizes for the 1D temporal convolutional filters. Then, we propose a simple yet efficient operation called Mixed Temporal Convolution (MixTConv), which consists of multiple depthwise 1D convolutional filters with different kernel sizes. By plugging MixTConv into the conventional 2D CNN backbone ResNet-50, we further propose an efficient and effective network architecture named MSTNet for action recognition, and achieve state-of-the-art results on multiple benchmarks.
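
Since the abstract only describes the MixTConv operation in words, below is a minimal PyTorch sketch of the idea: the input channels are split into groups, and each group is processed by a depthwise 1D convolution that runs only along the temporal axis, using a different kernel size per group. The specific kernel sizes (1, 3, 5, 7), the even channel split, and the module interface shown here are illustrative assumptions, not necessarily the paper's exact configuration.

```python
# A minimal sketch of the MixTConv idea (assumed configuration, not the
# authors' reference implementation): split channels into groups and apply
# a depthwise 1D temporal convolution with a different kernel size per group.
import torch
import torch.nn as nn


class MixTConv(nn.Module):
    def __init__(self, channels, kernel_sizes=(1, 3, 5, 7)):  # kernel sizes are an assumption
        super().__init__()
        # Split channels as evenly as possible across the kernel sizes.
        splits = [channels // len(kernel_sizes)] * len(kernel_sizes)
        splits[0] += channels - sum(splits)
        self.splits = splits
        # One depthwise (groups == group channels) 1D conv per kernel size.
        self.convs = nn.ModuleList(
            nn.Conv1d(c, c, k, padding=k // 2, groups=c, bias=False)
            for c, k in zip(splits, kernel_sizes)
        )

    def forward(self, x):
        # x: [N, C, T, H, W] -> fold the spatial dims into the batch so each
        # convolution operates only along the temporal axis T.
        n, c, t, h, w = x.shape
        x = x.permute(0, 3, 4, 1, 2).reshape(n * h * w, c, t)
        out = torch.cat(
            [conv(chunk) for conv, chunk in zip(self.convs, x.split(self.splits, dim=1))],
            dim=1,
        )
        return out.reshape(n, h, w, c, t).permute(0, 3, 4, 1, 2)


if __name__ == "__main__":
    feat = torch.randn(2, 64, 8, 14, 14)  # batch, channels, frames, height, width
    print(MixTConv(64)(feat).shape)       # torch.Size([2, 64, 8, 14, 14])
```

Per the abstract, such an operation is plugged into the building blocks of a 2D ResNet-50 backbone to form MSTNet; how it is wired into each block is not specified here and would follow the paper.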
