Paper Title

Adaptive Perception Transformer for Temporal Action Localization

Authors

Yizheng Ouyang, Tianjin Zhang, Weibo Gu, Hongfa Wang

Abstract

Temporal action localization aims to predict the boundary and category of each action instance in untrimmed long videos. Most previous anchor- or proposal-based methods neglect the global-local context interaction across entire video sequences. Moreover, their multi-stage designs cannot generate action boundaries and categories directly. To address these issues, this paper proposes an end-to-end model called the Adaptive Perception Transformer (AdaPerFormer for short). Specifically, AdaPerFormer explores a dual-branch attention mechanism. One branch handles global perception attention, which models entire video sequences and aggregates globally relevant contexts, while the other branch concentrates on a local convolutional shift that aggregates intra-frame and inter-frame information through our bidirectional shift operation. The end-to-end design produces the boundaries and categories of video actions without extra steps. Extensive experiments together with ablation studies demonstrate the effectiveness of our design. Our method obtains competitive performance on the THUMOS14 and ActivityNet-1.3 datasets.
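The abstract gives no implementation details, so the following is only an illustrative PyTorch sketch of how a dual-branch block of this kind could be wired up, assuming a TSM-style temporal channel shift for the bidirectional shift operation. The class name `DualBranchBlock`, the parameter `shift_ratio`, and the residual fusion are all hypothetical, not the paper's actual code.

```python
# Hypothetical sketch of a dual-branch block in the spirit of AdaPerFormer.
# Assumption: the "bidirectional shift" moves one slice of channels forward in
# time and another slice backward, TSM-style, before a local convolution.
import torch
import torch.nn as nn

class DualBranchBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4, shift_ratio: float = 0.25):
        super().__init__()
        # Global branch: multi-head self-attention over the whole frame sequence.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Local branch: depthwise temporal convolution applied after the shift.
        self.local_conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.shift_channels = int(dim * shift_ratio)

    def bidirectional_shift(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels). One channel slice is shifted forward in
        # time, another backward, mixing inter-frame information (an assumption).
        out = x.clone()
        c = self.shift_channels
        out[:, 1:, :c] = x[:, :-1, :c]            # shift forward in time
        out[:, :-1, c:2 * c] = x[:, 1:, c:2 * c]  # shift backward in time
        return out

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        global_ctx, _ = self.attn(h, h, h)  # global perception branch
        local_ctx = self.local_conv(
            self.bidirectional_shift(h).transpose(1, 2)
        ).transpose(1, 2)                   # local convolutional shift branch
        return x + global_ctx + local_ctx   # residual fusion (assumed)

# Usage on dummy features: 2 videos, 128 frames, 256-dim features per frame.
feats = torch.randn(2, 128, 256)
block = DualBranchBlock(dim=256)
print(block(feats).shape)  # torch.Size([2, 128, 256])
```

In this sketch the depthwise convolution keeps the local branch cheap, while the shift mixes neighboring frames' features before the convolution is applied, so the two branches supply global and local context respectively.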
