Paper Title

Knowledge Prompting for Few-shot Action Recognition

Paper Authors

Yuheng Shi, Xinxiao Wu, Hanxi Lin

Paper Abstract

Few-shot action recognition in videos is challenging for its lack of supervision and difficulty in generalizing to unseen actions. To address this task, we propose a simple yet effective method, called knowledge prompting, which leverages commonsense knowledge of actions from external resources to prompt a powerful pre-trained vision-language model for few-shot classification. We first collect large-scale language descriptions of actions, defined as text proposals, to build an action knowledge base. The collection of text proposals is done by filling in handcrafted sentence templates with external action-related corpora or by extracting action-related phrases from captions of Web instruction videos. Then we feed these text proposals into the pre-trained vision-language model along with video frames to generate matching scores of the proposals to each frame, and the scores can be treated as action semantics with strong generalization. Finally, we design a lightweight temporal modeling network to capture the temporal evolution of action semantics for classification. Extensive experiments on six benchmark datasets demonstrate that our method generally achieves state-of-the-art performance while reducing the training overhead to 0.001 of that of existing methods.
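As a rough illustration of the pipeline the abstract describes, the sketch below assumes a frozen CLIP-style vision-language model that yields L2-normalized frame and text-proposal embeddings; the class name KnowledgePromptingSketch, the feature dimensions, and the small temporal convolution are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: match text proposals to video frames with frozen
# vision-language embeddings, then classify from the temporal evolution
# of the per-frame matching scores. All shapes/modules are assumptions.
import torch
import torch.nn as nn

class KnowledgePromptingSketch(nn.Module):
    def __init__(self, num_proposals: int, num_classes: int, hidden: int = 256):
        super().__init__()
        # Lightweight temporal modeling over per-frame matching-score vectors
        # (one score per text proposal); here a single 1D temporal convolution.
        self.temporal = nn.Conv1d(num_proposals, hidden, kernel_size=3, padding=1)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, frame_emb: torch.Tensor, proposal_emb: torch.Tensor):
        # frame_emb:    (T, D) frozen frame features from the VL model
        # proposal_emb: (P, D) frozen embeddings of the text proposals
        frame_emb = nn.functional.normalize(frame_emb, dim=-1)
        proposal_emb = nn.functional.normalize(proposal_emb, dim=-1)
        # Matching score of every proposal to every frame: (T, P)
        scores = frame_emb @ proposal_emb.t()
        # Capture temporal evolution of the scores: (1, P, T) -> (1, hidden, T)
        h = self.temporal(scores.t().unsqueeze(0)).mean(dim=-1)
        return self.classifier(h)  # (1, num_classes) logits

# Toy usage with random tensors standing in for CLIP-style features.
T, P, D, C = 8, 100, 512, 5          # frames, proposals, embed dim, classes
model = KnowledgePromptingSketch(P, C)
logits = model(torch.randn(T, D), torch.randn(P, D))
print(logits.shape)                  # torch.Size([1, 5])
```

Only the temporal network and classifier would be trained in such a setup, which is consistent with the very small training overhead reported in the abstract.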
