Paper Title
Annotation-Efficient Untrimmed Video Action Recognition
Paper Authors
Paper Abstract
Deep learning has achieved great success in recognizing video actions, but collecting and annotating training data remain laborious, mainly in two respects: (1) the amount of annotated data required is large; (2) temporally annotating the location of each action is time-consuming. Works such as few-shot learning and untrimmed video recognition have been proposed to handle one aspect or the other, but very few existing works can handle both issues simultaneously. In this paper, we target a new problem, Annotation-Efficient Video Recognition, which reduces the annotation requirements for both the large number of samples and the action locations. This problem is challenging in two respects: (1) untrimmed videos provide only weak supervision; (2) video segments not relevant to the current actions of interest (background, BG) may contain actions of interest (foreground, FG) from novel classes, a widespread phenomenon that has rarely been studied in few-shot untrimmed video recognition. To achieve this goal, by analyzing the properties of BG, we categorize BG into informative BG (IBG) and non-informative BG (NBG), and we propose (1) an open-set detection based method to find NBG and FG, (2) a contrastive learning method to learn IBG and distinguish NBG in a self-supervised way, and (3) a self-weighting mechanism to better distinguish IBG from FG. Extensive experiments on ActivityNet v1.2 and ActivityNet v1.3 verify the rationale and effectiveness of the proposed methods.
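To make the second contribution more concrete, below is a minimal, hypothetical Python/PyTorch sketch of what a self-supervised, segment-level contrastive objective with a self-weighting term could look like. It is not the paper's implementation: the function name `segment_contrastive_loss`, the pseudo-label ids for FG/IBG/NBG, the `weights` input, and the temperature value are all illustrative assumptions; the abstract only states that segments are contrasted using pseudo-labels and weighted to better separate IBG from FG.

```python
# Hypothetical sketch (not the paper's method): an InfoNCE-style contrastive
# loss over video-segment embeddings with pseudo-labels {FG, IBG, NBG},
# plus a per-segment self-weighting term that down-weights uncertain segments.
import torch
import torch.nn.functional as F

FG, IBG, NBG = 0, 1, 2  # assumed pseudo-label ids for the three segment types


def segment_contrastive_loss(feats, pseudo_labels, weights, temperature=0.1):
    """feats: (N, D) segment embeddings from a batch of untrimmed videos.
    pseudo_labels: (N,) values in {FG, IBG, NBG}, e.g. from open-set detection.
    weights: (N,) self-weighting scores in [0, 1] per segment (assumed input).
    """
    feats = F.normalize(feats, dim=1)                 # work in cosine-similarity space
    sim = feats @ feats.t() / temperature             # (N, N) similarity logits
    sim.fill_diagonal_(-1e9)                          # exclude self-pairs

    same = pseudo_labels.unsqueeze(0) == pseudo_labels.unsqueeze(1)
    same.fill_diagonal_(False)                        # positives: same pseudo-label, not self

    log_prob = F.log_softmax(sim, dim=1)
    pos_log_prob = log_prob.masked_fill(~same, 0.0)   # keep only positive-pair terms
    pos_cnt = same.sum(dim=1).clamp(min=1)
    per_anchor = -pos_log_prob.sum(dim=1) / pos_cnt   # mean log-likelihood of positives

    valid = same.any(dim=1).float()                   # anchors with at least one positive
    w = weights * valid
    return (per_anchor * w).sum() / w.sum().clamp(min=1e-6)


# Toy usage with random features, pseudo-labels, and weights.
feats = torch.randn(32, 128)
labels = torch.randint(0, 3, (32,))
w = torch.rand(32)
loss = segment_contrastive_loss(feats, labels, w)
```

The design choice illustrated here is that NBG, IBG, and FG segments are pulled toward their own pseudo-class and pushed away from the others, while the weighting term lets ambiguous IBG/FG segments contribute less to the objective; how the actual pseudo-labels and weights are produced in the paper is not specified by the abstract.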