Paper Title

Enabling Weakly-Supervised Temporal Action Localization from On-Device Learning of the Video Stream

Paper Authors

Yue Tang, Yawen Wu, Peipei Zhou, Jingtong Hu

Paper Abstract

Detecting actions in videos has been widely applied in on-device applications. Practical on-device videos are typically untrimmed, containing both actions and background. It is desirable for a model to both recognize the class of an action and localize the temporal position where the action happens. Such a task is called temporal action localization (TAL), and TAL models are conventionally trained in the cloud, where multiple untrimmed videos are collected and labeled. It is desirable for a TAL model to continuously and locally learn from new data, which can directly improve action detection precision while protecting customers' privacy. However, training a TAL model is non-trivial, since it requires a tremendous number of video samples with temporal annotations, and annotating videos frame by frame is exorbitantly time-consuming and expensive. Although weakly-supervised TAL (W-TAL) has been proposed to learn from untrimmed videos with only video-level labels, such an approach is still not suitable for on-device learning scenarios. In practical on-device learning applications, data are collected as a stream. Dividing such a long video stream into multiple video segments requires substantial human effort, which hinders the application of TAL to realistic on-device learning. To enable W-TAL models to learn from a long, untrimmed streaming video, we propose an efficient video learning approach that can directly adapt to new environments. We first propose a self-adaptive video dividing approach with a contrast score-based segment merging method to convert the video stream into multiple segments. Then, we explore different sampling strategies for the TAL task to request as few labels as possible. To the best of our knowledge, this is the first attempt to learn directly from an on-device, long video stream.
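The abstract names two concrete mechanisms: converting the stream into segments by merging adjacent clips whose contrast score is low, and sampling which segments to request video-level labels for. The paper does not give the implementation here, so below is a minimal sketch under stated assumptions: clip-level features are already extracted, "contrast" is taken as cosine distance between mean clip features, and label requests use entropy-based uncertainty sampling. All function names, the threshold, and the budget are illustrative, not the authors' method.

```python
# Hedged sketch of (1) contrast score-based segment merging and
# (2) label-efficient segment sampling. Assumptions (not from the paper):
# clips arrive as (T, D) feature arrays; contrast = cosine distance of
# mean features; labels are requested for the highest-entropy segments.
import numpy as np

def contrast(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine distance between the mean feature vectors of two segments."""
    ma, mb = a.mean(axis=0), b.mean(axis=0)
    cos = ma @ mb / (np.linalg.norm(ma) * np.linalg.norm(mb) + 1e-8)
    return 1.0 - cos

def merge_segments(clips: list, thresh: float = 0.3) -> list:
    """Greedily merge temporally adjacent clips whose contrast score falls
    below `thresh`, turning a fixed split into self-adaptive segments."""
    segments = [clips[0]]
    for clip in clips[1:]:
        if contrast(segments[-1], clip) < thresh:
            segments[-1] = np.vstack([segments[-1], clip])  # similar content: merge
        else:
            segments.append(clip)  # high contrast: start a new segment
    return segments

def select_for_labeling(seg_probs: list, budget: int) -> list:
    """Uncertainty sampling: request video-level labels for the `budget`
    segments whose predicted class distributions have the highest entropy."""
    entropies = [-(p * np.log(p + 1e-8)).sum() for p in seg_probs]
    return sorted(np.argsort(entropies)[-budget:].tolist())

# Example: a 1-hour stream pre-split into 5-second clips of (T, D) features.
clips = [np.random.rand(16, 512) for _ in range(720)]
segments = merge_segments(clips)
probs = [np.random.dirichlet(np.ones(20)) for _ in segments]  # stand-in scores
to_label = select_for_labeling(probs, budget=8)
```

The greedy adjacent-merge is one plausible reading of "segment merging" for streaming data, since it needs only the previous segment in memory; the actual paper may define contrast scores and sampling strategies differently.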
