Paper Title

Context-Aware Sequence Alignment using 4D Skeletal Augmentation

Paper Authors

Taein Kwon, Bugra Tekin, Siyu Tang, Marc Pollefeys

Paper Abstract

Temporal alignment of fine-grained human actions in videos is important for numerous applications in computer vision, robotics, and mixed reality. State-of-the-art methods directly learn an image-based embedding space by leveraging powerful deep convolutional neural networks. While straightforward, their results are far from satisfactory: without additional post-processing steps, the aligned videos exhibit severe temporal discontinuity. Recent advances in human body and hand pose estimation in the wild promise new ways of addressing the task of human action alignment in videos. In this work, building on off-the-shelf human pose estimators, we propose a novel context-aware self-supervised learning architecture to align sequences of actions. We name it CASA. Specifically, CASA employs self-attention and cross-attention mechanisms to incorporate the spatial and temporal context of human actions, which resolves the temporal discontinuity problem. Moreover, we introduce a self-supervised learning scheme that is empowered by novel 4D augmentation techniques for 3D skeleton representations. We systematically evaluate the key components of our method. Our experiments on three public datasets demonstrate that CASA significantly improves phase progress and Kendall's Tau scores over the previous state-of-the-art methods.
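The abstract's "4D augmentation" refers to perturbing a 3D skeleton sequence jointly in space and time. The sketch below is only an illustration of that general idea, not the authors' implementation: the (frames, joints, xyz) layout, the rotation about the vertical axis, the scale range, and the linear time warp are assumptions made here for clarity.

```python
# Minimal sketch (assumed, not CASA's actual code) of augmenting a 3D skeleton
# sequence in space (rotation + scale) and time (resampling), i.e. "4D".
# Assumed input shape: (T, J, 3) = (frames, joints, xyz coordinates).
import numpy as np

def random_rotation_z(max_angle_rad=np.pi):
    """Random rotation matrix about the vertical (z) axis."""
    a = np.random.uniform(-max_angle_rad, max_angle_rad)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

def time_warp(seq, num_out_frames):
    """Linearly resample the sequence along time to num_out_frames frames."""
    t_in = np.linspace(0.0, 1.0, seq.shape[0])
    t_out = np.linspace(0.0, 1.0, num_out_frames)
    flat = seq.reshape(seq.shape[0], -1)  # (T, J*3): interpolate each coordinate over time
    warped = np.stack(
        [np.interp(t_out, t_in, flat[:, d]) for d in range(flat.shape[1])], axis=1
    )
    return warped.reshape(num_out_frames, seq.shape[1], seq.shape[2])

def augment_4d(skeleton_seq, scale_range=(0.9, 1.1), warp_range=(0.8, 1.2)):
    """Apply a spatial (rotation + scale) and a temporal (resampling) augmentation."""
    T = skeleton_seq.shape[0]
    R = random_rotation_z()
    scale = np.random.uniform(*scale_range)
    spatial = (scale * skeleton_seq) @ R.T          # rotate and scale every joint
    new_T = max(2, int(round(T * np.random.uniform(*warp_range))))
    return time_warp(spatial, new_T)                # stretch or compress in time

if __name__ == "__main__":
    seq = np.random.randn(40, 21, 3)                # e.g. 40 frames of a 21-joint hand
    aug = augment_4d(seq)
    print(seq.shape, "->", aug.shape)
```

In a self-supervised scheme like the one described in the abstract, such spatially and temporally perturbed copies of a skeleton sequence would presumably serve as training pairs for learning the alignment embedding; the exact augmentation set and parameters used in the paper may differ.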
