Paper Title
In the Eye of the Beholder: Gaze and Actions in First Person Video
Paper Authors
Paper Abstract
We address the task of jointly determining what a person is doing and where they are looking, based on the analysis of video captured by a head-worn camera. To facilitate our research, we first introduce the EGTEA Gaze+ dataset. Our dataset comes with videos, gaze tracking data, hand masks and action annotations, thereby providing the most comprehensive benchmark for First Person Vision (FPV). Moving beyond the dataset, we propose a novel deep model for joint gaze estimation and action recognition in FPV. Our method describes the participant's gaze as a probabilistic variable and models its distribution using stochastic units in a deep network. We further sample from these stochastic units, generating an attention map to guide the aggregation of visual features for action recognition. Our method is evaluated on our EGTEA Gaze+ dataset and achieves a performance level that exceeds the state-of-the-art by a significant margin. More importantly, we demonstrate that our model can be applied to a larger-scale FPV dataset, EPIC-Kitchens, even without using gaze, offering new state-of-the-art results on FPV action recognition.
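To make the mechanism described in the abstract concrete, the sketch below illustrates one way to model gaze as a probabilistic spatial variable, draw a differentiable sample from it, and use that sample as an attention map for feature pooling. This is a minimal PyTorch illustration under assumptions of our own, not the authors' released code: the class name `StochasticGazeAttention`, the Gumbel-Softmax relaxation, and the feature shapes are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class StochasticGazeAttention(nn.Module):
    """Sketch: predict a spatial gaze distribution over feature locations,
    sample an attention map from it, and pool features for action recognition."""

    def __init__(self, in_channels: int, num_actions: int, tau: float = 1.0):
        super().__init__()
        # 1x1 conv producing a gaze logit per spatial location (H x W).
        self.gaze_head = nn.Conv2d(in_channels, 1, kernel_size=1)
        self.classifier = nn.Linear(in_channels, num_actions)
        self.tau = tau  # temperature of the relaxed sampling

    def forward(self, feat: torch.Tensor):
        # feat: (B, C, H, W) visual features from some video backbone (assumed).
        B, C, H, W = feat.shape
        logits = self.gaze_head(feat).view(B, H * W)

        # Reparameterized sample from the gaze distribution (Gumbel-Softmax),
        # which keeps the stochastic unit differentiable during training.
        attn = F.gumbel_softmax(logits, tau=self.tau, hard=False)  # (B, H*W), sums to 1
        attn = attn.view(B, 1, H, W)

        # Attention-weighted aggregation of features, then action classification.
        pooled = (feat * attn).sum(dim=(2, 3))  # (B, C)
        return self.classifier(pooled), attn


# Hypothetical usage: feat = backbone(clip)  ->  (B, C, H, W)
# action_logits, gaze_map = StochasticGazeAttention(C, num_actions)(feat)
```

The sampled map `attn` plays a dual role in this sketch: it serves as the gaze estimate (supervisable against recorded gaze when available, as in EGTEA Gaze+) and as the attention that gates feature aggregation, which is why the same model can still be trained for action recognition on datasets without gaze labels such as EPIC-Kitchens.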