Capturing and Inferring Dense Full-Body Human-Scene Contact

Paper Title

Capturing and Inferring Dense Full-Body Human-Scene Contact

Authors

Chun-Hao P. Huang, Hongwei Yi, Markus Höschle, Matvey Safroshkin, Tsvetelina Alexiadis, Senya Polikovsky, Daniel Scharstein, Michael J. Black

Abstract

Inferring human-scene contact (HSC) is the first step toward understanding how humans interact with their surroundings. While detecting 2D human-object interaction (HOI) and reconstructing 3D human pose and shape (HPS) have enjoyed significant progress, reasoning about 3D human-scene contact from a single image is still challenging. Existing HSC detection methods consider only a few types of predefined contact, often reduce body and scene to a small number of primitives, and even overlook image evidence. To predict human-scene contact from a single image, we address the limitations above from both data and algorithmic perspectives. We capture a new dataset called RICH for "Real scenes, Interaction, Contact and Humans." RICH contains multiview outdoor/indoor video sequences at 4K resolution, ground-truth 3D human bodies captured using markerless motion capture, 3D body scans, and high resolution 3D scene scans. A key feature of RICH is that it also contains accurate vertex-level contact labels on the body. Using RICH, we train a network that predicts dense body-scene contacts from a single RGB image. Our key insight is that regions in contact are always occluded so the network needs the ability to explore the whole image for evidence. We use a transformer to learn such non-local relationships and propose a new Body-Scene contact TRansfOrmer (BSTRO). Very few methods explore 3D contact; those that do focus on the feet only, detect foot contact as a post-processing step, or infer contact from body pose without looking at the scene. To our knowledge, BSTRO is the first method to directly estimate 3D body-scene contact from a single image. We demonstrate that BSTRO significantly outperforms the prior art. The code and dataset are available at https://rich.is.tue.mpg.de.
