Paper Title

Weakly Supervised 3D Multi-person Pose Estimation for Large-scale Scenes based on Monocular Camera and Single LiDAR

Authors

Peishan Cong, Yiteng Xu, Yiming Ren, Juze Zhang, Lan Xu, Jingya Wang, Jingyi Yu, Yuexin Ma

Abstract

Depth estimation is usually ill-posed and ambiguous for monocular camera-based 3D multi-person pose estimation. Since LiDAR can capture accurate depth information in long-range scenes, it can benefit both the global localization of individuals and the 3D pose estimation by providing rich geometric features. Motivated by this, we propose a monocular camera and single LiDAR-based method for 3D multi-person pose estimation in large-scale scenes, which is easy to deploy and insensitive to light. Specifically, we design an effective fusion strategy to take advantage of multi-modal input data, including images and point clouds, and make full use of temporal information to guide the network to learn natural and coherent human motions. Without relying on any 3D pose annotations, our method exploits the inherent geometric constraints of point clouds for self-supervision and utilizes 2D keypoints on images for weak supervision. Extensive experiments on public datasets and our newly collected dataset demonstrate the superiority and generalization capability of our proposed method.
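
The weak- and self-supervision scheme summarized in the abstract can be illustrated with a minimal PyTorch sketch: detected 2D keypoints supervise the camera projection of the predicted 3D joints, while a one-sided Chamfer-style term ties the joints to the observed LiDAR points. This is a sketch under assumed loss forms, not the authors' implementation; all function names, the loss weights, and the per-person point segmentation are hypothetical.

# Illustrative sketch (NOT the paper's implementation) of the supervision
# described in the abstract: 2D keypoints provide weak supervision, the raw
# LiDAR point cloud provides a geometric self-supervision signal.
import torch

def reprojection_loss(joints_3d, keypoints_2d, K, conf):
    """Weak supervision: project predicted 3D joints (camera frame) with
    intrinsics K and compare against detected 2D keypoints.
    joints_3d: (J, 3), keypoints_2d: (J, 2), conf: (J,) detector confidence."""
    proj = (K @ joints_3d.T).T                      # (J, 3) homogeneous pixels
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)  # perspective divide
    return (conf * (uv - keypoints_2d).norm(dim=-1)).mean()

def point_cloud_loss(joints_3d, points):
    """Self-supervision: one-sided Chamfer-style term encouraging each
    predicted joint to lie near the observed LiDAR points of the person.
    joints_3d: (J, 3), points: (N, 3) segmented points for one person."""
    d = torch.cdist(joints_3d, points)              # (J, N) pairwise distances
    return d.min(dim=1).values.mean()

def total_loss(joints_3d, keypoints_2d, K, conf, points,
               w_2d=1.0, w_pc=0.1):                 # weights are assumptions
    """Combined weakly supervised objective; no 3D pose labels needed."""
    return (w_2d * reprojection_loss(joints_3d, keypoints_2d, K, conf)
            + w_pc * point_cloud_loss(joints_3d, points))

# Example with random data: 15 joints, 500 LiDAR points, a person ~5 m away.
J3 = torch.randn(15, 3) * 0.3 + torch.tensor([0., 0., 5.])
kp = torch.rand(15, 2) * 100
K = torch.tensor([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
c = torch.ones(15)
pts = torch.randn(500, 3) * 0.3 + torch.tensor([0., 0., 5.])
loss = total_loss(J3, kp, K, c, pts)

The one-sided Chamfer term reflects that LiDAR samples only the body surface, so it pulls joints loosely toward the observed geometry without requiring 3D pose labels, matching the annotation-free setting the abstract describes.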
