Exemplar Fine-Tuning for 3D Human Model Fitting Towards In-the-Wild 3D Human Pose Estimation

Paper title

Exemplar Fine-Tuning for 3D Human Model Fitting Towards In-the-Wild 3D Human Pose Estimation

Authors

Hanbyul Joo, Natalia Neverova, Andrea Vedaldi

Abstract

Unlike 2D image datasets such as COCO, large-scale human datasets with 3D ground-truth annotations are very difficult to obtain in the wild. In this paper, we address this problem by augmenting existing 2D datasets with high-quality 3D pose fits. Remarkably, the resulting annotations are sufficient to train 3D pose regressor networks from scratch that outperform the current state of the art on in-the-wild benchmarks such as 3DPW. Additionally, training on our augmented data is straightforward, as it does not require mixing multiple, incompatible 2D and 3D datasets or using complicated network architectures and training procedures. This simplified pipeline affords additional improvements, including injecting extreme crop augmentations to better reconstruct highly truncated people, and incorporating auxiliary inputs to improve 3D pose estimation accuracy. It also reduces the dependency on 3D datasets such as H36M that have restrictive licenses. We also use our method to introduce new benchmarks for the study of real-world challenges such as occlusions, truncations, and rare body poses. To obtain such high-quality 3D pseudo-annotations, and inspired by progress in internal learning, we introduce Exemplar Fine-Tuning (EFT). EFT combines the re-projection accuracy of fitting methods like SMPLify with a 3D pose prior implicitly captured by a pre-trained 3D pose regressor network. We show that EFT produces 3D annotations that result in better downstream performance and are qualitatively preferable in an extensive human-based assessment.
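
The idea behind EFT, as the abstract describes it, can be illustrated with a toy sketch: start from a pre-trained regressor, fine-tune a copy of it on a single sample so that its projected output matches the 2D keypoints, and read off the resulting 3D fit. Everything below is an assumption made for illustration and is not taken from the paper: the "regressor" is a plain linear map rather than a neural network, the output is a set of 3D joints rather than SMPL parameters, the camera is orthographic, and the names `project` and `exemplar_fine_tune` are hypothetical.

```python
import numpy as np


def project(joints3d):
    """Toy orthographic camera: keep x and y, drop depth."""
    return joints3d[:, :2]


def exemplar_fine_tune(weights, feature, keypoints2d, steps=200, lr=0.1):
    """Fine-tune a COPY of a toy linear regressor on one sample.

    weights:     (J*3, D) stand-in for pre-trained regressor parameters
    feature:     (D,) image feature of the single exemplar (unit norm assumed,
                 so that the fixed step size lr converges)
    keypoints2d: (J, 2) 2D annotations to match after projection
    Returns the (J, 3) joints predicted by the fine-tuned copy.
    """
    w = weights.copy()  # the pre-trained regressor itself is left untouched
    num_joints = keypoints2d.shape[0]
    for _ in range(steps):
        joints3d = (w @ feature).reshape(num_joints, 3)
        residual = project(joints3d) - keypoints2d  # 2D re-projection error
        grad_joints = np.zeros((num_joints, 3))
        grad_joints[:, :2] = 2.0 * residual  # squared-error gradient; none on depth
        w -= lr * np.outer(grad_joints.reshape(-1), feature)
    return (w @ feature).reshape(num_joints, 3)


# Hypothetical data: 4 joints, 8-dimensional feature.
rng = np.random.default_rng(0)
pretrained = rng.normal(size=(4 * 3, 8))
pretrained_backup = pretrained.copy()
feature = rng.normal(size=8)
feature /= np.linalg.norm(feature)
keypoints2d = rng.normal(size=(4, 2))

initial = (pretrained @ feature).reshape(4, 3)
fitted = exemplar_fine_tune(pretrained, feature, keypoints2d)
```

In this toy setting the fitted joints re-project onto the 2D keypoints, while the depth values remain those predicted by the original regressor, a crude stand-in for the pose prior the abstract says is captured implicitly by the pre-trained network. Because the fine-tuning runs on a copy, each new sample starts again from the same pre-trained weights.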
