Paper Title

JGR-P2O: Joint Graph Reasoning based Pixel-to-Offset Prediction Network for 3D Hand Pose Estimation from a Single Depth Image

Authors

Linpu Fang, Xingyan Liu, Li Liu, Hang Xu, Wenxiong Kang

Abstract

State-of-the-art single depth image-based 3D hand pose estimation methods are based on dense predictions, including voxel-to-voxel predictions, point-to-point regression, and pixel-wise estimations. Despite their good performance, these methods suffer from some inherent issues, such as a poor trade-off between accuracy and efficiency, and plain feature representation learning with local convolutions. In this paper, a novel pixel-wise prediction-based method is proposed to address these issues. The key ideas are two-fold: a) explicitly modeling the dependencies among joints and the relations between pixels and joints for better local feature representation learning; b) unifying dense pixel-wise offset predictions and direct joint regression for end-to-end training. Specifically, we first propose a graph convolutional network (GCN) based joint graph reasoning module to model the complex dependencies among joints and augment the representation capability of each pixel. Then we densely estimate all pixels' offsets to the joints in both the image plane and depth space, and calculate the joints' positions by a weighted average over all pixels' predictions, entirely discarding complex post-processing operations. The proposed model is implemented with an efficient 2D fully convolutional network (FCN) backbone and has only about 1.4M parameters. Extensive experiments on multiple 3D hand pose estimation benchmarks demonstrate that the proposed method achieves new state-of-the-art accuracy while running very efficiently, at around 110 fps on a single NVIDIA 1080Ti GPU.
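The weighted-average aggregation described in the abstract can be sketched as follows. This is a minimal NumPy illustration under assumed shapes and variable names (`weights`, `offsets_uv`, `offsets_z`, etc. are not the paper's actual API): each pixel predicts an offset to a joint, and the joint position is recovered as a confidence-weighted average of (pixel coordinate + predicted offset) over all pixels, so no argmax or other post-processing is needed.

```python
import numpy as np

# Hypothetical sketch of pixel-to-offset aggregation for a single joint.
# In the real network, weights and offsets come from the FCN backbone;
# here they are random placeholders with assumed shapes.
H, W = 8, 8
rng = np.random.default_rng(0)

# Per-pixel confidence logits, normalized to a distribution with softmax.
logits = rng.normal(size=(H, W))
weights = np.exp(logits) / np.exp(logits).sum()

# Per-pixel predicted offsets: (du, dv) in the image plane, dz in depth.
offsets_uv = rng.normal(scale=0.5, size=(H, W, 2))
offsets_z = rng.normal(scale=0.1, size=(H, W))
depth = rng.normal(size=(H, W))  # input depth value at each pixel

# Pixel coordinate grid (u = column, v = row).
vs, us = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
coords = np.stack([us, vs], axis=-1).astype(float)

# Joint position = weighted average of each pixel's prediction,
# differentiable end-to-end with no discrete post-processing step.
joint_uv = (weights[..., None] * (coords + offsets_uv)).sum(axis=(0, 1))
joint_z = (weights * (depth + offsets_z)).sum()

print(joint_uv, joint_z)
```

Because the aggregation is a plain weighted sum, gradients flow from the joint-position loss back to every pixel's offset and confidence prediction, which is what enables the unified dense-prediction + direct-regression training the abstract describes.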
