Paper Title
Learning 3D Semantics from Pose-Noisy 2D Images with Hierarchical Full Attention Network
Authors
Abstract
We propose a novel framework to learn 3D point cloud semantics from 2D multi-view image observations that contain pose error. On the one hand, directly learning from massive, unstructured, and unordered 3D point clouds is computationally and algorithmically more difficult than learning from compactly organized, context-rich 2D RGB images. On the other hand, standard autonomous-driving datasets capture both LiDAR point clouds and RGB images. This motivates a "task transfer" paradigm in which 3D semantic segmentation benefits from aggregated 2D semantic cues, even though the 2D image observations contain pose noise. Among all difficulties, pose noise and erroneous predictions from 2D semantic segmentation approaches are the main challenges for the task transfer. To alleviate the influence of these factors, we perceive each 3D point using multi-view images, and with each single image we associate a patch observation. Moreover, the semantic labels of a block of neighboring 3D points are predicted simultaneously, enabling us to exploit the point-structure prior to further improve performance. A Hierarchical Full Attention Network~(HiFANet) is designed to sequentially aggregate patch, bag-of-frames, and inter-point semantic cues, with the attention mechanism tailored to each level of semantic cues. In addition, each attention block substantially reduces the feature size before feeding its output to the next block, keeping the framework slim. Experimental results on Semantic-KITTI show that the proposed framework significantly outperforms existing 3D point cloud based methods, requires much less training data, and exhibits tolerance to pose noise. The code is available at https://github.com/yuhanghe01/HiFANet.
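The hierarchical aggregation described in the abstract (patch → bag-of-frames → inter-point, with feature-size reduction between levels) can be illustrated with a minimal sketch. This is not the authors' implementation: the attention here is a simple learned-vector softmax pooling, the feature dimensions (16 → 8 → 4) and the `attention_pool` helper are illustrative assumptions, and the final inter-point stage is shown as a single pooled block feature rather than per-point refinement.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(feats, w, proj):
    """Softmax-attention pooling over the second-to-last axis, followed by
    a linear projection to a smaller feature size (the 'slimming' step
    between attention blocks)."""
    scores = feats @ w                            # (..., M) attention logits
    alpha = softmax(scores, axis=-1)              # weights sum to 1 over M
    pooled = np.einsum('...m,...md->...d', alpha, feats)
    return pooled @ proj                          # reduced feature size

rng = np.random.default_rng(0)
N, F, P, D = 6, 4, 9, 16    # points, views per point, pixels per patch, feat dim
patch_feats = rng.normal(size=(N, F, P, D))       # per-pixel 2D semantic features

# Level 1 (patch): aggregate pixels within each patch observation.
frame_feats = attention_pool(patch_feats, rng.normal(size=D),
                             rng.normal(size=(D, 8)))     # (N, F, 8)
# Level 2 (bag-of-frames): aggregate the multi-view observations per point.
point_feats = attention_pool(frame_feats, rng.normal(size=8),
                             rng.normal(size=(8, 4)))     # (N, 4)
# Level 3 (inter-point): aggregate across the block of neighboring points.
block_feat = attention_pool(point_feats[None], rng.normal(size=4),
                            rng.normal(size=(4, 4)))[0]   # (4,)
```

Each stage both pools over one axis and shrinks the feature dimension, so later attention blocks operate on progressively smaller tensors, which is the "slim" property the abstract refers to.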