Paper Title


SurroundDepth: Entangling Surrounding Views for Self-Supervised Multi-Camera Depth Estimation

Paper Authors

Yi Wei, Linqing Zhao, Wenzhao Zheng, Zheng Zhu, Yongming Rao, Guan Huang, Jiwen Lu, Jie Zhou

Paper Abstract


Depth estimation from images serves as a fundamental step of 3D perception for autonomous driving and is an economical alternative to expensive depth sensors like LiDAR. Temporal photometric constraints enable self-supervised depth estimation without labels, further facilitating its application. However, most existing methods predict the depth solely based on each monocular image and ignore the correlations among the multiple surrounding cameras that are typically available on modern self-driving vehicles. In this paper, we propose SurroundDepth, a method that incorporates information from multiple surrounding views to predict depth maps across cameras. Specifically, we employ a joint network to process all the surrounding views and propose a cross-view transformer to effectively fuse the information from multiple views. We apply cross-view self-attention to efficiently enable global interactions between multi-camera feature maps. Different from self-supervised monocular depth estimation, we are able to predict real-world scale given the multi-camera extrinsic matrices. To achieve this goal, we adopt two-frame structure-from-motion to extract scale-aware pseudo depths to pretrain the models. Further, instead of predicting the ego-motion of each individual camera, we estimate a universal ego-motion of the vehicle and transfer it to each view to achieve multi-view ego-motion consistency. In experiments, our method achieves state-of-the-art performance on the challenging multi-camera depth estimation datasets DDAD and nuScenes.
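To make the cross-view fusion idea more concrete, the following is a minimal sketch of a cross-view self-attention layer over multi-camera feature maps, written in PyTorch. It is only an illustrative assumption of how such a layer could look, not the authors' released implementation; the class name, tensor shapes, and hyperparameters below are hypothetical.

```python
import torch
import torch.nn as nn

class CrossViewSelfAttention(nn.Module):
    """Sketch: tokens from all camera feature maps attend to each other jointly.
    Hypothetical layer for illustration, not the paper's exact architecture."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, N_cams, C, H, W) multi-camera feature maps
        b, n, c, h, w = feats.shape
        # Flatten all views into one token sequence so attention is global
        # across cameras as well as spatial positions.
        tokens = feats.permute(0, 1, 3, 4, 2).reshape(b, n * h * w, c)
        fused, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + fused)  # residual connection + layer norm
        return tokens.reshape(b, n, h, w, c).permute(0, 1, 4, 2, 3)

if __name__ == "__main__":
    # Six surrounding cameras with coarse low-resolution features.
    x = torch.randn(2, 6, 64, 12, 20)
    y = CrossViewSelfAttention(64)(x)
    print(y.shape)  # torch.Size([2, 6, 64, 12, 20])
```

In practice, such global attention is usually applied only at coarse feature resolutions, since the token count grows with the number of cameras times the spatial size.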
