Paper Title

RCDPT: Radar-Camera fusion Dense Prediction Transformer

Paper Authors

Chen-Chou Lo, Patrick Vandewalle

Paper Abstract

Recently, transformer networks have outperformed traditional deep neural networks in natural language processing and show great potential in many computer vision tasks compared to convolutional backbones. In the original transformer, readout tokens are used as designated vectors for aggregating information from other tokens. However, the performance of using readout tokens in a vision transformer is limited. Therefore, we propose a novel fusion strategy to integrate radar data into a dense prediction transformer network by reassembling camera representations with radar representations. Instead of using readout tokens, radar representations contribute additional depth information to a monocular depth estimation model and improve performance. We further investigate different fusion approaches that are commonly used for integrating an additional modality in a dense prediction transformer network. The experiments are conducted on the nuScenes dataset, which includes camera images, lidar, and radar data. The results show that our proposed method yields better performance than the commonly used fusion strategies and outperforms existing convolutional depth estimation models that fuse camera images and radar.
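The abstract's central idea, replacing the readout token in a DPT-style reassemble stage with radar-derived tokens, can be illustrated with a minimal PyTorch sketch. Everything here is an assumption for illustration: the class and layer names (RadarCameraReassemble, radar_embed, fuse) are hypothetical, and the concatenate-then-project fusion is one plausible reading of "reassembling camera representations with radar representations", not the paper's confirmed implementation.

```python
import torch
import torch.nn as nn


class RadarCameraReassemble(nn.Module):
    """Illustrative sketch: fuse radar tokens with ViT camera patch tokens
    in place of the readout token, then reassemble the token sequence into
    an image-like feature map. Not the paper's actual implementation."""

    def __init__(self, dim: int = 768, radar_channels: int = 1, patch_size: int = 16):
        super().__init__()
        # Project the sparse radar map into one token per image patch.
        self.radar_embed = nn.Conv2d(radar_channels, dim,
                                     kernel_size=patch_size, stride=patch_size)
        # Hypothetical fusion: concatenate camera and radar tokens, project back.
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, cam_tokens: torch.Tensor, radar: torch.Tensor) -> torch.Tensor:
        # cam_tokens: (B, 1 + N, dim), ViT output including the readout token.
        # radar:      (B, radar_channels, H, W), sparse radar depth channel.
        patches = cam_tokens[:, 1:, :]          # discard the readout token
        radar_tokens = self.radar_embed(radar)  # (B, dim, H/p, W/p)
        b, _, h, w = radar_tokens.shape
        radar_tokens = radar_tokens.flatten(2).transpose(1, 2)  # (B, N, dim)
        fused = self.fuse(torch.cat([patches, radar_tokens], dim=-1))
        # Reassemble the fused tokens back into a 2D feature map.
        return fused.transpose(1, 2).reshape(b, -1, h, w)


if __name__ == "__main__":
    model = RadarCameraReassemble()
    cam = torch.randn(2, 1 + (224 // 16) ** 2, 768)  # 196 patch tokens + readout
    rad = torch.randn(2, 1, 224, 224)
    print(model(cam, rad).shape)  # torch.Size([2, 768, 14, 14])
```

In the full DPT pipeline such a fusion would presumably be applied at multiple reassemble stages; the abstract's comparison against "commonly used fusion approaches" suggests alternatives such as early concatenation at the network input or late fusion of decoder features.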
