Paper Title
Look Around and Refer: 2D Synthetic Semantics Knowledge Distillation for 3D Visual Grounding
Paper Authors
Paper Abstract
The 3D visual grounding task has been explored with visual and language streams that comprehend referential language to identify target objects in 3D scenes. However, most existing methods devote the visual stream to capturing 3D visual clues using off-the-shelf point cloud encoders. The main question we address in this paper is "can we consolidate the 3D visual stream with 2D clues synthesized from point clouds and efficiently utilize them in training and testing?". The main idea is to assist the 3D encoder by incorporating rich 2D object representations without requiring extra 2D inputs. To this end, we leverage 2D clues, synthetically generated from 3D point clouds, and empirically show their ability to boost the quality of the learned visual representations. We validate our approach through comprehensive experiments on Nr3D, Sr3D, and ScanRefer datasets and show consistent performance gains compared to existing methods. Our proposed module, dubbed Look Around and Refer (LAR), significantly outperforms state-of-the-art 3D visual grounding techniques on three benchmarks, i.e., Nr3D, Sr3D, and ScanRefer. The code is available at https://eslambakr.github.io/LAR.github.io/.
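To make the core idea of the abstract concrete, below is a minimal, hypothetical sketch (not the authors' implementation): it renders a simple synthetic 2D view from a colored point cloud and embeds it with an off-the-shelf 2D backbone, so the resulting feature could serve as an auxiliary 2D clue for a 3D encoder. The function names, the top-down orthographic projection, and the ResNet-18 backbone are all illustrative assumptions; the paper's actual multi-view synthesis and fusion details are not specified here.

```python
# Hypothetical sketch (not the authors' code): synthesize a 2D view from a
# colored point cloud and embed it with a pretrained 2D backbone, so the
# 2D feature can assist a 3D point-cloud encoder during training.
import numpy as np
import torch
import torchvision

def render_synthetic_view(points, colors, image_size=224):
    """Project an (N, 3) point cloud with (N, 3) RGB colors in [0, 1]
    onto a simple top-down orthographic image (a stand-in for whatever
    multi-view projection is used to synthesize 2D clues)."""
    img = np.zeros((image_size, image_size, 3), dtype=np.float32)
    xy = points[:, :2]
    # Normalize x, y coordinates into pixel indices.
    mins, maxs = xy.min(axis=0), xy.max(axis=0)
    pix = ((xy - mins) / (maxs - mins + 1e-8) * (image_size - 1)).astype(int)
    # Paint points closest to the top-down camera last so they stay visible.
    order = np.argsort(points[:, 2])
    img[pix[order, 1], pix[order, 0]] = colors[order]
    return img

def embed_view(img):
    """Encode the synthetic view with an off-the-shelf 2D backbone."""
    backbone = torchvision.models.resnet18(weights="IMAGENET1K_V1")
    backbone.fc = torch.nn.Identity()  # keep the pooled feature vector
    backbone.eval()
    x = torch.from_numpy(img).permute(2, 0, 1).unsqueeze(0)  # (1, 3, H, W)
    with torch.no_grad():
        return backbone(x)  # (1, 512) feature usable as an auxiliary 2D clue

# Example with a random point cloud standing in for a ScanRefer object.
pts = np.random.rand(2048, 3).astype(np.float32)
rgb = np.random.rand(2048, 3).astype(np.float32)
feat_2d = embed_view(render_synthetic_view(pts, rgb))
print(feat_2d.shape)  # torch.Size([1, 512])
```

In practice such a 2D feature would only be an auxiliary training signal; since the views are synthesized from the point cloud itself, no extra 2D input is required at training or test time, which is the property the abstract emphasizes.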