Paper Title
How do Cross-View and Cross-Modal Alignment Affect Representations in Contrastive Learning?
Paper Authors
Paper Abstract
Various state-of-the-art self-supervised visual representation learning approaches take advantage of data from multiple sensors by aligning feature representations across views and/or modalities. In this work, we investigate how aligning representations affects the visual features obtained from cross-view and cross-modal contrastive learning on images and point clouds. Across five real-world datasets and five tasks, we train and evaluate 108 models based on four pretraining variations. We find that cross-modal representation alignment discards complementary visual information, such as color and texture, and instead emphasizes redundant depth cues. The depth cues obtained through pretraining improve downstream depth prediction performance. Overall, cross-modal alignment also leads to more robust encoders than pretraining with cross-view alignment, especially on depth prediction, instance segmentation, and object detection.
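The abstract does not specify the exact alignment objective, so the following is only a minimal sketch of the kind of contrastive alignment it describes: a symmetric InfoNCE loss pulling together embeddings of the same scene from two views or two modalities (e.g., an image encoder and a point-cloud encoder). The function name, feature dimensions, temperature, and the use of PyTorch are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def info_nce_alignment(z_a: torch.Tensor, z_b: torch.Tensor,
                       temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss aligning two batches of embeddings.

    z_a, z_b: (N, D) feature batches from two views or two modalities;
    row i of z_a and row i of z_b come from the same scene and form a
    positive pair, while all other rows serve as negatives.
    """
    # Cosine similarity via L2-normalized features.
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature  # (N, N) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Align matched pairs and repel mismatched ones, in both directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Example: aligning image features with point-cloud features for 8 scenes.
img_feats = torch.randn(8, 128)  # stand-in for image encoder output
pcd_feats = torch.randn(8, 128)  # stand-in for point-cloud encoder output
loss = info_nce_alignment(img_feats, pcd_feats)
```

Under this reading, cross-view alignment would feed the same loss two augmented views from one modality, while cross-modal alignment feeds it features from two different encoders, which is consistent with the paper's finding that the latter emphasizes information shared across modalities (such as depth cues) at the expense of modality-specific signals like color and texture.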