Paper Title
Copy-Pasting Coherent Depth Regions Improves Contrastive Learning for Urban-Scene Segmentation
Paper Authors
Paper Abstract
In this work, we leverage estimated depth to boost self-supervised contrastive learning for segmentation of urban scenes, where unlabeled videos are readily available for training self-supervised depth estimation. We argue that the semantics of a coherent group of pixels in 3D space are self-contained and invariant to the context in which the group appears. We group coherent, semantically related pixels into coherent depth regions given their estimated depth and use copy-paste to synthetically vary their contexts. In this way, cross-context correspondences are built in contrastive learning and a context-invariant representation is learned. For unsupervised semantic segmentation of urban scenes, our method surpasses the previous state-of-the-art baseline by +7.14% mIoU on Cityscapes and +6.65% on KITTI. For fine-tuning on Cityscapes and KITTI segmentation, our method is competitive with existing models, yet we do not need to pre-train on ImageNet or COCO and are more computationally efficient. Our code is available at https://github.com/LeungTsang/CPCDR.
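The abstract describes the core idea at a high level: pixels are grouped into coherent regions using estimated depth, each region is copy-pasted onto another image to synthetically change its context, and a contrastive loss pulls together the representations of the same region across contexts. The following is a minimal, illustrative sketch of that idea under simplifying assumptions, not the authors' implementation: the depth-bin grouping, the `paste_region` helper, the toy encoder, and the InfoNCE-style loss are all hypothetical stand-ins.

```python
# Hypothetical sketch: copy-paste contrastive learning over depth-coherent regions.
# NOT the paper's implementation; grouping, pasting, and the loss are simplified.
import torch
import torch.nn.functional as F

def coherent_depth_regions(depth, num_bins=4):
    """Group pixels into crude regions by binning estimated depth (stand-in
    for the paper's coherent-region grouping). Returns an (H, W) id map."""
    d = (depth - depth.min()) / (depth.max() - depth.min() + 1e-8)
    return (d * (num_bins - 1)).round().long()

def paste_region(src_img, dst_img, region_mask, region_id):
    """Copy one depth region's pixels from src_img onto dst_img,
    synthetically changing the region's context."""
    mask = (region_mask == region_id).unsqueeze(0)            # (1, H, W)
    return torch.where(mask, src_img, dst_img)

def region_embedding(features, region_mask, region_id):
    """Average-pool pixel features inside a region into one embedding."""
    mask = (region_mask == region_id).float()
    pooled = (features * mask).sum(dim=(-2, -1)) / (mask.sum() + 1e-8)
    return F.normalize(pooled, dim=-1)

def cross_context_contrastive_loss(emb_a, emb_b, temperature=0.1):
    """InfoNCE-style loss: the same region seen in two contexts is a
    positive pair; other regions act as negatives."""
    logits = emb_a @ emb_b.t() / temperature                  # (R, R)
    targets = torch.arange(emb_a.size(0))
    return F.cross_entropy(logits, targets)

if __name__ == "__main__":
    torch.manual_seed(0)
    encoder = torch.nn.Conv2d(3, 16, 3, padding=1)            # toy encoder
    img_a, img_b = torch.rand(3, 64, 64), torch.rand(3, 64, 64)
    depth = torch.rand(64, 64)                                 # estimated depth (stand-in)

    regions = coherent_depth_regions(depth)
    feats_a = encoder(img_a.unsqueeze(0))[0]

    emb_a, emb_b = [], []
    for rid in regions.unique():
        pasted = paste_region(img_a, img_b, regions, rid)      # same region, new context
        feats_b = encoder(pasted.unsqueeze(0))[0]
        emb_a.append(region_embedding(feats_a, regions, rid))
        emb_b.append(region_embedding(feats_b, regions, rid))

    loss = cross_context_contrastive_loss(torch.stack(emb_a), torch.stack(emb_b))
    print(f"contrastive loss: {loss.item():.4f}")
```

In this sketch the region is pasted at its original spatial location, so the same region mask indexes it in both contexts; the actual method's region extraction, augmentation, and loss details are given in the paper and repository.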