Paper Title
Copy-Pasting Coherent Depth Regions Improves Contrastive Learning for Urban-Scene Segmentation
Paper Authors
Paper Abstract
In this work, we leverage estimated depth to boost self-supervised contrastive learning for segmentation of urban scenes, where unlabeled videos are readily available for training self-supervised depth estimation. We argue that the semantics of a coherent group of pixels in 3D space are self-contained and invariant to the context in which the group appears. We group coherent, semantically related pixels into coherent depth regions given their estimated depth and use copy-paste to synthetically vary their contexts. In this way, cross-context correspondences are built in contrastive learning and a context-invariant representation is learned. For unsupervised semantic segmentation of urban scenes, our method surpasses the previous state-of-the-art baseline by +7.14% mIoU on Cityscapes and +6.65% on KITTI. For fine-tuning on Cityscapes and KITTI segmentation, our method is competitive with existing models, yet we do not need to pre-train on ImageNet or COCO and are more computationally efficient. Our code is available at https://github.com/LeungTsang/CPCDR.
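The abstract describes the core idea at a high level: pixels are grouped into coherent regions using estimated depth, each region is copy-pasted onto another image to synthetically change its context, and a contrastive loss pulls together the representations of the same region across contexts. The following is a minimal, illustrative sketch of that idea under simplifying assumptions, not the authors' implementation: the depth-bin grouping, the `paste_region` helper, the toy encoder, and the InfoNCE-style loss are all hypothetical stand-ins.

```python
# Hypothetical sketch: copy-paste contrastive learning over depth-coherent regions.
# NOT the paper's implementation; grouping, pasting, and the loss are simplified.
import torch
import torch.nn.functional as F

def coherent_depth_regions(depth, num_bins=4):
    """Group pixels into crude regions by binning estimated depth (stand-in
    for the paper's coherent-region grouping). Returns an (H, W) id map."""
    d = (depth - depth.min()) / (depth.max() - depth.min() + 1e-8)
    return (d * (num_bins - 1)).round().long()

def paste_region(src_img, dst_img, region_mask, region_id):
    """Copy one depth region's pixels from src_img onto dst_img,
    synthetically changing the region's context."""
    mask = (region_mask == region_id).unsqueeze(0)            # (1, H, W)
    return torch.where(mask, src_img, dst_img)

def region_embedding(features, region_mask, region_id):
    """Average-pool pixel features inside a region into one embedding."""
    mask = (region_mask == region_id).float()
    pooled = (features * mask).sum(dim=(-2, -1)) / (mask.sum() + 1e-8)
    return F.normalize(pooled, dim=-1)

def cross_context_contrastive_loss(emb_a, emb_b, temperature=0.1):
    """InfoNCE-style loss: the same region seen in two contexts is a
    positive pair; other regions act as negatives."""
    logits = emb_a @ emb_b.t() / temperature                  # (R, R)
    targets = torch.arange(emb_a.size(0))
    return F.cross_entropy(logits, targets)

if __name__ == "__main__":
    torch.manual_seed(0)
    encoder = torch.nn.Conv2d(3, 16, 3, padding=1)            # toy encoder
    img_a, img_b = torch.rand(3, 64, 64), torch.rand(3, 64, 64)
    depth = torch.rand(64, 64)                                 # estimated depth (stand-in)

    regions = coherent_depth_regions(depth)
    feats_a = encoder(img_a.unsqueeze(0))[0]

    emb_a, emb_b = [], []
    for rid in regions.unique():
        pasted = paste_region(img_a, img_b, regions, rid)      # same region, new context
        feats_b = encoder(pasted.unsqueeze(0))[0]
        emb_a.append(region_embedding(feats_a, regions, rid))
        emb_b.append(region_embedding(feats_b, regions, rid))

    loss = cross_context_contrastive_loss(torch.stack(emb_a), torch.stack(emb_b))
    print(f"contrastive loss: {loss.item():.4f}")
```

In this sketch the region is pasted at its original spatial location, so the same region mask indexes it in both contexts; the actual method's region extraction, augmentation, and loss details are given in the paper and repository.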