Paper Title
NamedMask: Distilling Segmenters from Complementary Foundation Models
Paper Authors
Paper Abstract
The goal of this work is to segment and name regions of images without access to pixel-level labels during training. To tackle this task, we construct segmenters by distilling the complementary strengths of two foundation models. The first, CLIP (Radford et al. 2021), exhibits the ability to assign names to image content but lacks an accessible representation of object structure. The second, DINO (Caron et al. 2021), captures the spatial extent of objects but has no knowledge of object names. Our method, termed NamedMask, begins by using CLIP to construct category-specific archives of images. These images are pseudo-labelled with a category-agnostic salient object detector bootstrapped from DINO, then refined by category-specific segmenters using the CLIP archive labels. Thanks to the high quality of the refined masks, we show that a standard segmentation architecture trained on these archives with appropriate data augmentation achieves impressive semantic segmentation abilities for both single-object and multi-object images. As a result, our proposed NamedMask performs favourably against a range of prior work on five benchmarks including the VOC2012, COCO and large-scale ImageNet-S datasets.
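The archive-construction step described above can be sketched as follows: each unlabelled image is assigned to the category whose text prompt it matches best, and the top-k highest-scoring images per category form that category's archive. This is a minimal illustration, not the paper's implementation: `clip_score` here is a random stand-in for a real CLIP image-text similarity, and the function names and the `k` parameter are hypothetical.

```python
import random
from collections import defaultdict

random.seed(0)

def clip_score(image_id: int, category: str) -> float:
    # Stand-in for CLIP similarity. A real implementation would encode
    # the image and a prompt such as "a photo of a {category}" with CLIP
    # and return their cosine similarity; here we return a random score.
    return random.random()

def build_archives(image_ids, categories, k=3):
    """Assign each image to its best-matching category, then keep the
    top-k highest-scoring images per category as that category's archive."""
    scored = defaultdict(list)
    for img in image_ids:
        scores = {c: clip_score(img, c) for c in categories}
        best = max(scores, key=scores.get)
        scored[best].append((scores[best], img))
    return {
        c: [img for _, img in sorted(pairs, reverse=True)[:k]]
        for c, pairs in scored.items()
    }

archives = build_archives(range(20), ["cat", "dog", "aeroplane"], k=3)
for category, imgs in archives.items():
    print(category, imgs)
```

In the full method, each archive image would then be pseudo-labelled by the DINO-bootstrapped salient object detector, with the CLIP archive label naming the resulting mask.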