Paper Title

CLIP-ReID: Exploiting Vision-Language Model for Image Re-Identification without Concrete Text Labels

Authors

Siyuan Li, Li Sun, Qingli Li

Abstract

Pre-trained vision-language models like CLIP have recently shown superior performances on various downstream tasks, including image classification and segmentation. However, in fine-grained image re-identification (ReID), the labels are indexes, lacking concrete text descriptions. Therefore, it remains to be determined how such models could be applied to these tasks. This paper first finds out that simply fine-tuning the visual model initialized by the image encoder in CLIP has already obtained competitive performances in various ReID tasks. Then we propose a two-stage strategy to facilitate a better visual representation. The key idea is to fully exploit the cross-modal description ability in CLIP through a set of learnable text tokens for each ID and give them to the text encoder to form ambiguous descriptions. In the first training stage, image and text encoders from CLIP keep fixed, and only the text tokens are optimized from scratch by the contrastive loss computed within a batch. In the second stage, the ID-specific text tokens and their encoder become static, providing constraints for fine-tuning the image encoder. With the help of the designed loss in the downstream task, the image encoder is able to represent data as vectors in the feature embedding accurately. The effectiveness of the proposed strategy is validated on several datasets for the person or vehicle ReID tasks. Code is available at https://github.com/Syliz517/CLIP-ReID.
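
To make the two-stage recipe in the abstract concrete, below is a minimal PyTorch sketch. It is not the authors' implementation (see the linked repository for that): `ImageEncoderStub`, `TextEncoderStub`, the token/feature dimensions, the 0.07 temperature, the learning rates, and the single cross-entropy objective per stage are all illustrative assumptions; the real model builds on CLIP's pretrained encoders and adds further ReID losses in stage two.

```python
# Minimal sketch of the two-stage training idea described in the abstract.
# NOT the released implementation (https://github.com/Syliz517/CLIP-ReID);
# the stub encoders below are hypothetical stand-ins for CLIP's encoders.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_IDS, NUM_TOKENS, TOKEN_DIM, FEAT_DIM = 100, 4, 512, 512

class ImageEncoderStub(nn.Module):
    """Hypothetical stand-in for CLIP's image encoder."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(3 * 64 * 64, FEAT_DIM)

    def forward(self, images):                       # images: (B, 3, 64, 64)
        return F.normalize(self.proj(images.flatten(1)), dim=-1)

class TextEncoderStub(nn.Module):
    """Hypothetical stand-in for CLIP's text encoder; consumes learnable token embeddings."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(NUM_TOKENS * TOKEN_DIM, FEAT_DIM)

    def forward(self, tokens):                       # tokens: (N, NUM_TOKENS, TOKEN_DIM)
        return F.normalize(self.proj(tokens.flatten(1)), dim=-1)

image_enc, text_enc = ImageEncoderStub(), TextEncoderStub()
for p in text_enc.parameters():                      # text encoder stays fixed in both stages
    p.requires_grad_(False)

# One set of learnable text tokens per identity (the "[X]_1 ... [X]_M" placeholders).
id_tokens = nn.Parameter(0.02 * torch.randn(NUM_IDS, NUM_TOKENS, TOKEN_DIM))

def stage1_step(images, labels, optimizer):
    """Stage 1: both encoders frozen; only the ID-specific text tokens are learned."""
    with torch.no_grad():
        img_feat = image_enc(images)                 # frozen image encoder
    txt_feat = text_enc(id_tokens)                   # gradients flow only into id_tokens
    logits = img_feat @ txt_feat.t() / 0.07          # image-to-text similarities
    loss = F.cross_entropy(logits, labels)           # simplified in-batch contrastive objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def stage2_step(images, labels, optimizer):
    """Stage 2: text tokens and text encoder are static; fine-tune the image encoder."""
    with torch.no_grad():
        txt_feat = text_enc(id_tokens)               # fixed ID-specific text features
    img_feat = image_enc(images)
    loss = F.cross_entropy(img_feat @ txt_feat.t() / 0.07, labels)  # text-derived constraint
    # The paper additionally applies downstream ReID losses (e.g. identity, triplet) here.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage sketch with random data.
images = torch.randn(8, 3, 64, 64)
labels = torch.randint(0, NUM_IDS, (8,))
stage1_step(images, labels, torch.optim.Adam([id_tokens], lr=3.5e-4))
stage2_step(images, labels, torch.optim.Adam(image_enc.parameters(), lr=5e-6))
```

The freezing pattern is the point of the sketch: stage one learns only the per-ID text tokens against fixed features, and stage two reuses the resulting fixed text features as an extra supervision signal while fine-tuning the image encoder.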
