Paper Title
Texts as Images in Prompt Tuning for Multi-Label Image Recognition
Paper Authors
Paper Abstract
Prompt tuning has been employed as an efficient way to adapt large vision-language pre-trained models (e.g., CLIP) to various downstream tasks in data-limited or label-limited settings. Nonetheless, visual data (e.g., images) are by default a prerequisite for learning prompts in existing methods. In this work, we advocate that the effectiveness of image-text contrastive learning in aligning the two modalities (for training CLIP) further makes it feasible to treat texts as images for prompt tuning, and we introduce TaI prompting. In contrast to visual data, text descriptions are easy to collect, and their class labels can be directly derived. In particular, we apply TaI prompting to multi-label image recognition, where sentences in the wild serve as alternatives to images for prompt tuning. Moreover, with TaI, double-grained prompt tuning (TaI-DPT) is further presented to extract both coarse-grained and fine-grained embeddings for enhancing multi-label recognition performance. Experimental results show that our proposed TaI-DPT outperforms zero-shot CLIP by a large margin on multiple benchmarks, e.g., MS-COCO, VOC2007, and NUS-WIDE, while it can be combined with existing methods of prompting from images to further improve recognition performance. Code is released at https://github.com/guozix/TaI-DPT.
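To make the "texts as images" idea concrete, below is a minimal sketch (not the authors' released code) of how a caption can stand in for an image when scoring classes with CLIP: multi-label targets are derived directly from the sentence, the caption is embedded with CLIP's text encoder, and its similarity to class-prompt embeddings is computed. The class list, the caption, the hand-written prompts (standing in for the learnable prompts that TaI-DPT tunes), and the loss choice are all illustrative assumptions.

```python
# Minimal sketch of the "texts as images" idea, assuming the open-source
# `clip` package (https://github.com/openai/CLIP) is installed.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

classes = ["person", "dog", "frisbee"]          # hypothetical label set
caption = "a person throws a frisbee to a dog"  # a "sentence in the wild"

# Multi-label targets are derived directly from the text, with no image annotation.
targets = torch.tensor([[float(c in caption) for c in classes]], device=device)

with torch.no_grad():
    # The caption embedding stands in for an image embedding.
    cap_feat = model.encode_text(clip.tokenize([caption]).to(device)).float()
    # Hand-written prompts stand in for the learnable prompts tuned in TaI-DPT.
    prompts = clip.tokenize([f"a photo of a {c}" for c in classes]).to(device)
    cls_feat = model.encode_text(prompts).float()

cap_feat = cap_feat / cap_feat.norm(dim=-1, keepdim=True)
cls_feat = cls_feat / cls_feat.norm(dim=-1, keepdim=True)

logits = cap_feat @ cls_feat.t()  # per-class similarity scores, shape (1, num_classes)
# An illustrative multi-label loss; the paper's actual training objective may differ.
loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, targets)
print(logits, loss.item())
```

In actual prompt tuning, the hand-written prompts above would be replaced by learnable prompt embeddings optimized against such text-derived labels; at test time, image embeddings take the place of the caption embedding.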