Paper Title

SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification

Paper Authors

Fang Peng, Xiaoshan Yang, Linhui Xiao, Yaowei Wang, Changsheng Xu

Paper Abstract

Although significant progress has been made in few-shot learning, most existing few-shot image classification methods require supervised pre-training on a large number of samples of base classes, which limits their generalization ability in real-world applications. Recently, large-scale Vision-Language Pre-trained models (VLPs) have been gaining increasing attention in few-shot learning because they provide a new paradigm for transferable visual representation learning, relying on text that is easily available on the Web. However, the VLPs may neglect detailed visual information that is difficult to describe in language but important for learning an effective classifier to distinguish different images. To address this problem, we propose a new framework, named Semantic-guided Visual Adapting (SgVA), which effectively extends vision-language pre-trained models to produce discriminative adapted visual features by comprehensively using an implicit knowledge distillation, a vision-specific contrastive loss, and a cross-modal contrastive loss. The implicit knowledge distillation is designed to transfer fine-grained cross-modal knowledge to guide the updating of the vision adapter. State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.
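
The abstract names a vision adapter trained with three objectives on top of frozen CLIP features. The snippet below is a minimal PyTorch sketch of how such a combination could look, based only on this description: the residual-MLP adapter, the supervised-contrastive form of the vision-specific loss, the KL-divergence stand-in for the implicit knowledge distillation, and all hyperparameters (hidden width, residual_ratio, temperature, equal loss weights) are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisionAdapter(nn.Module):
    """Illustrative residual MLP adapter applied to frozen CLIP image features."""

    def __init__(self, dim=512, hidden=256, residual_ratio=0.2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(inplace=True), nn.Linear(hidden, dim)
        )
        self.residual_ratio = residual_ratio

    def forward(self, x):
        # Blend the adapted features with the original frozen features.
        return self.residual_ratio * self.mlp(x) + (1.0 - self.residual_ratio) * x


def sgva_style_losses(img_feat, txt_feat, labels, adapter, temperature=0.07):
    """Hypothetical combination of the three objectives named in the abstract.

    img_feat: (N, D) frozen CLIP image features of the few-shot support set
    txt_feat: (C, D) frozen CLIP text features of the class prompts
    labels:   (N,)   class indices in [0, C)
    """
    adapted = F.normalize(adapter(img_feat), dim=-1)
    img_feat = F.normalize(img_feat, dim=-1)
    txt_feat = F.normalize(txt_feat, dim=-1)

    # (1) Vision-specific contrastive loss: a supervised contrastive loss among
    #     adapted image features (samples of the same class are positives).
    sim = adapted @ adapted.t() / temperature
    self_mask = torch.eye(sim.size(0), device=sim.device, dtype=torch.bool)
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float()
    pos_mask.masked_fill_(self_mask, 0.0)
    logits = sim.masked_fill(self_mask, -1e9)  # exclude self-similarity
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    vis_loss = (-(pos_mask * log_prob).sum(1) / pos_mask.sum(1).clamp(min=1)).mean()

    # (2) Cross-modal contrastive loss: adapted image features against the
    #     class text features, treated as a classification over classes.
    xmodal_loss = F.cross_entropy(adapted @ txt_feat.t() / temperature, labels)

    # (3) Stand-in for the implicit knowledge distillation: keep the adapted
    #     image-to-text similarity distribution close to the frozen CLIP one.
    teacher = (img_feat @ txt_feat.t() / temperature).softmax(dim=1)
    student = (adapted @ txt_feat.t() / temperature).log_softmax(dim=1)
    kd_loss = F.kl_div(student, teacher, reduction="batchmean")

    # Equal weighting is an assumption; the paper may balance the terms differently.
    return vis_loss + xmodal_loss + kd_loss
```

At inference, presumably only the adapter output would be used, possibly fused with CLIP's text-based similarity scores, so the adapted visual features complement the cross-modal features as the abstract describes.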
