Paper Title

Multitask Vision-Language Prompt Tuning

Authors

Sheng Shen, Shijia Yang, Tianjun Zhang, Bohan Zhai, Joseph E. Gonzalez, Kurt Keutzer, Trevor Darrell

Abstract

Prompt tuning, which conditions a frozen pretrained model on task-specific learned prompt vectors, has emerged as a data-efficient and parameter-efficient method for adapting large pretrained vision-language models to multiple downstream tasks. However, existing approaches typically learn the prompt vectors for each task independently and from scratch, failing to exploit the rich knowledge shareable across vision-language tasks. In this paper, we propose multitask vision-language prompt tuning (MVLPT), which incorporates cross-task knowledge into prompt tuning for vision-language models. Specifically, (i) we demonstrate the effectiveness of learning a single transferable prompt from multiple source tasks and using it to initialize the prompt for each target task; (ii) we show that many target tasks can benefit from sharing prompt vectors and can thus be learned jointly via multitask prompt tuning. We benchmark the proposed MVLPT using three representative prompt tuning methods, namely text prompt tuning, visual prompt tuning, and unified vision-language prompt tuning. Results on 20 vision tasks demonstrate that the proposed approach outperforms all single-task baseline prompt tuning methods, setting a new state of the art on the few-shot ELEVATER benchmark and on cross-task generalization benchmarks. To understand where cross-task knowledge is most effective, we also conduct a large-scale study of task transferability, covering 400 source-target combinations of the 20 vision tasks for each prompt tuning method. It shows that the best-performing MVLPT variant for each prompt tuning method prefers different task combinations, and that many tasks can benefit each other, depending on their visual similarity and label similarity. Code is available at https://github.com/sIncerass/MVLPT.
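
The core mechanism in idea (ii), a single learnable prompt shared across tasks, can be illustrated with a short sketch. The PyTorch example below is a hypothetical, minimal rendering, not the paper's method: the class name `MultitaskPromptTuner`, the small generic transformer backbone, the per-task linear heads, and the task names "pets"/"flowers" with random tensors are all illustrative assumptions (the actual MVLPT implementation in the linked repo builds on CLIP with CoOp/VPT/UPT-style prompts). The backbone is frozen, one soft prompt is prepended to every task's inputs, and summing the per-task losses lets gradients from all tasks flow into that shared prompt.

```python
import torch
import torch.nn as nn

class MultitaskPromptTuner(nn.Module):
    """Hypothetical sketch: one shared soft prompt, one linear head per task."""

    def __init__(self, frozen_encoder, embed_dim, prompt_len, task_num_classes):
        super().__init__()
        self.encoder = frozen_encoder
        for p in self.encoder.parameters():  # the backbone stays frozen
            p.requires_grad = False
        # The shared prompt is where cross-task knowledge accumulates.
        self.shared_prompt = nn.Parameter(0.02 * torch.randn(prompt_len, embed_dim))
        self.heads = nn.ModuleDict(
            {name: nn.Linear(embed_dim, n) for name, n in task_num_classes.items()}
        )

    def forward(self, token_embeds, task):
        # Prepend the shared prompt to each example's embedded input tokens.
        prompt = self.shared_prompt.expand(token_embeds.size(0), -1, -1)
        feats = self.encoder(torch.cat([prompt, token_embeds], dim=1))
        return self.heads[task](feats.mean(dim=1))  # mean-pool, then classify


# Usage sketch: summing per-task losses sends every task's gradient
# into the same shared prompt.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2,
)
tasks = {"pets": 37, "flowers": 102}  # illustrative task -> #classes map
model = MultitaskPromptTuner(encoder, embed_dim=64, prompt_len=4,
                             task_num_classes=tasks)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)

batches = {t: (torch.randn(8, 16, 64), torch.randint(0, n, (8,)))
           for t, n in tasks.items()}
loss = sum(criterion(model(x, t), y) for t, (x, y) in batches.items())
loss.backward()
optimizer.step()
```

Under the same assumptions, idea (i) would correspond to training such a shared prompt on the source tasks and then copying `shared_prompt` into a fresh model as the initialization before tuning on each target task.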
