Paper Title

PromptCap: Prompt-Guided Task-Aware Image Captioning

Authors

Yushi Hu, Hang Hua, Zhengyuan Yang, Weijia Shi, Noah A. Smith, Jiebo Luo

Abstract


Knowledge-based visual question answering (VQA) involves questions that require world knowledge beyond the image to yield the correct answer. Large language models (LMs) like GPT-3 are particularly helpful for this task because of their strong knowledge retrieval and reasoning capabilities. To enable LMs to understand images, prior work uses a captioning model to convert images into text. However, when summarizing an image in a single caption sentence, it is often underspecified which visual entities to describe. Generic image captions often miss visual details essential for the LM to answer visual questions correctly. To address this challenge, we propose PromptCap (Prompt-guided image Captioning), a captioning model designed to serve as a better connector between images and black-box LMs. Unlike generic captions, PromptCap takes a natural-language prompt to control the visual entities described in the generated caption. The prompt contains a question that the caption should aid in answering. To avoid extra annotation, PromptCap is trained on examples synthesized with GPT-3 and existing datasets. We demonstrate PromptCap's effectiveness on an existing pipeline in which GPT-3 is prompted with image captions to carry out VQA. PromptCap outperforms generic captions by a large margin and achieves state-of-the-art accuracy on knowledge-based VQA tasks (60.4% on OK-VQA and 59.6% on A-OKVQA). Zero-shot results on WebQA show that PromptCap generalizes well to unseen domains.
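The caption-then-answer pipeline the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and exact prompt wording are assumptions, and the model calls (PromptCap producing a caption, GPT-3 answering) are left as stand-ins, since only the prompt construction is described in the abstract.

```python
# Sketch of the PromptCap-style VQA pipeline (assumed names and wording).
# Step 1: condition the captioner on the question it should help answer.
# Step 2: prompt a black-box LM with the caption as the image's text context.

def build_caption_prompt(question: str) -> str:
    """Instruction given to the captioner; wording here is an assumption."""
    return f"Please describe this image according to the given question: {question}"

def build_vqa_prompt(caption: str, question: str, examples=()) -> str:
    """Few-shot prompt for the LM: optional in-context (caption, question,
    answer) triples, then the test instance ending in an open 'Answer:'."""
    parts = [f"Context: {c}\nQuestion: {q}\nAnswer: {a}" for c, q, a in examples]
    parts.append(f"Context: {caption}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(parts)

# Hypothetical instance: the caption would come from PromptCap, and the
# finished prompt would be sent to GPT-3 (or another black-box LM).
question = "What sport can you use this vehicle for?"
caption = "A man stands next to a truck towing a small sailboat."
vqa_prompt = build_vqa_prompt(caption, question)
print(vqa_prompt)
```

The key design point is that the question appears twice: once to steer the captioner toward question-relevant visual details, and once in the LM prompt where the question-aware caption substitutes for the image.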
