Paper Title
Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners
Paper Authors
Paper Abstract
The goal of this work is to build flexible video-language models that can generalize to various video-to-text tasks from a few examples, such as domain-specific captioning, question answering, and future event prediction. Existing few-shot video-language learners focus exclusively on the encoder, resulting in the absence of a video-to-text decoder to handle generative tasks. Video captioners have been pretrained on large-scale video-language datasets, but they rely heavily on finetuning and lack the ability to generate text for unseen tasks in a few-shot setting. We propose VidIL, a few-shot Video-language Learner via Image and Language models, which demonstrates strong performance on few-shot video-to-text tasks without pretraining or finetuning on any video dataset. We use image-language models to translate the video content into frame captions, object, attribute, and event phrases, and compose them into a temporal-structure template. We then instruct a language model, with a prompt containing a few in-context examples, to generate a target output from the composed content. The flexibility of prompting allows the model to incorporate any form of text input, such as automatic speech recognition (ASR) transcripts. Our experiments demonstrate the power of language models in understanding videos on a wide variety of video-language tasks, including video captioning, video question answering, video caption retrieval, and video future event prediction. In particular, on video future event prediction, our few-shot model significantly outperforms state-of-the-art supervised models trained on large-scale video datasets. Code and resources are publicly available for research purposes at https://github.com/MikeWangWZHL/VidIL.
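To make the described pipeline concrete, below is a minimal Python sketch of the composition step: frame-level captions, objects, and events are arranged into a temporally ordered template, which is then combined with a few in-context examples (and optionally an ASR transcript) into a prompt for a language model. The function names, ordinal markers, and template wording here are illustrative assumptions, not the authors' released implementation; see the repository linked above for the actual code.

```python
from typing import Dict, List, Tuple

# Illustrative temporal markers for ordering frame-level descriptions.
ORDINALS = ["First", "Then", "After that", "Finally"]


def compose_temporal_template(frames: List[Dict[str, str]]) -> str:
    """Compose per-frame captions, objects, and events into a temporally
    ordered textual description of the video (template wording assumed)."""
    lines = []
    for i, frame in enumerate(frames):
        marker = ORDINALS[min(i, len(ORDINALS) - 1)]
        lines.append(
            f"{marker}, {frame['caption']} "
            f"Objects: {frame['objects']}. Events: {frame['events']}."
        )
    return "\n".join(lines)


def build_fewshot_prompt(
    examples: List[Tuple[List[Dict[str, str]], str]],
    query_frames: List[Dict[str, str]],
    asr: str = "",
) -> str:
    """Build a few-shot prompt: (video representation, target) pairs as
    in-context examples, followed by the query video's representation."""
    parts = ["Generate a caption for the video described below.\n"]
    for ex_frames, target in examples:
        parts.append(compose_temporal_template(ex_frames))
        parts.append(f"Caption: {target}\n")
    parts.append(compose_temporal_template(query_frames))
    if asr:  # ASR transcript appended as extra free-form textual context
        parts.append(f"Transcript: {asr}")
    parts.append("Caption:")
    return "\n".join(parts)


if __name__ == "__main__":
    # Toy query video with hypothetical frame-level outputs from an
    # image-language model (e.g., a frame captioner).
    frames = [
        {"caption": "a man opens a laptop.", "objects": "man, laptop",
         "events": "opening a laptop"},
        {"caption": "the man types on the keyboard.", "objects": "man, keyboard",
         "events": "typing"},
    ]
    print(build_fewshot_prompt(examples=[], query_frames=frames))
```

The resulting prompt string would be sent to a text-only language model, which generates the target output (a caption, an answer, or a predicted future event) conditioned purely on the composed textual representation of the video.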