Paper Title

Robustness Analysis of Video-Language Models Against Visual and Language Perturbations

Paper Authors

Madeline C. Schiappa, Shruti Vyas, Hamid Palangi, Yogesh S. Rawat, Vibhav Vineet

Paper Abstract

Joint visual and language modeling on large-scale datasets has recently shown good progress in multi-modal tasks when compared to single-modal learning. However, the robustness of these approaches against real-world perturbations has not been studied. In this work, we perform the first extensive robustness study of video-language models against various real-world perturbations. We focus on text-to-video retrieval and propose two large-scale benchmark datasets, MSRVTT-P and YouCook2-P, which utilize 90 different visual and 35 different text perturbations. The study reveals some interesting initial findings from the studied models: 1) models are generally more susceptible when only video is perturbed as opposed to when only text is perturbed, 2) models that are pre-trained are more robust than those trained from scratch, 3) models attend more to scene and objects than to motion and action. We hope this study will serve as a benchmark and guide future research in robust video-language learning. The benchmark introduced in this study, along with the code and datasets, is available at https://bit.ly/3CNOly4.
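
As a rough illustration of the evaluation protocol the abstract describes (comparing retrieval performance on clean versus perturbed inputs), below is a minimal Python sketch. It is not the authors' released code: the embeddings are random stand-ins for model outputs, `recall_at_k` assumes the ground-truth video for each text query sits on the diagonal of the similarity matrix, and `relative_robustness` is one common way to normalize the performance drop under perturbation.

```python
import numpy as np

def recall_at_k(sim, k=5):
    """Fraction of text queries whose matching video (assumed to be the
    diagonal index) appears among the top-k retrieved videos."""
    # sim: [num_texts, num_videos] similarity matrix.
    ranks = np.argsort(-sim, axis=1)  # indices sorted by descending similarity
    hits = [i in ranks[i, :k] for i in range(sim.shape[0])]
    return float(np.mean(hits))

def relative_robustness(clean_score, perturbed_score):
    """1 - (drop / clean score); 1.0 means no degradation under perturbation."""
    return 1.0 - (clean_score - perturbed_score) / clean_score

def sim_matrix(a, b):
    """Cosine similarity between all rows of a and all rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# Toy example: random embeddings stand in for a video-language model's
# text and video encoders; "perturbed" videos get noisier embeddings.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(100, 256))
video_emb_clean = text_emb + 0.1 * rng.normal(size=(100, 256))
video_emb_pert = text_emb + 0.5 * rng.normal(size=(100, 256))

clean = recall_at_k(sim_matrix(text_emb, video_emb_clean))
pert = recall_at_k(sim_matrix(text_emb, video_emb_pert))
print(f"R@5 clean={clean:.3f}  perturbed={pert:.3f}  "
      f"relative robustness={relative_robustness(clean, pert):.3f}")
```

In the actual benchmarks, the perturbed similarity matrix would come from re-encoding videos after applying one of the 90 visual perturbations (or captions after one of the 35 text perturbations), then averaging the metric across perturbation types and severities.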
