论文标题
看到您错过的内容:语义完成学习的视觉预训练
Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning
论文作者
论文摘要
跨模式比对对于视觉语言预训练(VLP)模型至关重要,以学习跨不同方式的正确相应信息。为此,受到NLP预训练领域中蒙版语言建模(MLM)任务的成功的启发,已提出了许多蒙版的建模任务,以进一步促进跨模式相互作用。以前的掩盖建模任务的核心思想是专注于基于可见的上下文来学习局部到局部对齐的掩盖令牌。但是,他们中的大多数人很少关注掩盖数据生成的全球语义特征,从而导致全球表示的跨模式对齐能力有限。因此,在本文中,我们提出了一项新颖的语义完成学习(SCL)任务,该任务是与现有的蒙版建模任务互补的,以促进全球到本地的一致性。具体而言,SCL任务通过捕获其他模式的相应信息来补充蒙版数据的缺失语义,从而促进了学习更多代表性的全局特征,这些特征对下游任务的性能产生了很大的影响。此外,我们提出了一个灵活的视觉编码器,它使我们的模型能够同时执行图像文本和视频文本多模式任务。实验结果表明,我们提出的方法在各种视觉语言基准中获得了最先进的性能,例如视觉询问答案,图像文本检索和视频文本检索。
Cross-modal alignment is essential for vision-language pre-training (VLP) models to learn the correct corresponding information across different modalities. For this purpose, inspired by the success of masked language modeling (MLM) tasks in the NLP pre-training area, numerous masked modeling tasks have been proposed for VLP to further promote cross-modal interactions. The core idea of previous masked modeling tasks is to focus on reconstructing the masked tokens based on visible context for learning local-to-local alignment. However, most of them pay little attention to the global semantic features generated for the masked data, resulting in a limited cross-modal alignment ability of global representations. Therefore, in this paper, we propose a novel Semantic Completion Learning (SCL) task, complementary to existing masked modeling tasks, to facilitate global-to-local alignment. Specifically, the SCL task complements the missing semantics of masked data by capturing the corresponding information from the other modality, promoting learning more representative global features which have a great impact on the performance of downstream tasks. Moreover, we present a flexible vision encoder, which enables our model to perform image-text and video-text multimodal tasks simultaneously. Experimental results show that our proposed method obtains state-of-the-art performance on various vision-language benchmarks, such as visual question answering, image-text retrieval, and video-text retrieval.