Paper Title
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections
Paper Authors
Paper Abstract
Large-scale pre-trained foundation models have emerged as a paradigm for building artificial intelligence (AI) systems that can be quickly adapted to a wide range of downstream tasks. This paper presents mPLUG, a new vision-language foundation model for both cross-modal understanding and generation. Most existing pre-trained models suffer from low computational efficiency and information asymmetry caused by the long visual sequence in cross-modal alignment. To address these problems, mPLUG introduces an effective and efficient vision-language architecture with novel cross-modal skip-connections, which create inter-layer shortcuts that skip a certain number of layers of time-consuming full self-attention on the vision side. mPLUG is pre-trained end-to-end on large-scale image-text pairs with both discriminative and generative objectives. It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding, and visual question answering. mPLUG also demonstrates strong zero-shot transferability when directly transferred to multiple video-language tasks.
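To make the architectural idea concrete, below is a minimal sketch of a cross-modal skip-connected block, based only on the abstract's description: the long visual sequence skips several fusion layers (avoiding repeated full self-attention over visual tokens) and rejoins the text stream at a periodic "connected" layer via a skip-connection. The layer counts, dimensions, and module names here are illustrative assumptions, not the paper's exact design.

```python
# Hedged sketch of a cross-modal skip-connected block (not the paper's exact layout).
import torch
import torch.nn as nn


class CrossModalSkipBlock(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4, skipped_layers: int = 3):
        super().__init__()
        # Asymmetric layers: text self-attends and cross-attends to the visual
        # features; the visual sequence itself is NOT updated in these layers,
        # so no full self-attention is paid over the long visual sequence here.
        self.text_self = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(skipped_layers)
        )
        self.text_cross = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(skipped_layers)
        )
        # Connected layer: vision re-enters through the skip-connection and the
        # concatenated sequence gets a single round of full self-attention.
        self.joint = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, vis: torch.Tensor, txt: torch.Tensor):
        for self_attn, cross_attn in zip(self.text_self, self.text_cross):
            txt = txt + self_attn(txt, txt, txt, need_weights=False)[0]
            # Vision serves as keys/values only; its representation is skipped.
            txt = txt + cross_attn(txt, vis, vis, need_weights=False)[0]
        fused = torch.cat([vis, txt], dim=1)  # skip-connection: vision rejoins
        fused = fused + self.joint(fused, fused, fused, need_weights=False)[0]
        return fused[:, : vis.size(1)], fused[:, vis.size(1):]


block = CrossModalSkipBlock()
vis = torch.randn(2, 196, 256)  # long visual token sequence (e.g. 14x14 patches)
txt = torch.randn(2, 16, 256)   # short text token sequence
new_vis, new_txt = block(vis, txt)
print(new_vis.shape, new_txt.shape)
```

The payoff of this layout is that full self-attention, which is quadratic in sequence length, runs over the long visual sequence only once per block rather than at every layer.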
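The abstract also states that pre-training combines discriminative and generative objectives, without naming the exact losses. A hedged sketch of one plausible combination follows: in-batch image-text contrastive matching as the discriminative term and autoregressive caption cross-entropy as the generative term. The loss choices, temperature, and equal weighting are assumptions for illustration.

```python
# Hedged sketch of a combined discriminative + generative pre-training loss.
import torch
import torch.nn.functional as F


def pretraining_loss(img_emb, txt_emb, caption_logits, caption_ids, temperature=0.07):
    # Discriminative: symmetric InfoNCE over in-batch image-text pairs.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    itc = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
    # Generative: token-level cross-entropy for caption generation.
    lm = F.cross_entropy(caption_logits.flatten(0, 1), caption_ids.flatten())
    return itc + lm  # equal weighting is an assumption


B, T, V, D = 4, 12, 1000, 256
loss = pretraining_loss(
    torch.randn(B, D), torch.randn(B, D),
    torch.randn(B, T, V), torch.randint(0, V, (B, T)),
)
print(loss.item())
```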