Paper Title

Generalizing Multimodal Pre-training into Multilingual via Language Acquisition

Paper Authors

Liang Zhang, Anwen Hu, Qin Jin

Paper Abstract

English-based Vision-Language Pre-training (VLP) has achieved great success in various downstream tasks. Some efforts have been taken to generalize this success to non-English languages through Multilingual Vision-Language Pre-training (M-VLP). However, due to the large number of languages, M-VLP models often require huge computing resources and cannot be flexibly extended to new languages. In this work, we propose a MultiLingual Acquisition (MLA) framework that can easily generalize a monolingual Vision-Language Pre-training model into multilingual. Specifically, we design a lightweight language acquisition encoder based on state-of-the-art monolingual VLP models. We further propose a two-stage training strategy to optimize the language acquisition encoder, namely the Native Language Transfer stage and the Language Exposure stage. With much less multilingual training data and computing resources, our model achieves state-of-the-art performance on multilingual image-text and video-text retrieval benchmarks.
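To make the two-stage training strategy concrete, below is a minimal PyTorch sketch of one plausible reading of the abstract: a lightweight multilingual "acquisition" encoder is trained against a frozen monolingual VLP text encoder, first on native-language (English) text, then on translated pairs. All class names, model sizes, and loss choices here (MSE for Native Language Transfer, a contrastive loss for Language Exposure) are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of the MLA framework described in the abstract.
# The frozen teacher stands in for a monolingual VLP text encoder (e.g., CLIP-style);
# the student is the "lightweight language acquisition encoder".
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenVLPTextEncoder(nn.Module):
    """Stand-in for a pre-trained monolingual (English) VLP text encoder."""
    def __init__(self, vocab=50_000, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.proj = nn.Linear(dim, dim)
        for p in self.parameters():       # kept frozen throughout MLA training
            p.requires_grad = False

    def forward(self, token_ids):         # (B, T) -> (B, dim)
        return self.proj(self.embed(token_ids).mean(dim=1))

class AcquisitionEncoder(nn.Module):
    """Lightweight multilingual encoder attached to the frozen VLP model."""
    def __init__(self, vocab=250_000, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)  # "lightweight"
        self.proj = nn.Linear(dim, dim)

    def forward(self, token_ids):          # (B, T) -> (B, dim)
        return self.proj(self.encoder(self.embed(token_ids)).mean(dim=1))

def native_transfer_loss(student, teacher, en_ids):
    """Stage 1 (Native Language Transfer): on English text only, pull the
    acquisition encoder's output toward the frozen teacher's embedding."""
    return F.mse_loss(student(en_ids), teacher(en_ids))

def language_exposure_loss(student, teacher, en_ids, xx_ids, tau=0.05):
    """Stage 2 (Language Exposure): align translated (English, non-English)
    sentence pairs in the teacher's embedding space with a contrastive loss."""
    z_xx = F.normalize(student(xx_ids), dim=-1)   # non-English side (student)
    z_en = F.normalize(teacher(en_ids), dim=-1)   # English side (frozen teacher)
    logits = z_xx @ z_en.t() / tau                # (B, B) similarity matrix
    labels = torch.arange(len(logits))            # matching pairs on the diagonal
    return F.cross_entropy(logits, labels)
```

Because only the small acquisition encoder receives gradients, this setup is consistent with the abstract's claim of requiring far less multilingual data and compute than full M-VLP, and a new language can be supported by continuing Stage 2 on additional translation pairs.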
