cViL: Cross-Lingual Training of Vision-Language Models using Knowledge Distillation

Paper title

cViL: Cross-Lingual Training of Vision-Language Models using Knowledge Distillation

Authors

Kshitij Gupta, Devansh Gautam, Radhika Mamidi

Abstract

Vision-and-language tasks are gaining popularity in the research community, but the focus is still mainly on English. We propose a pipeline that uses English-only vision-language models to train a monolingual model for a target language. We extend OSCAR+, a model that leverages object tags as anchor points for learning image-text alignments, to train on visual question answering datasets in different languages. We propose a novel knowledge distillation approach that trains the model in other languages using parallel sentences. Compared to models that include the target language in their pretraining corpora, we can leverage an existing English model to transfer knowledge to the target language using significantly fewer resources. We also release a large-scale visual question answering dataset in Japanese and Hindi. Although we restrict our work to visual question answering, our model can be extended to any sequence-level classification task and to other languages. This paper focuses on two languages for the visual question answering task: Japanese and Hindi. Our pipeline outperforms the current state-of-the-art models, with relative accuracy gains of 4.4% and 13.4% on Japanese and Hindi, respectively.
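The abstract does not spell out the distillation objective, so the following is only a toy sketch of the general idea of distilling through parallel sentences, not the authors' method. It assumes a frozen English "teacher" that yields one sentence-level vector per English sentence, and a "student" that must reproduce that vector from the parallel target-language sentence. The mean-squared-error loss, the linear student, the random stand-in features, and all function names are illustrative assumptions.

```python
import random

def distillation_loss(teacher_vecs, student_vecs):
    """Mean squared error between teacher and student sentence-level vectors."""
    total, count = 0.0, 0
    for t_row, s_row in zip(teacher_vecs, student_vecs):
        for t, s in zip(t_row, s_row):
            total += (s - t) ** 2
            count += 1
    return total / count

def student_forward(feats, W):
    """Hypothetical linear student: maps target-language features to vectors."""
    d_out = len(W[0])
    return [[sum(x[i] * W[i][j] for i in range(len(x))) for j in range(d_out)]
            for x in feats]

def train_student(teacher_vecs, feats, steps=200, lr=0.05):
    """Gradient descent on the MSE loss; the teacher outputs stay fixed."""
    d_in, d_out = len(feats[0]), len(teacher_vecs[0])
    W = [[0.0] * d_out for _ in range(d_in)]
    scale = 2.0 / (len(feats) * d_out)
    for _ in range(steps):
        preds = student_forward(feats, W)  # computed once per step
        for i in range(d_in):
            for j in range(d_out):
                grad = scale * sum(x[i] * (p[j] - t[j])
                                   for x, p, t in zip(feats, preds, teacher_vecs))
                W[i][j] -= lr * grad
    return W

rng = random.Random(0)
# Stand-ins (assumed): teacher vectors for 8 English sentences, and features
# of the 8 parallel target-language sentences seen by the student.
teacher_vecs = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(8)]
target_feats = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(8)]

W0 = [[0.0] * 4 for _ in range(6)]
loss_before = distillation_loss(teacher_vecs, student_forward(target_feats, W0))
W = train_student(teacher_vecs, target_feats)
loss_after = distillation_loss(teacher_vecs, student_forward(target_feats, W))
```

In this sketch the student never sees English text; it learns only from the teacher's outputs on the parallel sentences, which is the kind of transfer the abstract describes. A real system would replace the linear map with a vision-language model and might match other signals as well.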
