Paper Title
Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese
Paper Authors
Paper Abstract
The tremendous success of CLIP (Radford et al., 2021) has promoted the research and application of contrastive learning for vision-language pretraining. In this work, we construct a large-scale dataset of image-text pairs in Chinese, where most data are retrieved from publicly available datasets, and we pretrain Chinese CLIP models on the new dataset. We develop five Chinese CLIP models of multiple sizes, spanning from 77 million to 958 million parameters. Furthermore, we propose a two-stage pretraining method, in which the model is first trained with the image encoder frozen and then trained with all parameters optimized, to achieve enhanced model performance. Our comprehensive experiments demonstrate that Chinese CLIP achieves state-of-the-art performance on MUGE, Flickr30K-CN, and COCO-CN in both the zero-shot learning and finetuning setups, and that it achieves competitive performance in zero-shot image classification based on the evaluation on the ELEVATER benchmark (Li et al., 2022). We have released our code, models, and demos at https://github.com/OFA-Sys/Chinese-CLIP.
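To make the two-stage scheme concrete, below is a minimal PyTorch sketch of CLIP-style contrastive pretraining with an image encoder that is first frozen and later unfrozen. This is an illustration under assumptions, not the released implementation (see the repository above): the `TwoTowerCLIP` class, the `configure_stage` helper, and the encoder modules are hypothetical names.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerCLIP(nn.Module):
    """Illustrative two-tower model: an image encoder and a text encoder
    assumed to emit embeddings of the same dimensionality."""

    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module):
        super().__init__()
        self.image_encoder = image_encoder
        self.text_encoder = text_encoder
        # Learnable temperature, initialized to log(1/0.07) as in CLIP.
        self.logit_scale = nn.Parameter(torch.tensor(2.6593))

    def forward(self, images: torch.Tensor, texts: torch.Tensor) -> torch.Tensor:
        img = F.normalize(self.image_encoder(images), dim=-1)
        txt = F.normalize(self.text_encoder(texts), dim=-1)
        logits = self.logit_scale.exp() * img @ txt.t()
        labels = torch.arange(logits.size(0), device=logits.device)
        # Symmetric InfoNCE: matched image-text pairs lie on the diagonal.
        return (F.cross_entropy(logits, labels) +
                F.cross_entropy(logits.t(), labels)) / 2

def configure_stage(model: TwoTowerCLIP, stage: int) -> None:
    """Stage 1 freezes the image encoder (only the text tower and the
    temperature are tuned); stage 2 unfreezes all parameters."""
    for p in model.image_encoder.parameters():
        p.requires_grad = (stage == 2)

# Usage sketch: rebuild the optimizer after switching stages so it only
# tracks parameters that are currently trainable, e.g.
#   configure_stage(model, 1); ...train...; configure_stage(model, 2); ...train...
#   opt = torch.optim.AdamW(p for p in model.parameters() if p.requires_grad)
```

The intuition behind the freezing step is to first align the text tower to a fixed, pretrained image representation before jointly tuning both towers; the hyperparameters and training details of the actual models are documented in the repository.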