Paper Title

EVA: Exploring the Limits of Masked Visual Representation Learning at Scale

Paper Authors

Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, Yue Cao

Paper Abstract

We launch EVA, a vision-centric foundation model to explore the limits of visual representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained to reconstruct the masked-out image-text aligned vision features conditioned on visible image patches. Via this pretext task, we can efficiently scale up EVA to one billion parameters and set new records on a broad range of representative vision downstream tasks, such as image recognition, video action recognition, object detection, instance segmentation, and semantic segmentation, without heavy supervised training. Moreover, we observe that quantitative changes in scaling EVA result in qualitative changes in transfer learning performance that are not present in other models. For instance, EVA takes a great leap in the challenging large-vocabulary instance segmentation task: our model achieves almost the same state-of-the-art performance on the LVISv1.0 dataset, with over a thousand categories, and the COCO dataset, with only eighty categories. Beyond a pure vision encoder, EVA can also serve as a vision-centric, multi-modal pivot to connect images and text. We find that initializing the vision tower of a giant CLIP from EVA can greatly stabilize training and outperform its from-scratch counterpart with far fewer samples and less compute, providing a new direction for scaling up and accelerating the costly training of multi-modal foundation models. To facilitate future research, we release all the code and models at https://github.com/baaivision/EVA.
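
The pretext task described above, regressing image-text aligned (CLIP) vision features for masked patches conditioned on the visible ones, can be illustrated with a minimal sketch. The class name `EVAPretrainSketch`, the tiny encoder configuration, and the random stand-in teacher features below are illustrative assumptions, not the paper's actual setup; the official implementation is at https://github.com/baaivision/EVA.

```python
# Minimal sketch of EVA's masked feature reconstruction pretext task (illustrative only):
# a plain ViT-style encoder predicts the CLIP patch features of masked positions,
# given only the visible image patches. In the paper the targets come from a frozen
# CLIP vision tower; here random tensors stand in for them.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EVAPretrainSketch(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=768, depth=4, heads=12, clip_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, dim))
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, clip_dim)  # regress CLIP patch features

    def forward(self, images, mask):
        # images: (B, 3, H, W); mask: (B, N) bool, True = patch is masked out
        x = self.patch_embed(images).flatten(2).transpose(1, 2)  # (B, N, dim)
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        x = self.encoder(x + self.pos_embed)
        return self.head(x)  # predicted features for every patch position


def mim_loss(pred, clip_feats, mask):
    # negative cosine similarity, computed on the masked positions only
    pred = F.normalize(pred, dim=-1)
    target = F.normalize(clip_feats, dim=-1)
    cos = (pred * target).sum(-1)  # (B, N)
    return -(cos[mask]).mean()


if __name__ == "__main__":
    model = EVAPretrainSketch()
    imgs = torch.randn(2, 3, 224, 224)
    mask = torch.rand(2, model.num_patches) < 0.4       # ~40% of patches masked
    clip_feats = torch.randn(2, model.num_patches, 768)  # stand-in for frozen CLIP patch features
    loss = mim_loss(model(imgs, mask), clip_feats, mask)
    loss.backward()
    print(float(loss))
```

Scoring only the masked positions keeps the objective focused on predicting unseen content from visible context, which is the property that lets this recipe scale the encoder to a billion parameters without labels.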
