Paper Title

Bootstrapped Masked Autoencoders for Vision BERT Pretraining

Paper Authors

Xiaoyi Dong, Jianmin Bao, Ting Zhang, Dongdong Chen, Weiming Zhang, Lu Yuan, Dong Chen, Fang Wen, Nenghai Yu

Paper Abstract

We propose bootstrapped masked autoencoders (BootMAE), a new approach for vision BERT pretraining. BootMAE improves the original masked autoencoders (MAE) with two core designs: 1) a momentum encoder that provides online features as extra BERT prediction targets; 2) a target-aware decoder that reduces the pressure on the encoder to memorize target-specific information during BERT pretraining. The first design is motivated by the observation that using a pretrained MAE to extract features as the BERT prediction target for masked tokens achieves better pretraining performance. We therefore add a momentum encoder in parallel with the original MAE encoder, which bootstraps the pretraining performance by using its own representation as the BERT prediction target. In the second design, we feed target-specific information (e.g., pixel values of unmasked patches) directly to the decoder, reducing the pressure on the encoder to memorize it. The encoder thus focuses on semantic modeling, which is the goal of BERT pretraining, and does not waste its capacity memorizing information about unmasked tokens that is only relevant to the prediction target. Through extensive experiments, BootMAE achieves $84.2\%$ Top-1 accuracy on ImageNet-1K with a ViT-B backbone, outperforming MAE by $+0.8\%$ under the same number of pretraining epochs. BootMAE also obtains a $+1.0$ mIoU improvement on ADE20K semantic segmentation, and $+1.3$ box AP and $+1.4$ mask AP improvements on COCO object detection and instance segmentation. Code is released at https://github.com/LightDXY/BootMAE.
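To make the two designs concrete, below is a minimal PyTorch sketch of a single BootMAE-style pretraining step: a momentum (EMA) copy of the encoder supplies feature targets for the masked tokens, and the decoder receives raw unmasked pixel values directly so the encoder need not memorize them. All names (TinyEncoder, TargetAwareDecoder, bootmae_step), shapes, and hyperparameters are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the two BootMAE ideas from the abstract, in PyTorch.
# Everything here is an illustrative assumption, not the released code.
import copy
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in for the ViT encoder: maps (B, N, D) tokens to (B, N, D)."""
    def __init__(self, dim: int = 64):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x):
        return self.blocks(x)

class TargetAwareDecoder(nn.Module):
    """Decoder that also receives raw unmasked pixels, so the encoder does
    not have to memorize target-specific information (design 2)."""
    def __init__(self, dim: int = 64, patch_dim: int = 48):
        super().__init__()
        self.pixel_proj = nn.Linear(patch_dim, dim)   # inject raw pixels
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=1)
        self.to_pixels = nn.Linear(dim, patch_dim)    # pixel regression head
        self.to_feats = nn.Linear(dim, dim)           # feature prediction head

    def forward(self, latent, pixels, mask):
        # latent: (B, N, D); pixels: (B, N, P); mask: (B, N), True = masked.
        keep = (~mask).unsqueeze(-1).float()
        ctx = self.pixel_proj(pixels) * keep          # only unmasked pixels
        h = self.blocks(latent + ctx)
        return self.to_pixels(h), self.to_feats(h)

def bootmae_step(encoder, momentum_encoder, decoder, embed,
                 patches, mask, m: float = 0.999):
    """One pretraining step. patches: (B, N, P) raw patch pixels."""
    keep = (~mask).unsqueeze(-1).float()
    # Simplification: masked tokens are zeroed here; MAE-style encoders
    # instead drop them and process only the visible patches.
    latent = encoder(embed(patches) * keep)

    with torch.no_grad():                    # bootstrapped feature targets
        target_feat = momentum_encoder(embed(patches))

    pred_pix, pred_feat = decoder(latent, patches, mask)
    err = ((pred_pix - patches) ** 2).mean(-1) \
        + ((pred_feat - target_feat) ** 2).mean(-1)   # (B, N)
    loss = (err * mask.float()).sum() / mask.float().sum().clamp(min=1)

    with torch.no_grad():                    # EMA update of momentum encoder
        for q, k in zip(encoder.parameters(), momentum_encoder.parameters()):
            k.mul_(m).add_(q, alpha=1 - m)
    return loss

# Toy usage with random data and a 75% masking ratio.
B, N, P, D = 2, 16, 48, 64
embed, enc, dec = nn.Linear(P, D), TinyEncoder(D), TargetAwareDecoder(D, P)
mom = copy.deepcopy(enc).requires_grad_(False)
patches, mask = torch.randn(B, N, P), torch.rand(B, N) < 0.75
bootmae_step(enc, mom, dec, embed, patches, mask).backward()
```

For brevity the sketch updates the EMA weights inside the step (in real training this would follow the optimizer step), and it uses a single mean-squared loss over masked positions; the paper's actual masking strategy, loss weighting, and architecture differ in detail.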
