Paper Title

Representation Learning by Detecting Incorrect Location Embeddings

Authors

Sepehr Sameni, Simon Jenni, Paolo Favaro

Abstract

In this paper, we introduce a novel self-supervised learning (SSL) loss for image representation learning. There is a growing belief that generalization in deep neural networks is linked to their ability to discriminate object shapes. Since object shape is related to the location of its parts, we propose to detect parts that have been artificially misplaced. We represent object parts with image tokens and train a ViT to detect which token has been combined with an incorrect positional embedding. We then introduce sparsity in the inputs to make the model more robust to occlusions and to speed up training. We call our method DILEMMA, which stands for Detection of Incorrect Location EMbeddings with MAsked inputs. We apply DILEMMA to MoCoV3, DINO, and SimCLR and show performance improvements of 4.41%, 3.97%, and 0.5%, respectively, under the same training time and with a linear probing transfer on ImageNet-1K. We also show full fine-tuning improvements of MAE combined with our method on ImageNet-100. We evaluate our method via fine-tuning on common SSL benchmarks. Moreover, we show that when downstream tasks are strongly reliant on shape (such as in the YOGA-82 pose dataset), our pre-trained features yield a significant gain over prior work.
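The pretext task described in the abstract, pairing some image tokens with a wrong positional embedding and training a per-token detector, can be sketched as follows. This is a minimal illustration in PyTorch, not the authors' implementation: the corruption fraction, the stand-in encoder, and all helper names (`corrupt_positions`, the binary head) are assumptions for exposition, and the additional input masking (the "MAsked inputs" part of DILEMMA) is omitted.

```python
import torch
import torch.nn as nn

def corrupt_positions(pos_embed: torch.Tensor, corrupt_frac: float = 0.2):
    """Reassign the positional embedding of a random fraction of tokens.

    pos_embed: (num_tokens, dim) positional embeddings.
    Returns the corrupted embeddings and a 0/1 label per token
    (1 = the token was given an incorrect position).
    """
    num_tokens, _ = pos_embed.shape
    labels = torch.zeros(num_tokens)
    num_corrupt = max(1, int(corrupt_frac * num_tokens))
    idx = torch.randperm(num_tokens)[:num_corrupt]
    # Shift each selected index by a nonzero random offset so the token
    # is guaranteed to receive the embedding of a *different* position.
    wrong = (idx + torch.randint(1, num_tokens, (num_corrupt,))) % num_tokens
    corrupted = pos_embed.clone()
    corrupted[idx] = pos_embed[wrong]
    labels[idx] = 1.0
    return corrupted, labels

# Stand-in token encoder and a per-token binary classification head.
encoder = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
head = nn.Linear(64, 1)

tokens = torch.randn(1, 49, 64)   # e.g. 7x7 grid of patch tokens
pos = torch.randn(49, 64)         # learnable in a real ViT

corrupted_pos, labels = corrupt_positions(pos)
logits = head(encoder(tokens + corrupted_pos)).squeeze(-1)  # (1, 49)
loss = nn.functional.binary_cross_entropy_with_logits(
    logits, labels.unsqueeze(0)
)
```

The binary cross-entropy loss over per-token logits is the natural choice here, since each token is independently either correctly or incorrectly placed; in DILEMMA this objective is added alongside the base SSL loss (MoCoV3, DINO, or SimCLR).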
