Paper Title


In-sample Contrastive Learning and Consistent Attention for Weakly Supervised Object Localization

Authors

Minsong Ki, Youngjung Uh, Wonyoung Lee, Hyeran Byun

Abstract


Weakly supervised object localization (WSOL) aims to localize the target object using only image-level supervision. Recent methods encourage the model to activate feature maps over the entire object by dropping the most discriminative parts. However, they are likely to induce excessive extension to the backgrounds, which leads to over-estimated localization. In this paper, we consider the background as an important cue that guides the feature activation to cover the sophisticated object region, and propose a contrastive attention loss. The loss promotes similarity between the foreground and its dropped version, and dissimilarity between the dropped version and the background. Furthermore, we propose a foreground consistency loss that penalizes earlier layers producing noisy attention, using the later layer as a reference that provides them with a sense of backgroundness. It guides the early layers to activate on objects rather than locally distinctive backgrounds, so that their attention becomes similar to that of the later layer. To better optimize the above losses, we replace channel-pooled attention with non-local attention blocks, which yield enhanced attention maps by taking spatial similarity into account. Last but not least, we propose to drop background regions in addition to the most discriminative region. Our method achieves state-of-the-art performance on the CUB-200-2011 and ImageNet benchmark datasets in terms of top-1 localization accuracy and MaxBoxAccV2, and we provide a detailed analysis of our individual components. The code will be publicly available online for reproducibility.
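To make the contrastive attention idea concrete, the following is a minimal NumPy sketch, not the authors' implementation: it pools features over the attention-derived foreground, its dropped version, and the background, then pulls the dropped foreground toward the full foreground and pushes it away from the background via cosine similarity. All function names, the threshold `thresh`, and the mask construction are illustrative assumptions.

```python
import numpy as np

def cosine_sim(a, b, eps=1e-8):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def contrastive_attention_loss(feat, attn, drop_mask, thresh=0.5):
    """Hypothetical sketch of a contrastive attention loss.

    feat:      (C, H, W) feature map
    attn:      (H, W) attention map in [0, 1]
    drop_mask: (H, W) mask, 0 where the most discriminative
               region is dropped, 1 elsewhere
    """
    fg = attn >= thresh                 # foreground from thresholded attention
    bg = ~fg                            # background is its complement
    dropped_fg = fg & (drop_mask > 0)   # foreground with peak region removed

    def pooled(mask):
        # Mask-weighted average pooling of the feature map -> (C,) vector.
        if mask.sum() == 0:
            return np.zeros(feat.shape[0])
        return (feat * mask).sum(axis=(1, 2)) / mask.sum()

    f_fg, f_drop, f_bg = pooled(fg), pooled(dropped_fg), pooled(bg)
    # Pull dropped foreground toward the full foreground,
    # push it away from the background.
    return (1 - cosine_sim(f_fg, f_drop)) + max(0.0, cosine_sim(f_drop, f_bg))
```

The similarity term encourages the dropped version to keep representing the object, while the dissimilarity term discourages the attention from leaking into the background, matching the behavior the abstract describes.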
