Paper Title
Data Augmentation Vision Transformer for Fine-grained Image Classification
Paper Authors
Paper Abstract
Recently, the vision transformer (ViT) has achieved breakthroughs in image recognition. Its multi-head self-attention (MSA) mechanism can extract discriminative token information from different image patches to improve classification accuracy. However, the classification token in its deep layers tends to ignore local features between layers. In addition, the embedding layer splits the input into fixed-size patches, which inevitably introduces additional image noise. To address these issues, we study a data augmentation vision transformer (DAVT) and propose an attention-cropping data augmentation method, which uses attention weights as a guide to crop images and improves the network's ability to learn critical features. Second, we also propose a hierarchical attention selection (HAS) method, which improves the network's ability to learn discriminative tokens across levels by filtering and fusing tokens between levels. Experimental results show that the accuracy of this method on two benchmark datasets, CUB-200-2011 and Stanford Dogs, surpasses existing mainstream methods, exceeding the original ViT by 1.4\% and 1.6\%, respectively.
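The abstract only summarizes the attention-cropping idea; a minimal toy sketch of how attention weights could guide a crop is given below. The function name `attention_crop`, the max-relative thresholding rule, and the patch-grid layout are our illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def attention_crop(image, attn, patch_grid, threshold=0.5):
    """Crop `image` (H, W, C) to the bounding box of patches whose
    CLS-to-patch attention is at least `threshold` * max attention.

    attn: 1-D array of length patch_grid**2 (attention weight per patch).
    patch_grid: patches per side (e.g. 14 for a 224px image, 16px patches).
    """
    h, w = image.shape[:2]
    attn_map = attn.reshape(patch_grid, patch_grid)
    mask = attn_map >= threshold * attn_map.max()
    rows, cols = np.where(mask)
    # Convert patch indices back to pixel coordinates.
    ph, pw = h // patch_grid, w // patch_grid
    top, bottom = rows.min() * ph, (rows.max() + 1) * ph
    left, right = cols.min() * pw, (cols.max() + 1) * pw
    return image[top:bottom, left:right]

# Toy example: attention concentrated on the central 2x2 block
# of a 4x4 patch grid over a 64x64 image.
img = np.arange(64 * 64 * 3).reshape(64, 64, 3)
attn = np.zeros(16)
attn[[5, 6, 9, 10]] = 1.0
crop = attention_crop(img, attn, patch_grid=4)  # crop.shape == (32, 32, 3)
```

In practice the attention weights would come from the ViT's own MSA layers (e.g. the CLS token's attention to each patch), and the crop would then be resized and fed back to the network as an augmented training sample.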