Paper Title
Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention
Paper Authors
Paper Abstract
Multi-scale representations are crucial for semantic segmentation. The community has witnessed the flourishing of semantic segmentation convolutional neural networks (CNNs) that exploit multi-scale contextual information. Motivated by the strength of the vision transformer (ViT) in image classification, several semantic segmentation ViTs have recently been proposed, most of which attain impressive results but at the cost of computational efficiency. In this paper, we successfully introduce multi-scale representations into the semantic segmentation ViT via a window attention mechanism, further improving both performance and efficiency. To this end, we introduce large window attention, which allows a local window to query a larger area of the context window at only a small computational overhead. By regulating the ratio of the context area to the query area, we enable the $\textit{large window attention}$ to capture contextual information at multiple scales. Moreover, the framework of spatial pyramid pooling is adopted to collaborate with $\textit{the large window attention}$, which yields a novel decoder named $\textbf{la}$rge $\textbf{win}$dow attention spatial pyramid pooling (LawinASPP) for the semantic segmentation ViT. Our resulting ViT, Lawin Transformer, is composed of an efficient hierarchical vision transformer (HVT) as the encoder and LawinASPP as the decoder. The empirical results demonstrate that Lawin Transformer offers improved efficiency compared to existing methods. Lawin Transformer further sets new state-of-the-art performance on the Cityscapes (84.4% mIoU), ADE20K (56.2% mIoU), and COCO-Stuff datasets. The code will be released at https://github.com/yan-hao-tian/lawin
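
The following is a minimal PyTorch sketch of the large window attention described above, under two assumptions not spelled out in the abstract: the context window is average-pooled back to the query-window size so the attention cost stays close to that of plain window attention, and the class and parameter names (`LargeWindowAttention`, `patch_size`, `ratio`) are illustrative rather than the authors' exact API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LargeWindowAttention(nn.Module):
    """Sketch: each non-overlapping P x P query window cross-attends to an
    R*P x R*P context window that is average-pooled down to P x P tokens,
    so the attention matrix stays (P*P) x (P*P) regardless of the ratio R."""

    def __init__(self, dim, num_heads=4, patch_size=8, ratio=2):
        super().__init__()  # dim must be divisible by num_heads
        self.patch_size, self.ratio = patch_size, ratio
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):  # x: (B, C, H, W), H and W divisible by patch_size
        B, C, H, W = x.shape
        P, R = self.patch_size, self.ratio
        N = (H // P) * (W // P)  # number of query windows
        # Query tokens: non-overlapping P x P windows.
        q = F.unfold(x, kernel_size=P, stride=P)          # (B, C*P*P, N)
        q = q.transpose(1, 2).reshape(B * N, C, P * P).transpose(1, 2)
        # Context tokens: an R*P x R*P region centered on each query window,
        # pooled to P x P to cap the cost (assumption, see lead-in).
        pad = (R * P - P) // 2
        ctx = F.unfold(x, kernel_size=R * P, stride=P, padding=pad)
        ctx = ctx.transpose(1, 2).reshape(B * N, C, R * P, R * P)
        ctx = F.adaptive_avg_pool2d(ctx, P)               # (B*N, C, P, P)
        ctx = ctx.flatten(2).transpose(1, 2)              # (B*N, P*P, C)
        out, _ = self.attn(q, ctx, ctx)                   # cross-attention
        out = out.transpose(1, 2).reshape(B, N, C * P * P).transpose(1, 2)
        return F.fold(out, (H, W), kernel_size=P, stride=P)
```

Following the spatial pyramid pooling pattern named in the abstract, a LawinASPP-style head would run several such branches with different context-to-query ratios in parallel and fuse them. The specific ratios (2, 4, 8), the shortcut branch, and the global-pooling branch below are assumptions in the spirit of ASPP, not confirmed details of the paper's decoder.

```python
class LawinASPPHead(nn.Module):
    """Sketch of an ASPP-style decoder head built from LargeWindowAttention."""

    def __init__(self, dim, num_classes, ratios=(2, 4, 8), patch_size=8):
        super().__init__()
        self.branches = nn.ModuleList(
            LargeWindowAttention(dim, patch_size=patch_size, ratio=r)
            for r in ratios)
        self.image_pool = nn.Sequential(  # global-context branch
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(dim, dim, 1), nn.ReLU())
        # Fuse the shortcut, the attention branches, and the pooled branch.
        self.fuse = nn.Conv2d(dim * (len(ratios) + 2), dim, 1)
        self.classify = nn.Conv2d(dim, num_classes, 1)

    def forward(self, x):  # x: (B, C, H, W)
        feats = [x] + [branch(x) for branch in self.branches]
        g = self.image_pool(x).expand(-1, -1, x.shape[2], x.shape[3])
        return self.classify(self.fuse(torch.cat(feats + [g], dim=1)))

# Example: a 19-class head (e.g. Cityscapes) on a 64 x 64 feature map.
head = LawinASPPHead(dim=64, num_classes=19)
logits = head(torch.randn(2, 64, 64, 64))  # -> (2, 19, 64, 64)
```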