Paper Title
Self-attention on Multi-Shifted Windows for Scene Segmentation
Paper Authors
Paper Abstract
Scene segmentation in images is a fundamental yet challenging problem in visual content understanding: the task is to learn a model that assigns every image pixel a categorical label. One challenge of this learning task is capturing both spatial and semantic relationships to obtain descriptive feature representations, so learning feature maps at multiple scales is common practice in scene segmentation. In this paper, we explore the effective use of self-attention within multi-scale image windows to learn descriptive visual features, and we propose three different strategies for aggregating these feature maps to decode the feature representation for dense prediction. Our design builds on the recently proposed Swin Transformer model, which discards convolution operations entirely. With this simple yet effective multi-scale feature learning and aggregation, our models achieve very promising performance on four public scene segmentation datasets: PASCAL VOC2012, COCO-Stuff 10K, ADE20K, and Cityscapes.
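As a rough illustration of the idea described in the abstract, the sketch below runs window-partitioned self-attention at several window sizes and aggregates the resulting feature maps by simple summation. This is a minimal PyTorch sketch under stated assumptions: the class names, the window sizes, and summation-based aggregation are illustrative choices (the paper proposes three aggregation strategies, not specified here), and the shifted-window mechanism of Swin Transformer is omitted for brevity.

```python
import torch
import torch.nn as nn


class WindowAttention(nn.Module):
    """Multi-head self-attention within non-overlapping windows of one size
    (hypothetical simplification; shifted windows are omitted)."""

    def __init__(self, dim, window_size, num_heads=4):
        super().__init__()
        self.window_size = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, H, W, C); H and W are assumed divisible by window_size.
        B, H, W, C = x.shape
        s = self.window_size
        # Partition the feature map into (B * num_windows, s*s, C) sequences.
        windows = (x.view(B, H // s, s, W // s, s, C)
                    .permute(0, 1, 3, 2, 4, 5)
                    .reshape(-1, s * s, C))
        # Self-attention among the tokens of each window independently.
        out, _ = self.attn(windows, windows, windows)
        # Reverse the partition back to a (B, H, W, C) feature map.
        return (out.view(B, H // s, W // s, s, s, C)
                   .permute(0, 1, 3, 2, 4, 5)
                   .reshape(B, H, W, C))


class MultiWindowAggregation(nn.Module):
    """Applies window attention at several window sizes (multi-scale windows)
    and aggregates the feature maps; summation is one plausible strategy."""

    def __init__(self, dim, window_sizes=(4, 8, 16)):
        super().__init__()
        self.branches = nn.ModuleList(
            WindowAttention(dim, s) for s in window_sizes)

    def forward(self, x):
        # Each branch attends within windows of a different size;
        # the per-scale feature maps are summed element-wise.
        return torch.stack([branch(x) for branch in self.branches]).sum(dim=0)


feats = torch.randn(1, 32, 32, 96)        # e.g. a (B, H, W, C) feature map
out = MultiWindowAggregation(96)(feats)   # same shape, multi-scale context
```

Summation keeps the channel dimension fixed across scales; concatenation followed by a projection would be an equally reasonable aggregation choice under the same assumptions.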