Paper Title

Towards Robust Referring Image Segmentation

Paper Authors

Jianzong Wu, Xiangtai Li, Xia Li, Henghui Ding, Yunhai Tong, Dacheng Tao

Abstract

Referring Image Segmentation (RIS) is a fundamental vision-language task that outputs object masks based on text descriptions. Many works have achieved considerable progress on RIS, including different fusion method designs. In this work, we explore an essential question, ``What if the text description is wrong or misleading?'' For example, the described objects are not in the image. We term such a sentence a negative sentence. However, existing solutions for RIS cannot handle such a setting. To this end, we propose a new formulation of RIS, named Robust Referring Image Segmentation (R-RIS). It considers negative sentence inputs besides the regular positive text inputs. To facilitate this new task, we create three R-RIS datasets by augmenting existing RIS datasets with negative sentences, and propose new metrics to evaluate both types of inputs in a unified manner. Furthermore, we propose a new transformer-based model, called RefSegformer, with a token-based vision and language fusion module. Our design can be easily extended to the R-RIS setting by adding extra blank tokens. Our proposed RefSegformer achieves state-of-the-art results on both RIS and R-RIS datasets, establishing a solid baseline for both settings. Our project page is at \url{https://github.com/jianzongwu/robust-ref-seg}.
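The abstract describes extending token-based vision-language fusion with extra "blank" tokens so the model can signal that a sentence refers to nothing in the image. The minimal PyTorch sketch below illustrates one way such a mechanism could work; the module name, dimensions, and the use of a single cross-attention layer are illustrative assumptions, not the authors' actual RefSegformer implementation.

```python
import torch
import torch.nn as nn


class BlankTokenFusion(nn.Module):
    """Hypothetical sketch: language tokens cross-attend to visual tokens
    plus learnable blank tokens. A negative (non-matching) sentence is
    expected to place most attention mass on the blank tokens."""

    def __init__(self, dim: int = 256, num_blank: int = 1, num_heads: int = 8):
        super().__init__()
        # Learnable blank tokens appended to the visual token sequence.
        self.blank = nn.Parameter(torch.zeros(1, num_blank, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.num_blank = num_blank

    def forward(self, vis_tokens: torch.Tensor, lang_tokens: torch.Tensor):
        # vis_tokens: (B, N, C) flattened image features
        # lang_tokens: (B, L, C) text features
        B = vis_tokens.size(0)
        keys = torch.cat([vis_tokens, self.blank.expand(B, -1, -1)], dim=1)
        fused, attn_w = self.attn(lang_tokens, keys, keys)
        # Mean attention mass on the blank tokens over all language queries;
        # a high value suggests the described object is absent.
        blank_score = attn_w[..., -self.num_blank:].mean(dim=(1, 2))
        return fused, blank_score
```

In this sketch, `blank_score` lies in [0, 1] because attention weights are softmax-normalized over the key sequence; thresholding it would give a simple "no-target" decision for the R-RIS setting.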
