Paper Title

What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs

Paper Authors

Tal Shaharabany, Yoad Tewel, Lior Wolf

Paper Abstract

Given an input image, and nothing else, our method returns the bounding boxes of objects in the image and phrases that describe the objects. This is achieved within an open world paradigm, in which the objects in the input image may not have been encountered during the training of the localization mechanism. Moreover, training takes place in a weakly supervised setting, where no bounding boxes are provided. To achieve this, our method combines two pre-trained networks: the CLIP image-to-text matching score and the BLIP image captioning tool. Training takes place on COCO images and their captions and is based on CLIP. Then, during inference, BLIP is used to generate a hypothesis regarding various regions of the current image. Our work generalizes weakly supervised segmentation and phrase grounding and is shown empirically to outperform the state of the art in both domains. It also shows very convincing results in the novel task of weakly-supervised open-world purely visual phrase-grounding presented in our work. For example, on the datasets used for benchmarking phrase-grounding, our method results in a very modest degradation in comparison to methods that employ human captions as an additional input. Our code is available at https://github.com/talshaharabany/what-is-where-by-looking and a live demo can be found at https://replicate.com/talshaharabany/what-is-where-by-looking.
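The abstract describes an inference flow in which BLIP proposes phrases for candidate image regions and CLIP scores how well each phrase matches its region. Below is a minimal, hypothetical Python sketch of that idea using the public Hugging Face checkpoints for CLIP and BLIP. It is not the authors' implementation: `propose_regions` is a stand-in for the paper's trained, weakly-supervised localization network, and the model names, cropping strategy, and scoring are assumptions made for illustration.

```python
# Hypothetical sketch of the inference idea only, NOT the paper's code:
# caption candidate regions with BLIP, then score each (region, phrase)
# pair with CLIP and rank the results.

import torch
from PIL import Image
from transformers import (
    BlipForConditionalGeneration, BlipProcessor,
    CLIPModel, CLIPProcessor,
)

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
blip = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base").eval()
blip_proc = BlipProcessor.from_pretrained(
    "Salesforce/blip-image-captioning-base")


def propose_regions(image: Image.Image):
    """Placeholder for the paper's localization network.

    Here we simply tile the image into a few crude boxes (x0, y0, x1, y1);
    the actual method produces boxes from a weakly-supervised model."""
    w, h = image.size
    return [(0, 0, w // 2, h), (w // 2, 0, w, h), (0, 0, w, h)]


@torch.no_grad()
def what_is_where(image: Image.Image):
    results = []
    for box in propose_regions(image):
        crop = image.crop(box)
        # BLIP: hypothesize a phrase describing this region.
        cap_ids = blip.generate(**blip_proc(images=crop, return_tensors="pt"))
        phrase = blip_proc.decode(cap_ids[0], skip_special_tokens=True)
        # CLIP: score how well the phrase matches the region.
        inputs = clip_proc(text=[phrase], images=crop,
                           return_tensors="pt", padding=True)
        score = clip(**inputs).logits_per_image.item()
        results.append((box, phrase, score))
    # Highest-scoring (box, phrase) pairs first.
    return sorted(results, key=lambda r: -r[2])


if __name__ == "__main__":
    for box, phrase, score in what_is_where(
            Image.open("example.jpg").convert("RGB")):
        print(box, phrase, round(score, 2))
```

In the actual method, training on COCO images and captions with a CLIP-based objective is what produces the localization network; the sketch above only mimics the purely visual inference step, where no text input is required.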
