Paper Title
DALL-E for Detection: Language-driven Compositional Image Synthesis for Object Detection
Paper Authors
Paper Abstract
We propose a new paradigm to automatically generate training data with accurate labels at scale using text-to-image synthesis frameworks (e.g., DALL-E, Stable Diffusion, etc.). The proposed approach decouples training data generation into foreground object mask generation and background (context) image generation. For foreground object mask generation, we use a simple textual template with the object class name as input to DALL-E to generate a diverse set of foreground images. A foreground-background segmentation algorithm is then used to generate foreground object masks. Next, to generate context images, a language description of the context is first produced by applying an image captioning method to a small set of images representing the context. These language descriptions are then used to generate diverse sets of context images with the DALL-E framework. These are then composited with the object masks generated in the first step to provide an augmented training set for a classifier. We demonstrate the advantages of our approach on four object detection datasets, including the Pascal VOC and COCO object detection tasks. Furthermore, we highlight the compositional nature of our data generation approach in out-of-distribution and zero-shot data generation scenarios.
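The sketch below, which is not part of the paper, illustrates one possible shape of the pipeline the abstract describes: generate a foreground image from a class-name template, extract its mask, caption a real context image, synthesize a background from that caption, and composite the two. The functions `text_to_image`, `caption_image`, and `segment_foreground` are hypothetical stand-ins for whatever generator (e.g., DALL-E or Stable Diffusion), captioner, and segmentation model one plugs in; only the PIL compositing call is a real API.

```python
from PIL import Image

# Hypothetical component interfaces (assumptions, not the authors' code).
def text_to_image(prompt: str) -> Image.Image:
    """Assumed text-to-image generator (e.g., DALL-E, Stable Diffusion)."""
    raise NotImplementedError

def caption_image(image: Image.Image) -> str:
    """Assumed image captioning model."""
    raise NotImplementedError

def segment_foreground(image: Image.Image) -> Image.Image:
    """Assumed foreground-background segmenter; returns an 'L'-mode mask
    (255 = foreground object, 0 = background)."""
    raise NotImplementedError

def generate_training_example(class_name: str,
                              context_example: Image.Image,
                              size=(512, 512)):
    # Step 1: foreground generation from a simple class-name text template.
    fg = text_to_image(f"a photo of a {class_name}").resize(size)
    mask = segment_foreground(fg)  # object mask doubles as the label

    # Step 2: context generation -- caption a real context image, then
    # synthesize a diverse background from that language description.
    caption = caption_image(context_example)
    bg = text_to_image(caption).resize(size)

    # Step 3: composite the foreground onto the synthetic background.
    # Image.composite takes fg where mask is 255 and bg where mask is 0.
    composite = Image.composite(fg, bg, mask)
    return composite, mask
```

Because the foreground and context streams are generated independently, any object template can be paired with any context description, which is the compositional property the abstract exploits for out-of-distribution and zero-shot data generation.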