Paper Title
Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation
Paper Authors
Paper Abstract
Large-scale text-to-image generative models have been a revolutionary breakthrough in the evolution of generative AI, allowing us to synthesize diverse images that convey highly complex visual concepts. However, a pivotal challenge in leveraging such models for real-world content creation tasks is providing users with control over the generated content. In this paper, we present a new framework that takes text-to-image synthesis to the realm of image-to-image translation -- given a guidance image and a target text prompt, our method harnesses the power of a pre-trained text-to-image diffusion model to generate a new image that complies with the target text while preserving the semantic layout of the source image. Specifically, we observe and empirically demonstrate that fine-grained control over the generated structure can be achieved by manipulating spatial features and their self-attention inside the model. This results in a simple and effective approach, where features extracted from the guidance image are directly injected into the generation process of the target image, requiring no training or fine-tuning and applicable to both real and generated guidance images. We demonstrate high-quality results on versatile text-guided image translation tasks, including translating sketches, rough drawings, and animations into realistic images, changing the class and appearance of objects in a given image, and modifying global qualities such as lighting and color.
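The abstract describes injecting intermediate features from the guidance image's denoising pass into the target image's generation pass. The snippet below is a minimal sketch of that injection mechanism using PyTorch forward hooks; the `ToyUNetBlock` module, its layer names, and the tensor shapes are illustrative assumptions standing in for a block of a pre-trained diffusion UNet, not the paper's actual implementation.

```python
# Minimal sketch: capture spatial features from a "guidance" pass and inject
# them into a "target" pass via PyTorch forward hooks. ToyUNetBlock is a toy
# stand-in for a UNet decoder block (hypothetical, for illustration only).
import torch
import torch.nn as nn

class ToyUNetBlock(nn.Module):
    """Stand-in for one block of a diffusion UNet decoder."""
    def __init__(self, channels: int = 8):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, 3, padding=1)  # "spatial features"
        self.out = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        return self.out(torch.relu(self.spatial(x)))

block = ToyUNetBlock()
captured = {}

# 1) Guidance pass: record the intermediate spatial features of the guidance image.
def capture_hook(module, inputs, output):
    captured["feat"] = output.detach()

handle = block.spatial.register_forward_hook(capture_hook)
guidance_latent = torch.randn(1, 8, 16, 16)  # toy latent for the guidance image
_ = block(guidance_latent)
handle.remove()

# 2) Target pass: override the same layer's output with the stored guidance
#    features, so the generated structure follows the guidance image while the
#    rest of the network (conditioned on the target prompt) controls appearance.
def inject_hook(module, inputs, output):
    return captured["feat"]

handle = block.spatial.register_forward_hook(inject_hook)
target_latent = torch.randn(1, 8, 16, 16)  # toy latent for the target generation
injected_out = block(target_latent)
handle.remove()
```

In the setting the abstract describes, the analogous capture-and-inject step would be applied to spatial features and self-attention inside a pre-trained text-to-image diffusion model across its denoising steps, with no training or fine-tuning of the model itself.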