Paper Title

eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

Paper Authors

Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Qinsheng Zhang, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, Tero Karras, Ming-Yu Liu

Paper Abstract

Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis. Starting from random noise, such text-to-image diffusion models gradually synthesize images in an iterative fashion while conditioning on text prompts. We find that their synthesis behavior qualitatively changes throughout this process: Early in sampling, generation strongly relies on the text prompt to generate text-aligned content, while later, the text conditioning is almost entirely ignored. This suggests that sharing model parameters throughout the entire generation process may not be ideal. Therefore, in contrast to existing works, we propose to train an ensemble of text-to-image diffusion models specialized for different synthesis stages. To maintain training efficiency, we initially train a single model, which is then split into specialized models that are trained for the specific stages of the iterative generation process. Our ensemble of diffusion models, called eDiff-I, results in improved text alignment while maintaining the same inference computation cost and preserving high visual quality, outperforming previous large-scale text-to-image diffusion models on the standard benchmark. In addition, we train our model to exploit a variety of embeddings for conditioning, including the T5 text, CLIP text, and CLIP image embeddings. We show that these different embeddings lead to different behaviors. Notably, the CLIP image embedding allows an intuitive way of transferring the style of a reference image to the target text-to-image output. Lastly, we show a technique that enables eDiff-I's "paint-with-words" capability. A user can select the word in the input text and paint it in a canvas to control the output, which is very handy for crafting the desired image in mind. The project page is available at https://deepimagination.cc/eDiff-I/
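To make the core idea concrete, below is a minimal, hypothetical Python/NumPy sketch of sampling with an ensemble of expert denoisers: each denoising step is routed to the expert whose noise-level range contains the current noise scale, so a high-noise expert handles the early, text-reliant stage and a low-noise expert handles the later stage. The class names, two-expert split, noise schedule, and update rule are illustrative assumptions, not the authors' eDiff-I implementation.

```python
# Hypothetical sketch (not the authors' code): route each sampling step to a
# denoiser specialized for the current noise level, as described in the abstract.
import numpy as np


class ExpertDenoiser:
    """Stand-in for a denoising network specialized for one noise-level range."""

    def __init__(self, name, sigma_min, sigma_max):
        self.name = name
        self.sigma_min = sigma_min
        self.sigma_max = sigma_max

    def covers(self, sigma):
        return self.sigma_min <= sigma < self.sigma_max

    def denoise(self, x, sigma, text_embedding):
        # A real expert would be a U-Net conditioned on T5/CLIP embeddings.
        # Here we return a dummy estimate of the clean image (placeholder only).
        return x / (1.0 + sigma)


def sample_with_experts(experts, text_embedding, shape=(64, 64, 3), steps=30, seed=0):
    """Simple sampler that picks the expert covering the current sigma:
    high-sigma expert early in sampling, low-sigma expert late."""
    rng = np.random.default_rng(seed)
    sigmas = np.geomspace(80.0, 0.02, steps)      # assumed noise schedule
    x = rng.standard_normal(shape) * sigmas[0]    # start from pure noise

    for i, sigma in enumerate(sigmas):
        expert = next(e for e in experts if e.covers(sigma))
        x0_hat = expert.denoise(x, sigma, text_embedding)
        sigma_next = sigmas[i + 1] if i + 1 < len(sigmas) else 0.0
        # Step toward the predicted clean image, keeping sigma_next noise.
        x = x0_hat + (x - x0_hat) * (sigma_next / sigma)

    return x


if __name__ == "__main__":
    # Two-expert split: the abstract reports that text conditioning matters
    # mostly early in sampling, so the high-noise expert is the text-reliant one.
    experts = [
        ExpertDenoiser("low-noise expert", 0.0, 5.0),
        ExpertDenoiser("high-noise expert", 5.0, float("inf")),
    ]
    dummy_text_embedding = np.zeros(768)          # placeholder embedding
    image = sample_with_experts(experts, dummy_text_embedding)
    print("sampled image stats:", image.mean(), image.std())
```

Because the experts only differ in which noise interval they are trained on, inference runs exactly one denoiser per step, which is why the ensemble keeps the same inference cost as a single shared model.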
