Paper Title

Hierarchical Text-Conditional Image Generation with CLIP Latents

Paper Authors

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen

Paper Abstract

Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.
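
To make the two-stage structure described in the abstract concrete, here is a minimal, hypothetical sketch of the pipeline: a prior maps a caption's CLIP text embedding to a predicted CLIP image embedding, and a decoder generates an image conditioned on that embedding. The classes `ToyPrior`, `ToyDecoder`, the `generate` helper, and the `EMBED_DIM` constant below are illustrative stand-ins, not the authors' implementation; in the actual paper the prior is an autoregressive or diffusion model and the decoder is a diffusion model.

```python
# A minimal sketch of the two-stage text-to-image pipeline (prior -> decoder).
# All module definitions here are toy placeholders for illustration only.
import torch
import torch.nn as nn

EMBED_DIM = 512  # assumed CLIP embedding size, for illustration


class ToyPrior(nn.Module):
    """Stage 1 (stand-in): map a CLIP text embedding to a CLIP image embedding."""
    def __init__(self, dim: int = EMBED_DIM):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        return self.net(text_emb)


class ToyDecoder(nn.Module):
    """Stage 2 (stand-in): generate an image conditioned on a CLIP image embedding."""
    def __init__(self, dim: int = EMBED_DIM, image_size: int = 64):
        super().__init__()
        self.image_size = image_size
        self.net = nn.Linear(dim, 3 * image_size * image_size)

    def forward(self, image_emb: torch.Tensor) -> torch.Tensor:
        out = self.net(image_emb)
        return out.view(-1, 3, self.image_size, self.image_size)


def generate(text_emb: torch.Tensor) -> torch.Tensor:
    """Two-stage sampling: the prior predicts an image embedding, the decoder renders it."""
    prior, decoder = ToyPrior(), ToyDecoder()
    image_emb = prior(text_emb)   # caption embedding -> predicted image embedding
    return decoder(image_emb)     # image embedding -> image


if __name__ == "__main__":
    fake_text_emb = torch.randn(1, EMBED_DIM)  # stands in for a real CLIP text embedding
    img = generate(fake_text_emb)
    print(img.shape)  # torch.Size([1, 3, 64, 64])
```

Because the decoder sees only the image embedding, sampling it multiple times from the same embedding yields the image variations mentioned in the abstract: semantics and style are preserved while non-essential details vary.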
