Draft-and-Revise: Effective Image Generation with Contextual RQ-Transformer

Paper title

Draft-and-Revise: Effective Image Generation with Contextual RQ-Transformer

Authors

Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, Wook-Shin Han

Abstract

Although autoregressive models have achieved promising results on image generation, their unidirectional generation process prevents the resultant images from fully reflecting global contexts. To address the issue, we propose an effective image generation framework of Draft-and-Revise with Contextual RQ-Transformer to consider global contexts during the generation process. As a generalized VQ-VAE, RQ-VAE first represents a high-resolution image as a sequence of discrete code stacks. After code stacks in the sequence are randomly masked, Contextual RQ-Transformer is trained to infill the masked code stacks based on the unmasked contexts of the image. Then, Contextual RQ-Transformer uses our two-phase decoding, Draft-and-Revise, and generates an image, while exploiting the global contexts of the image during the generation process. Specifically, in the draft phase, our model first focuses on generating diverse images despite rather low quality. Then, in the revise phase, the model iteratively improves the quality of images, while preserving the global contexts of generated images. In experiments, our method achieves state-of-the-art results on conditional image generation. We also validate that the Draft-and-Revise decoding can achieve high performance by effectively controlling the quality-diversity trade-off in image generation.
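
The two-phase decoding described in the abstract can be illustrated with a small Python sketch. This is one plausible reading, not the authors' implementation: the abstract does not say how positions are chosen at each step, so the random partitioning, the parameter names (`draft_steps`, `revise_steps`, `revise_rounds`), and the helper `toy_infill` are all assumptions made for illustration. In particular, `toy_infill` samples codes uniformly at random, whereas the trained Contextual RQ-Transformer would condition on the unmasked code stacks.

```python
import random

MASK = None  # placeholder for a masked code stack


def toy_infill(codes, positions, depth, codebook_size, rng):
    """Hypothetical stand-in for the trained model: return one code stack
    per requested position. A real model would condition on the unmasked
    entries of `codes`; this toy version samples uniformly at random."""
    return {p: tuple(rng.randrange(codebook_size) for _ in range(depth))
            for p in positions}


def random_partition(n, parts, rng):
    """Split positions 0..n-1 into `parts` disjoint, shuffled subsets."""
    order = list(range(n))
    rng.shuffle(order)
    return [order[i::parts] for i in range(parts)]


def draft_and_revise(n, depth=4, codebook_size=16, draft_steps=4,
                     revise_steps=2, revise_rounds=2, seed=0,
                     infill=toy_infill):
    rng = random.Random(seed)
    codes = [MASK] * n

    # Draft phase (assumed scheme): start fully masked and fill one
    # subset of positions at a time, using what is already filled as context.
    for subset in random_partition(n, draft_steps, rng):
        new = infill(codes, subset, depth, codebook_size, rng)
        for p, stack in new.items():
            codes[p] = stack

    # Revise phase (assumed scheme): repeatedly re-mask a subset and
    # regenerate it, keeping every other position as global context.
    for _ in range(revise_rounds):
        for subset in random_partition(n, revise_steps, rng):
            for p in subset:
                codes[p] = MASK
            new = infill(codes, subset, depth, codebook_size, rng)
            for p, stack in new.items():
                codes[p] = stack
    return codes


codes = draft_and_revise(n=16)
```

Here `codes` holds 16 code stacks of depth 4, standing in for the sequence that RQ-VAE would decode back into an image. Passing a different `infill` callable is where a trained model would plug in.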

Paper title