Paper Title


DATID-3D: Diversity-Preserved Domain Adaptation Using Text-to-Image Diffusion for 3D Generative Model

Authors

Gwanghyun Kim, Se Young Chun

Abstract


Recent 3D generative models have achieved remarkable performance in synthesizing high-resolution, photorealistic images with view consistency and detailed 3D shapes, but training them for diverse domains is challenging since it requires massive training images and their camera distribution information. Text-guided domain adaptation methods have shown impressive performance in converting a 2D generative model on one domain into models on other domains with different styles by leveraging CLIP (Contrastive Language-Image Pre-training), rather than collecting massive datasets for those domains. However, one drawback of these methods is that the sample diversity of the original generative model is not well preserved in the domain-adapted generative models due to the deterministic nature of the CLIP text encoder. Text-guided domain adaptation is even more challenging for 3D generative models, not only because of catastrophic diversity loss, but also because of inferior text-image correspondence and poor image quality. Here we propose DATID-3D, a domain adaptation method tailored for 3D generative models that uses text-to-image diffusion models capable of synthesizing diverse images per text prompt, without collecting additional images and camera information for the target domain. Unlike 3D extensions of prior text-guided domain adaptation methods, our novel pipeline fine-tunes the state-of-the-art 3D generator of the source domain to synthesize high-resolution, multi-view-consistent images in text-specified target domains without additional data, outperforming the existing text-guided domain adaptation methods in diversity and text-image correspondence. Furthermore, we propose and demonstrate diverse 3D image manipulations such as one-shot instance-selected adaptation and single-view manipulated 3D reconstruction to take full advantage of the diversity afforded by text prompts.
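To make the pipeline description above more concrete, below is a minimal sketch, under stated assumptions, of how pose-labeled target-domain training pairs could be produced with a public text-to-image diffusion library; it is not the authors' implementation. The `render(z, pose)` interface and `z_dim` attribute of the 3D generator, the `pose_sampler` callable, and the Stable Diffusion checkpoint name are illustrative assumptions.

```python
# Minimal sketch (assumption: NOT the authors' released code) of the core
# DATID-3D data-generation step: render images from a pretrained source-domain
# 3D generator with known camera poses, then translate each render into the
# text-specified target domain with an off-the-shelf img2img diffusion model.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from torchvision.transforms.functional import to_pil_image


def build_target_domain_dataset(generator, pose_sampler, prompt,
                                n_samples=1000, strength=0.7, device="cuda"):
    """Translate source-domain renders into the domain described by `prompt`.

    `generator` is assumed to expose a hypothetical `render(z, pose)` method
    returning a [1, 3, H, W] tensor in [0, 1] and a `z_dim` attribute (an
    EG3D-like interface); `pose_sampler()` is a user-supplied camera-pose sampler.
    """
    # Checkpoint name is illustrative; any Stable Diffusion img2img checkpoint works.
    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to(device)

    dataset = []  # list of (target-domain PIL image, camera pose) pairs
    for _ in range(n_samples):
        z = torch.randn(1, generator.z_dim, device=device)   # random latent code
        pose = pose_sampler()                                 # camera pose for this sample
        src = generator.render(z, pose)                       # source-domain render
        src_pil = to_pil_image(src[0].clamp(0, 1).float().cpu())

        # img2img keeps the coarse structure and pose of the render while
        # re-texturing it to match the text prompt, so the known camera pose
        # stays (approximately) valid for the translated image.
        tgt = pipe(prompt=prompt, image=src_pil,
                   strength=strength, guidance_scale=7.5).images[0]
        dataset.append((tgt, pose))
    return dataset
```

With such a pose-labeled target-domain dataset, the source 3D generator can then be fine-tuned with a standard adversarial objective, which is what allows the adapted model to remain multi-view consistent without collecting any real images or camera information for the target domain.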
