Paper Title
Learning Dynamic Facial Radiance Fields for Few-Shot Talking Head Synthesis
Paper Authors
Paper Abstract
Talking head synthesis is an emerging technology with wide applications in film dubbing, virtual avatars, and online education. Recent NeRF-based methods generate more natural talking videos, as they better capture the 3D structural information of faces. However, a specific model needs to be trained for each identity with a large dataset. In this paper, we propose Dynamic Facial Radiance Fields (DFRF) for few-shot talking head synthesis, which can rapidly generalize to an unseen identity with little training data. Unlike existing NeRF-based methods, which directly encode the 3D geometry and appearance of a specific person into the network, our DFRF conditions the face radiance field on 2D appearance images to learn a face prior. The facial radiance field can thus be flexibly adjusted to a new identity with only a few reference images. Additionally, to better model facial deformations, we propose a differentiable face warping module conditioned on audio signals that deforms all reference images to the query space. Extensive experiments show that with only a tens-of-seconds training clip available, our proposed DFRF can synthesize natural, high-quality audio-driven talking head videos for novel identities within only 40k iterations. We highly recommend that readers view our supplementary video for intuitive comparisons. Code is available at https://sstzal.github.io/DFRF/.
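The abstract names two components: a radiance field conditioned on features drawn from 2D reference images (so the model can adapt to unseen identities) and a differentiable, audio-conditioned warping module that deforms reference images into the query space. The PyTorch sketch below illustrates how such conditioning might be wired together; all module names, feature dimensions, and the positional-encoding setup are illustrative assumptions on our part, not the paper's released implementation.

```python
# Minimal sketch (assumed layout, not the official DFRF code): an
# audio-conditioned warp over reference-image coordinates, and a NeRF-style
# MLP whose output is conditioned on appearance features and audio.
import torch
import torch.nn as nn


def positional_encoding(x: torch.Tensor, num_freqs: int = 10) -> torch.Tensor:
    """Standard NeRF-style sinusoidal encoding of 3D sample points."""
    freqs = 2.0 ** torch.arange(num_freqs, device=x.device) * torch.pi
    angles = x[..., None] * freqs              # (..., 3, num_freqs)
    enc = torch.cat([angles.sin(), angles.cos()], dim=-1)
    return enc.flatten(-2)                     # (..., 3 * 2 * num_freqs)


class AudioConditionedWarp(nn.Module):
    """Predicts a 2D offset that deforms reference-image coordinates toward
    the query space, conditioned on an audio feature (hypothetical sizes)."""

    def __init__(self, audio_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 + audio_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),              # per-pixel (dx, dy) offset
        )

    def forward(self, coords: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # coords: (N, 2) normalized reference coords; audio: (N, audio_dim).
        # Adding a predicted offset keeps the warp fully differentiable.
        return coords + self.mlp(torch.cat([coords, audio], dim=-1))


class ConditionalRadianceField(nn.Module):
    """NeRF-style MLP whose color/density also depend on appearance features
    sampled from the (warped) reference images, so the field can be adjusted
    to a new identity instead of memorizing one person."""

    def __init__(self, feat_dim: int = 32, audio_dim: int = 64,
                 pos_dim: int = 60, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(pos_dim + feat_dim + audio_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),              # (RGB, density) per point
        )

    def forward(self, pts: torch.Tensor, ref_feats: torch.Tensor,
                audio: torch.Tensor):
        # pts: (N, 3) sample points along camera rays;
        # ref_feats: (N, feat_dim) appearance features from reference images;
        # audio: (N, audio_dim) per-frame audio features.
        h = torch.cat([positional_encoding(pts), ref_feats, audio], dim=-1)
        out = self.mlp(h)
        return torch.sigmoid(out[..., :3]), torch.relu(out[..., 3:])
```

In this sketch, `ref_feats` stands in for features that a 2D encoder would extract from the warped reference images and sample at each point's projection; here it is just a placeholder input, since the abstract does not specify the extractor or sampling scheme.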