Paper Title
StyleTalker: One-shot Style-based Audio-driven Talking Head Video Generation
Paper Authors
Paper Abstract
We propose StyleTalker, a novel audio-driven talking head generation model that can synthesize a video of a talking person from a single reference image with accurately audio-synced lip shapes, realistic head poses, and eye blinks. Specifically, by leveraging a pretrained image generator and an image encoder, we estimate the latent codes of the talking head video that faithfully reflect the given audio. This is made possible with several newly devised components: 1) a contrastive lip-sync discriminator for accurate lip synchronization, 2) a conditional sequential variational autoencoder that learns a latent motion space disentangled from the lip movements, such that we can independently manipulate the motions and lip movements while preserving the identity, and 3) an auto-regressive prior augmented with normalizing flow to learn a complex audio-to-motion multi-modal latent space. Equipped with these components, StyleTalker can generate talking head videos not only in a motion-controllable way when another motion source video is given, but also in a completely audio-driven manner by inferring realistic motions from the input audio. Through extensive experiments and user studies, we show that our model is able to synthesize talking head videos with impressive perceptual quality that are accurately lip-synced with the input audio, largely outperforming state-of-the-art baselines.
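As a rough illustration of the first component, the following is a minimal sketch of what an InfoNCE-style contrastive lip-sync objective could look like, assuming hypothetical lip-region and audio-window encoders and simple in-batch negatives; the paper's actual discriminator architecture and loss formulation may differ.

```python
import torch
import torch.nn.functional as F

def contrastive_lip_sync_loss(lip_emb: torch.Tensor,
                              audio_emb: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    """Contrastive (InfoNCE-style) sync objective between video and audio embeddings.

    lip_emb:   (B, D) embeddings of lip-region frame windows (hypothetical video encoder)
    audio_emb: (B, D) embeddings of the corresponding audio windows (hypothetical audio encoder)

    Matching (lip, audio) pairs along the batch diagonal are pulled together,
    while mismatched pairs within the batch serve as negatives.
    """
    lip = F.normalize(lip_emb, dim=-1)
    aud = F.normalize(audio_emb, dim=-1)

    # Cosine-similarity logits between every lip/audio pair in the batch.
    logits = lip @ aud.t() / temperature                      # (B, B)
    targets = torch.arange(lip.size(0), device=lip.device)    # positives on the diagonal

    # Symmetric cross-entropy: lip-to-audio and audio-to-lip retrieval.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

In this kind of setup, the sync score for a generated frame/audio pair can then be reused as a discriminator signal that encourages the generator to produce accurately lip-synced frames.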