Paper Title

SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation

Authors

Wenxuan Zhang, Xiaodong Cun, Xuan Wang, Yong Zhang, Xi Shen, Yu Guo, Ying Shan, Fei Wang

Abstract

Generating talking head videos through a face image and a piece of speech audio still contains many challenges, i.e., unnatural head movement, distorted expression, and identity modification. We argue that these issues are mainly because of learning from the coupled 2D motion fields. On the other hand, explicitly using 3D information also suffers from problems of stiff expression and incoherent video. We present SadTalker, which generates 3D motion coefficients (head pose, expression) of the 3DMM from audio and implicitly modulates a novel 3D-aware face render for talking head generation. To learn the realistic motion coefficients, we explicitly model the connections between audio and different types of motion coefficients individually. Precisely, we present ExpNet to learn the accurate facial expression from audio by distilling both coefficients and 3D-rendered faces. As for the head pose, we design PoseVAE via a conditional VAE to synthesize head motion in different styles. Finally, the generated 3D motion coefficients are mapped to the unsupervised 3D keypoints space of the proposed face render to synthesize the final video. We conducted extensive experiments to demonstrate the superiority of our method in terms of motion and video quality.
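
The abstract outlines a three-part pipeline: ExpNet maps audio to 3DMM expression coefficients, PoseVAE samples stylized head-pose sequences via a conditional VAE, and the combined coefficients drive a 3D-aware face render through an unsupervised 3D keypoint space. Below is a minimal structural sketch of how those pieces connect. It is not the authors' implementation; all feature dimensions, layer sizes, and the toy keypoint mapping are illustrative assumptions.

```python
# Structural sketch of the SadTalker pipeline described in the abstract.
# Module names follow the paper; everything else (shapes, layers) is assumed.
import torch
import torch.nn as nn

class ExpNet(nn.Module):
    """Maps per-frame audio features to 3DMM expression coefficients (dims assumed)."""
    def __init__(self, audio_dim=80, exp_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(audio_dim, 256), nn.ReLU(),
                                 nn.Linear(256, exp_dim))

    def forward(self, audio_feat):           # (B, T, audio_dim)
        return self.net(audio_feat)          # (B, T, exp_dim)

class PoseVAE(nn.Module):
    """Conditional-VAE-style decoder: a sampled latent style code plus audio
    features produces a head-pose sequence (rotation + translation, dims assumed)."""
    def __init__(self, audio_dim=80, latent_dim=16, pose_dim=6):
        super().__init__()
        self.latent_dim = latent_dim
        self.decoder = nn.Sequential(nn.Linear(audio_dim + latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, pose_dim))

    def forward(self, audio_feat, z=None):   # (B, T, audio_dim)
        B, T, _ = audio_feat.shape
        if z is None:                        # one style code per sequence
            z = torch.randn(B, 1, self.latent_dim).expand(B, T, self.latent_dim)
        return self.decoder(torch.cat([audio_feat, z], dim=-1))   # (B, T, pose_dim)

class FaceRender(nn.Module):
    """Stand-in for the 3D-aware face render: motion coefficients are mapped to
    unsupervised 3D keypoints; a real renderer would warp the source image with them."""
    def __init__(self, exp_dim=64, pose_dim=6, num_kp=15):
        super().__init__()
        self.to_keypoints = nn.Linear(exp_dim + pose_dim, num_kp * 3)

    def forward(self, source_image, exp, pose):
        kp = self.to_keypoints(torch.cat([exp, pose], dim=-1))    # (B, T, num_kp*3)
        return kp  # the toy sketch stops at keypoints instead of rendered frames

if __name__ == "__main__":
    audio_feat = torch.randn(1, 32, 80)          # 32 frames of assumed audio features
    source_image = torch.randn(1, 3, 256, 256)   # single source face image
    exp = ExpNet()(audio_feat)
    pose = PoseVAE()(audio_feat)
    kp = FaceRender()(source_image, exp, pose)
    print(exp.shape, pose.shape, kp.shape)
```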
