Paper Title
Semantic-Aware Implicit Neural Audio-Driven Video Portrait Generation
Paper Authors
Paper Abstract
Animating a high-fidelity video portrait with speech audio is crucial for virtual reality and digital entertainment. While most previous studies rely on accurate explicit structural information, recent works explore the implicit scene representation of Neural Radiance Fields (NeRF) for realistic generation. To capture the inconsistent motions as well as the semantic difference between the human head and torso, some works model them via two individual sets of NeRF, leading to unnatural results. In this work, we propose Semantic-aware Speaking Portrait NeRF (SSP-NeRF), which creates delicate audio-driven portraits using one unified set of NeRF. The proposed model handles the detailed local facial semantics and the global head-torso relationship through two semantic-aware modules. Specifically, we first propose a Semantic-Aware Dynamic Ray Sampling module with an additional parsing branch that facilitates audio-driven volume rendering. Moreover, to enable portrait rendering in one unified neural radiance field, a Torso Deformation module is designed to stabilize the large-scale non-rigid torso motions. Extensive evaluations demonstrate that our proposed approach renders more realistic video portraits than previous methods. Project page: https://alvinliu0.github.io/projects/SSP-NeRF
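The audio-driven volume rendering that SSP-NeRF builds on uses the standard NeRF compositing rule: each sample along a ray gets an alpha from its predicted density, and colors are blended by alpha times the accumulated transmittance. The sketch below is a minimal, hypothetical illustration of that compositing step (the function name and NumPy formulation are ours, not the authors' code, which additionally conditions on audio features and semantic parsing):

```python
import numpy as np

def volume_render(sigmas, colors, deltas):
    """Composite per-sample colors along one ray (standard NeRF quadrature).

    sigmas: (N,) predicted volume densities at the N ray samples
    colors: (N, 3) predicted RGB at the samples
    deltas: (N,) distances between adjacent samples
    Returns the rendered RGB and the per-sample compositing weights.
    """
    # alpha_i = 1 - exp(-sigma_i * delta_i): opacity contributed by sample i
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # T_i = prod_{j < i} (1 - alpha_j): transmittance reaching sample i
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = alphas * trans                       # (N,)
    rgb = (weights[:, None] * colors).sum(axis=0)  # (3,)
    return rgb, weights
```

An opaque first sample (very large density) should dominate the ray: the rendered color collapses to that sample's color, and later samples receive near-zero weight, which is the behavior the compositing weights are designed to capture.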