Paper Title
Synthesizing Photorealistic Virtual Humans Through Cross-modal Disentanglement
Paper Authors
Paper Abstract
Over the last few decades, many aspects of human life have been enhanced with virtual domains, from the advent of digital assistants such as Amazon's Alexa and Apple's Siri to the latest metaverse efforts of the rebranded Meta. These trends underscore the importance of generating photorealistic visual depictions of humans, and have driven the rapid growth of so-called deepfake and talking-head generation methods in recent years. Despite their impressive results and popularity, these methods usually fall short in certain qualitative aspects, such as texture quality, lip synchronization, or resolution, and in practical aspects, such as the ability to run in real time. To allow virtual human avatars to be used in practical scenarios, we propose an end-to-end framework for synthesizing high-quality virtual human faces capable of speaking with accurate lip motion, with a special emphasis on performance. We introduce a novel network that uses visemes as an intermediate audio representation, together with a novel data augmentation strategy employing a hierarchical image synthesis approach that allows disentanglement of the different modalities used to control the global head motion. Our method runs in real time and delivers superior results compared to the current state of the art.
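To make the notion of a viseme-based intermediate representation concrete, below is a minimal, illustrative Python sketch (not the paper's implementation) of how phonemes produced by an audio front end can be grouped into viseme classes that share a similar mouth shape. The specific grouping, names such as PHONEME_TO_VISEME and phonemes_to_viseme_ids, and the tiny viseme inventory are assumptions for illustration only; real lip-sync pipelines typically use larger, standardized viseme sets.

```python
# Illustrative sketch: visemes group phonemes that share a similar mouth shape,
# giving a compact, speaker-independent signal a face-synthesis network can be
# conditioned on instead of raw audio features.

# Hypothetical phoneme-to-viseme grouping (toy inventory for illustration).
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "aa": "open", "ae": "open",
    "iy": "spread", "ih": "spread",
    "uw": "rounded", "ow": "rounded",
    "sil": "neutral",
}

# Assign each viseme class a stable integer ID for use as a network input.
VISEME_IDS = {v: i for i, v in enumerate(sorted(set(PHONEME_TO_VISEME.values())))}


def phonemes_to_viseme_ids(phonemes: list[str]) -> list[int]:
    """Map a phoneme sequence (e.g., from a forced aligner) to viseme class IDs."""
    return [VISEME_IDS[PHONEME_TO_VISEME.get(p, "neutral")] for p in phonemes]


if __name__ == "__main__":
    # Example: phoneme sequence for the word "beam" followed by silence.
    print(phonemes_to_viseme_ids(["b", "iy", "m", "sil"]))
```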