Paper Title
Compositional Human-Scene Interaction Synthesis with Semantic Control
Paper Authors
Paper Abstract
Synthesizing natural interactions between virtual humans and their 3D environments is critical for numerous applications, such as computer games and AR/VR experiences. Our goal is to synthesize humans interacting with a given 3D scene controlled by high-level semantic specifications as pairs of action categories and object instances, e.g., "sit on the chair". The key challenge of incorporating interaction semantics into the generation framework is to learn a joint representation that effectively captures heterogeneous information, including human body articulation, 3D object geometry, and the intent of the interaction. To address this challenge, we design a novel transformer-based generative model, in which the articulated 3D human body surface points and 3D objects are jointly encoded in a unified latent space, and the semantics of the interaction between the human and objects are embedded via positional encoding. Furthermore, inspired by the compositional nature of interactions that humans can simultaneously interact with multiple objects, we define interaction semantics as the composition of varying numbers of atomic action-object pairs. Our proposed generative model can naturally incorporate varying numbers of atomic interactions, which enables synthesizing compositional human-scene interactions without requiring composite interaction data. We extend the PROX dataset with interaction semantic labels and scene instance segmentation to evaluate our method and demonstrate that our method can generate realistic human-scene interactions with semantic control. Our perceptual study shows that our synthesized virtual humans can naturally interact with 3D scenes, considerably outperforming existing methods. We name our method COINS, for COmpositional INteraction Synthesis with Semantic Control. Code and data are available at https://github.com/zkf1997/COINS.
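Two ideas from the abstract — injecting interaction semantics as additive embeddings, in the spirit of positional encoding, and composing a variable number of atomic action-object pairs into one token sequence — can be illustrated with a minimal sketch. This is not the COINS architecture itself; all names, shapes, and the action vocabulary below are hypothetical, and the transformer that would consume these tokens is omitted.

```python
# Illustrative sketch (not the authors' implementation): tag body and object
# tokens with an action embedding added element-wise, akin to positional
# encoding, then concatenate atomic interactions into one sequence.
import numpy as np

rng = np.random.default_rng(0)
embed_dim = 8  # hypothetical token dimension
ACTIONS = ["sit on", "touch", "lie on"]  # assumed atomic action vocabulary
action_table = {a: rng.standard_normal(embed_dim) for a in ACTIONS}

def encode_atomic_interaction(body_tokens, object_tokens, action):
    """Add the action embedding to both body and object tokens, then
    concatenate them into one token sequence for a transformer."""
    sem = action_table[action]
    return np.concatenate([body_tokens + sem, object_tokens + sem], axis=0)

def compose(interactions):
    """Compose a variable number of atomic (body, object, action) triples
    by concatenating their token sequences."""
    return np.concatenate(
        [encode_atomic_interaction(b, o, a) for (b, o, a) in interactions],
        axis=0,
    )

# Example: "sit on" a chair while "touch"-ing a table.
body = rng.standard_normal((4, embed_dim))   # 4 body surface tokens
chair = rng.standard_normal((5, embed_dim))  # 5 chair geometry tokens
table = rng.standard_normal((3, embed_dim))  # 3 table geometry tokens
tokens = compose([(body, chair, "sit on"), (body, table, "touch")])
print(tokens.shape)  # (16, 8): (4 + 5) + (4 + 3) tokens
```

Because composition is just concatenation of independently encoded atomic pairs, the same machinery handles one or several simultaneous interactions without any composite training data, which is the property the abstract highlights.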