Paper Title

Unsupervised Disentanglement of Pose, Appearance and Background from Images and Videos

Paper Authors

Aysegul Dundar, Kevin J. Shih, Animesh Garg, Robert Pottorf, Andrew Tao, Bryan Catanzaro

Abstract

Unsupervised landmark learning is the task of learning semantic keypoint-like representations without the use of expensive input keypoint-level annotations. A popular approach is to factorize an image into a pose and appearance data stream, then to reconstruct the image from the factorized components. The pose representation should capture a set of consistent and tightly localized landmarks in order to facilitate reconstruction of the input image. Ultimately, we wish for our learned landmarks to focus on the foreground object of interest. However, the reconstruction task of the entire image forces the model to allocate landmarks to model the background. This work explores the effects of factorizing the reconstruction task into separate foreground and background reconstructions, conditioning only the foreground reconstruction on the unsupervised landmarks. Our experiments demonstrate that the proposed factorization results in landmarks that are focused on the foreground object of interest. Furthermore, the rendered background quality is also improved, as the background rendering pipeline no longer requires the ill-suited landmarks to model its pose and appearance. We demonstrate this improvement in the context of the video-prediction task.
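To make the described factorization concrete, here is a minimal sketch (not the authors' implementation) of how a foreground/background split of the reconstruction might be wired up in PyTorch: a foreground decoder is conditioned on the unsupervised landmark heatmaps plus appearance features, a separate background decoder receives no landmark input, and a predicted soft mask composites the two branches before the reconstruction loss. All module and variable names below are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorizedReconstructor(nn.Module):
    """Illustrative sketch: composite a landmark-conditioned foreground with a
    separately rendered background using a predicted soft mask."""
    def __init__(self, n_landmarks=10, app_dim=64, img_channels=3):
        super().__init__()
        # Foreground decoder sees landmark heatmaps + appearance features only.
        self.fg_decoder = nn.Sequential(
            nn.Conv2d(n_landmarks + app_dim, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, img_channels + 1, 3, padding=1),  # RGB + mask logit
        )
        # Background decoder is NOT conditioned on landmarks.
        self.bg_decoder = nn.Sequential(
            nn.Conv2d(app_dim, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, img_channels, 3, padding=1),
        )

    def forward(self, landmark_heatmaps, fg_appearance, bg_appearance):
        fg_out = self.fg_decoder(torch.cat([landmark_heatmaps, fg_appearance], dim=1))
        fg_rgb, mask_logit = fg_out[:, :-1], fg_out[:, -1:]
        mask = torch.sigmoid(mask_logit)               # soft foreground mask
        bg_rgb = self.bg_decoder(bg_appearance)
        recon = mask * fg_rgb + (1.0 - mask) * bg_rgb  # alpha-composite FG over BG
        return recon, mask

# Usage: reconstruction loss on the composited image.
model = FactorizedReconstructor()
heatmaps = torch.randn(2, 10, 32, 32)   # unsupervised landmark heatmaps
fg_app = torch.randn(2, 64, 32, 32)     # foreground appearance features
bg_app = torch.randn(2, 64, 32, 32)     # background appearance features
target = torch.rand(2, 3, 32, 32)       # image to reconstruct
recon, mask = model(heatmaps, fg_app, bg_app)
loss = F.l1_loss(recon, target)
```

Because only the foreground branch receives the landmark heatmaps in this sketch, the reconstruction loss cannot be reduced by spending landmarks on the background, which is the intuition behind the paper's proposed split.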
