Paper Title

Learning Object-Centric Video Models by Contrasting Sets

Authors

Sindy Löwe, Klaus Greff, Rico Jonschkowski, Alexey Dosovitskiy, Thomas Kipf

Abstract

Contrastive, self-supervised learning of object representations recently emerged as an attractive alternative to reconstruction-based training. Prior approaches focus on contrasting individual object representations (slots) against one another. However, a fundamental problem with this approach is that the overall contrastive loss is the same for (i) representing a different object in each slot, as it is for (ii) (re-)representing the same object in all slots. Thus, this objective does not inherently push towards the emergence of object-centric representations in the slots. We address this problem by introducing a global, set-based contrastive loss: instead of contrasting individual slot representations against one another, we aggregate the representations and contrast the joined sets against one another. Additionally, we introduce attention-based encoders to this contrastive setup which simplifies training and provides interpretable object masks. Our results on two synthetic video datasets suggest that this approach compares favorably against previous contrastive methods in terms of reconstruction, future prediction and object separation performance.
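The core idea of the set-based loss can be illustrated with a minimal NumPy sketch. This is my illustration only, not the paper's exact formulation: the function names, the sum-pooling aggregator, and the InfoNCE-style objective are assumptions. Slot representations are first aggregated into a single permutation-invariant set vector per sample, and the aggregated sets (rather than individual slots) are then contrasted across the batch:

```python
import numpy as np

def aggregate_slots(slots):
    # Sum-pool the slot representations into one permutation-invariant
    # set vector per sample: (batch, num_slots, dim) -> (batch, dim).
    return slots.sum(axis=1)

def set_contrastive_loss(pred_slots, target_slots, temperature=0.5):
    """InfoNCE-style loss over aggregated (set-level) representations.

    pred_slots, target_slots: arrays of shape (batch, num_slots, dim).
    The positive pair for each sample is its own aggregated target set;
    the other samples in the batch serve as negatives.
    """
    z_pred = aggregate_slots(pred_slots)    # (batch, dim)
    z_tgt = aggregate_slots(target_slots)   # (batch, dim)
    # Normalize so the dot products below are cosine similarities.
    z_pred = z_pred / np.linalg.norm(z_pred, axis=1, keepdims=True)
    z_tgt = z_tgt / np.linalg.norm(z_tgt, axis=1, keepdims=True)
    logits = z_pred @ z_tgt.T / temperature  # (batch, batch)
    # Row-wise log-softmax; positives sit on the diagonal.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Because the aggregation is a sum over slots, the loss is unchanged under any permutation of slot order, which is exactly why it cannot distinguish *which* slot holds *which* object, only what the set as a whole represents.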
