Paper Title

SHERLock: Self-Supervised Hierarchical Event Representation Learning

Paper Authors

Sumegh Roychowdhury, Sumedh A. Sontakke, Nikaash Puri, Mausoom Sarkar, Milan Aggarwal, Pinkesh Badjatiya, Balaji Krishnamurthy, Laurent Itti

Paper Abstract

Temporal event representations are an essential aspect of learning among humans. They allow for succinct encoding of the experiences we have through a variety of sensory inputs. Also, they are believed to be arranged hierarchically, allowing for an efficient representation of complex long-horizon experiences. Additionally, these representations are acquired in a self-supervised manner. Analogously, here we propose a model that learns temporal representations from long-horizon visual demonstration data and associated textual descriptions, without explicit temporal supervision. Our method produces a hierarchy of representations that align more closely with ground-truth human-annotated events (+15.3) than state-of-the-art unsupervised baselines. Our results are comparable to heavily-supervised baselines in complex visual domains such as Chess Openings, YouCook2 and TutorialVQA datasets. Finally, we perform ablation studies illustrating the robustness of our approach. We release our code and demo visualizations in the Supplementary Material.
