Paper Title


Labelling unlabelled videos from scratch with multi-modal self-supervision

Authors

Yuki M. Asano, Mandela Patrick, Christian Rupprecht, Andrea Vedaldi

Abstract


A large part of the current success of deep learning lies in the effectiveness of data -- more precisely: labelled data. Yet, labelling a dataset with human annotation continues to carry high costs, especially for videos. While in the image domain recent methods have made it possible to generate meaningful (pseudo-) labels for unlabelled datasets without supervision, this development is missing for the video domain, where learning feature representations is the current focus. In this work, we a) show that unsupervised labelling of a video dataset does not come for free from strong feature encoders and b) propose a novel clustering method that allows pseudo-labelling of a video dataset without any human annotations, by leveraging the natural correspondence between the audio and visual modalities. An extensive analysis shows that the resulting clusters have high semantic overlap with ground-truth human labels. We further introduce the first benchmarking results on unsupervised labelling of the common video datasets Kinetics, Kinetics-Sound, VGG-Sound and AVE.
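The abstract does not describe the proposed clustering method itself, so the following is only an illustrative sketch of the general idea of pseudo-labelling by clustering joint audio-visual features: plain k-means (not the paper's method) run on concatenated per-video audio and visual embeddings. The function names and toy data are hypothetical.

```python
import numpy as np

def kmeans(feats, k, iters=50, seed=0):
    """Plain Lloyd's k-means; returns one pseudo-label per sample."""
    rng = np.random.default_rng(seed)
    # Initialise centers from randomly chosen data points.
    centers = feats[rng.choice(len(feats), k, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distance of every sample to every center.
        dists = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for c in range(k):
            if (labels == c).any():  # guard against empty clusters
                centers[c] = feats[labels == c].mean(0)
    return labels

def pseudo_label(visual, audio, k):
    """Assign pseudo-labels by clustering videos on their
    concatenated audio-visual features. (A real pipeline would
    also normalise and balance the two modalities.)"""
    return kmeans(np.concatenate([visual, audio], axis=1), k)

# Toy data: two well-separated groups of 20 "videos" each,
# with 8-d visual and 4-d audio features.
rng = np.random.default_rng(1)
vis = np.vstack([rng.normal(0, 0.1, (20, 8)), rng.normal(5, 0.1, (20, 8))])
aud = np.vstack([rng.normal(0, 0.1, (20, 4)), rng.normal(5, 0.1, (20, 4))])
labels = pseudo_label(vis, aud, k=2)
```

On this toy data the two groups are recovered as two distinct pseudo-label clusters; the paper's contribution lies in making such clustering work on real, unlabelled video where the modalities must supervise each other.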
