Paper Title

A Closer Look at Weakly-Supervised Audio-Visual Source Localization

Paper Authors

Shentong Mo, Pedro Morgado

Paper Abstract

Audio-visual source localization is a challenging task that aims to predict the location of visual sound sources in a video. Since collecting ground-truth annotations of sounding objects can be costly, a plethora of weakly-supervised localization methods that can learn from datasets with no bounding-box annotations have been proposed in recent years, by leveraging the natural co-occurrence of audio and visual signals. Despite significant interest, popular evaluation protocols have two major flaws. First, they allow for the use of a fully annotated dataset to perform early stopping, thus significantly increasing the annotation effort required for training. Second, current evaluation metrics assume the presence of sound sources at all times. This is of course an unrealistic assumption, and thus better metrics are necessary to capture the model's performance on (negative) samples with no visible sound sources. To accomplish this, we extend the test set of popular benchmarks, Flickr SoundNet and VGG-Sound Sources, in order to include negative samples, and measure performance using metrics that balance localization accuracy and recall. Using the new protocol, we conducted an extensive evaluation of prior methods, and found that most prior works are not capable of identifying negatives and suffer from significant overfitting problems (rely heavily on early stopping for best results). We also propose a new approach for visual sound source localization that addresses both these problems. In particular, we found that, through extreme visual dropout and the use of momentum encoders, the proposed approach combats overfitting effectively, and establishes a new state-of-the-art performance on both Flickr SoundNet and VGG-Sound Source. Code and pre-trained models are available at https://github.com/stoneMo/SLAVC.
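
To make the evaluation idea concrete, below is a minimal sketch of a metric that balances localization accuracy and recall on a test set containing negative samples (clips with no visible sound source). The box format, the `iou` helper, and the use of `None` to denote "no visible source" are illustrative assumptions, not the paper's exact protocol; the key point is that a hallucinated prediction on a negative sample must count against the model.

```python
def iou(a, b):
    # Intersection-over-union for axis-aligned boxes (x1, y1, x2, y2).
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

def localization_f1(predictions, ground_truth, iou_thresh=0.5):
    """predictions / ground_truth: lists of boxes, or None for
    "no visible sound source". F1 balances precision and recall."""
    tp = fp = fn = 0
    for pred, gt in zip(predictions, ground_truth):
        if gt is None:
            if pred is not None:
                fp += 1  # hallucinated a source on a negative sample
        elif pred is None:
            fn += 1      # missed a visible source
        elif iou(pred, gt) >= iou_thresh:
            tp += 1      # correctly localized
        else:
            fp += 1      # wrong location penalizes both
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```

Under this kind of metric, a model that always predicts a source (as the abstract notes most prior works effectively do) is penalized on every negative sample.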

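The two ingredients the abstract credits with combating overfitting, extreme visual dropout and a momentum encoder, can be sketched roughly as follows. This is a hedged PyTorch illustration only: the module names, the 0.9 dropout rate, and the placement of dropout on the encoder output are assumptions, not the authors' actual SLAVC implementation (see the linked repository for that).

```python
import copy
import torch
import torch.nn as nn

class MomentumVisualBranch(nn.Module):
    """Illustrative pairing of (1) "extreme" dropout on visual features,
    far above the usual 0.1-0.5, and (2) an EMA (momentum) copy of the
    visual encoder that provides stable targets without dropout."""

    def __init__(self, encoder: nn.Module, momentum: float = 0.999,
                 drop_rate: float = 0.9):
        super().__init__()
        self.encoder = encoder
        # EMA copy; updated by exponential moving average, not gradients.
        self.ema = copy.deepcopy(encoder)
        for p in self.ema.parameters():
            p.requires_grad = False
        self.momentum = momentum
        self.dropout = nn.Dropout(p=drop_rate)

    @torch.no_grad()
    def update_ema(self):
        # theta_ema <- m * theta_ema + (1 - m) * theta
        for p_ema, p in zip(self.ema.parameters(), self.encoder.parameters()):
            p_ema.mul_(self.momentum).add_(p, alpha=1.0 - self.momentum)

    def forward(self, frames: torch.Tensor):
        # Online branch sees heavily dropped-out visual features, so it
        # cannot memorize the training videos; the momentum branch gives
        # a clean, slowly-moving target representation.
        online = self.dropout(self.encoder(frames))
        with torch.no_grad():
            target = self.ema(frames)
        return online, target
```

In a training loop, `update_ema()` would be called after each optimizer step, mirroring momentum-encoder setups such as MoCo.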