Paper Title
Contrastive Positive Sample Propagation along the Audio-Visual Event Line
Paper Authors
Paper Abstract
Visual and audio signals often coexist in natural environments, forming audio-visual events (AVEs). Given a video, we aim to localize the video segments containing an AVE and identify its category. Learning discriminative features for each video segment is pivotal. Unlike existing work that focuses on audio-visual feature fusion, in this paper we propose a new contrastive positive sample propagation (CPSP) method for better deep feature representation learning. The contribution of CPSP is to introduce the available full or weak labels as a prior for constructing exact positive-negative samples for contrastive learning. Specifically, CPSP involves comprehensive contrastive constraints: pair-level positive sample propagation (PSP), and segment-level and video-level positive sample activation (PSA$_S$ and PSA$_V$). Three new contrastive objectives are proposed (\emph{i.e.}, $\mathcal{L}_{\text{avpsp}}$, $\mathcal{L}_{\text{spsa}}$, and $\mathcal{L}_{\text{vpsa}}$) and introduced into both fully and weakly supervised AVE localization. To draw a complete picture of contrastive learning in AVE localization, we also study self-supervised positive sample propagation (SSPSP). As a result, CPSP helps obtain refined audio-visual features that are distinguishable from the negatives, benefiting the classifier prediction. Extensive experiments on the AVE and the newly collected VGGSound-AVEL100k datasets verify the effectiveness and generalization ability of our method.
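To make the label-as-prior idea concrete, below is a minimal sketch (assuming PyTorch) of an InfoNCE-style contrastive loss in which segments sharing an event label form the positive set and all other segments are negatives. The function name, tensor shapes, and temperature value are hypothetical illustrations; the paper's actual objectives ($\mathcal{L}_{\text{avpsp}}$, $\mathcal{L}_{\text{spsa}}$, $\mathcal{L}_{\text{vpsa}}$) are defined in the full text and are not reproduced here.

```python
import torch
import torch.nn.functional as F

def label_guided_contrastive_loss(features, labels, temperature=0.1):
    """InfoNCE-style loss where samples sharing a label act as positives.

    features: (N, D) fused audio-visual segment embeddings (hypothetical shape).
    labels:   (N,) integer event labels; same-label segments are positives,
              all remaining segments are negatives.
    Note: a generic supervised-contrastive sketch, not the exact CPSP
    objectives (L_avpsp, L_spsa, L_vpsa) defined in the paper.
    """
    z = F.normalize(features, dim=1)                 # unit-norm embeddings
    sim = z @ z.t() / temperature                    # (N, N) similarity logits
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~self_mask

    sim = sim.masked_fill(self_mask, float('-inf'))  # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # average log-probability over each sample's positive set
    pos_counts = pos_mask.sum(1).clamp(min=1)
    loss = -log_prob.masked_fill(~pos_mask, 0.0).sum(1) / pos_counts
    return loss[pos_mask.any(1)].mean()              # drop rows with no positives
```

In the fully supervised setting, segment-level event labels would play the role of `labels`; under weak supervision only a video-level label is available, which is why the abstract distinguishes segment-level (PSA$_S$) and video-level (PSA$_V$) constraints.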