Paper Title
Learning Branched Fusion and Orthogonal Projection for Face-Voice Association
Paper Authors
Paper Abstract
Recent years have seen an increased interest in establishing associations between the faces and voices of celebrities by leveraging audio-visual information from YouTube. Prior works adopt metric learning methods to learn an embedding space that is amenable to the associated matching and verification tasks. Albeit showing some progress, such formulations are restrictive due to their dependency on a distance-dependent margin parameter, poor run-time training complexity, and reliance on carefully crafted negative mining procedures. In this work, we hypothesize that an enriched representation coupled with effective yet efficient supervision is important for realizing a discriminative joint embedding space for face-voice association tasks. To this end, we propose a light-weight, plug-and-play mechanism that exploits the complementary cues in both modalities to form enriched fused embeddings and clusters them based on their identity labels via orthogonality constraints. We coin our proposed mechanism fusion and orthogonal projection (FOP) and instantiate it in a two-stream network. The overall resulting framework is evaluated on the VoxCeleb1 and MAV-Celeb datasets with a multitude of tasks, including cross-modal verification and matching. Results reveal that our method performs favourably against the current state-of-the-art methods, and our proposed formulation of supervision is more effective and efficient than the ones employed by contemporary methods. In addition, we leverage the cross-modal verification and matching tasks to analyze the impact of multiple languages on face-voice association. Code is available: \url{https://github.com/msaadsaeed/FOP}
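To illustrate the orthogonality-constraint idea the abstract describes, below is a minimal NumPy sketch of an orthogonal-projection-style objective: it pushes the cosine similarity of same-identity fused embeddings toward 1 and that of different-identity embeddings toward 0. This is an illustrative approximation, not the authors' exact loss; the function name and weighting are assumptions for the sketch.

```python
import numpy as np

def orthogonal_projection_loss(embeddings, labels):
    """Sketch of an orthogonality-based clustering objective.

    embeddings: (N, D) array of fused face-voice embeddings.
    labels: length-N identity labels.
    Returns (1 - mean same-identity cosine sim) + mean |different-identity cosine sim|,
    which is minimized when same-identity embeddings align and
    different-identity embeddings are mutually orthogonal.
    """
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T  # pairwise cosine similarities
    labels = np.asarray(labels)
    same = (labels[:, None] == labels[None, :]) & ~np.eye(len(labels), dtype=bool)
    diff = labels[:, None] != labels[None, :]
    s_same = sim[same].mean() if same.any() else 1.0
    s_diff = np.abs(sim[diff]).mean() if diff.any() else 0.0
    return (1.0 - s_same) + s_diff
```

With same-identity embeddings aligned and different identities orthogonal (e.g. two identities on the two coordinate axes of a 2-D space), the loss is zero; mixing identities along the same direction drives it up, which is the clustering behaviour the supervision is meant to induce.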