Paper Title
SLICER: Learning universal audio representations using low-resource self-supervised pre-training
Paper Authors
Paper Abstract
We present a new Self-Supervised Learning (SSL) approach to pre-train encoders on unlabeled audio data that reduces the need for large amounts of labeled data for audio and speech classification. Our primary aim is to learn audio representations that can generalize across a large variety of speech and non-speech tasks in a low-resource unlabeled audio pre-training setting. Inspired by the recent success of clustering and contrastive learning paradigms for SSL-based speech representation learning, we propose SLICER (Symmetrical Learning of Instance and Cluster-level Efficient Representations), which brings together the best of both the clustering and the contrastive learning paradigms. We use a symmetric loss between latent representations from student and teacher encoders, and simultaneously solve instance- and cluster-level contrastive learning tasks. We obtain cluster representations online by simply projecting the input spectrogram into an output subspace whose dimensionality equals the number of clusters. In addition, we propose k-mix, a novel mel-spectrogram augmentation procedure based on mixup that does not require labels and thus aids unsupervised representation learning for audio. Overall, SLICER achieves state-of-the-art results on the LAPE Benchmark \cite{9868132}, significantly outperforming DeLoRes-M and other prior approaches that are pre-trained on $10\times$ more unsupervised data. We will make all our code available on GitHub.
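The abstract pairs an instance-level contrastive task (over example embeddings) with a cluster-level task (over the columns of a projection whose width equals the number of clusters), made symmetric across student and teacher encoders. Below is a minimal PyTorch sketch of one plausible reading of that setup; the function names, the shared temperature tau, and the equal weighting of the instance and cluster terms are illustrative assumptions, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def info_nce(a, b, tau=0.5):
    # Standard InfoNCE: matching rows of a and b are positives,
    # all other rows in the batch serve as negatives.
    a = F.normalize(a, dim=1)
    b = F.normalize(b, dim=1)
    logits = a @ b.t() / tau
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)

def slicer_style_loss(z_student, z_teacher, tau=0.5):
    # z_student, z_teacher: (B, K) projections of two views of the same
    # batch, where K equals the number of clusters.
    # Instance level: each row is one example's embedding.
    loss_inst = info_nce(z_student, z_teacher, tau)
    # Cluster level: a softmax over the K-dimensional projection gives soft
    # cluster assignments; each column (one per cluster) is treated as an
    # online cluster representation and contrasted across the two views.
    p_student = F.softmax(z_student, dim=1).t()  # (K, B)
    p_teacher = F.softmax(z_teacher, dim=1).t()
    loss_clust = info_nce(p_student, p_teacher, tau)
    # Symmetric loss: repeat with the student and teacher roles swapped.
    loss_inst_sym = info_nce(z_teacher, z_student, tau)
    loss_clust_sym = info_nce(p_teacher, p_student, tau)
    return 0.5 * (loss_inst + loss_inst_sym) + 0.5 * (loss_clust + loss_clust_sym)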
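The abstract also introduces k-mix, a mixup-based mel-spectrogram augmentation that needs no labels. It does not specify how the k mixing candidates are chosen, so the sketch below shows only the label-free core of a mixup-style augmentation: a convex combination of each spectrogram with a random partner from the same batch. The alpha value and the permutation-based pairing are assumptions for illustration.

import torch

def label_free_mixup(mel_specs, alpha=0.4):
    # mel_specs: (B, n_mels, T) batch of mel-spectrograms.
    # Sample a mixing coefficient from a Beta distribution, as in mixup,
    # but skip label interpolation entirely (pre-training is unsupervised).
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(mel_specs.size(0))
    # Convex combination of each spectrogram with a random partner.
    return lam * mel_specs + (1.0 - lam) * mel_specs[perm]

Because no labels are interpolated, the augmented batch can be fed directly to the two encoder views during unsupervised pre-training.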