COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations

Paper title

COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations

Authors

Xavier Favory, Konstantinos Drossos, Tuomas Virtanen, Xavier Serra

Abstract

Audio representation learning based on deep neural networks (DNNs) has emerged as an alternative to hand-crafted features. To achieve high performance, DNNs often need a large amount of annotated data, which can be difficult and costly to obtain. In this paper, we propose a method for learning audio representations that aligns the learned latent representations of audio and associated tags. The alignment is done by maximizing the agreement between the latent representations of audio and tags, using a contrastive loss. The result is an audio embedding model that reflects both acoustic and semantic characteristics of sounds. We evaluate the quality of our embedding model by measuring its performance as a feature extractor on three different tasks (namely, sound event recognition, music genre classification, and musical instrument classification), and we investigate what type of characteristics the model captures. Our results are promising, sometimes on par with the state of the art in the considered tasks, and the embeddings produced with our method are well correlated with some acoustic descriptors.
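
The alignment step described in the abstract can be illustrated with a small sketch. The abstract only states that a contrastive loss maximizes the agreement between audio and tag latent representations; the symmetric InfoNCE-style form below, the cosine similarity, the use of other items in the batch as negatives, and the names contrastive_alignment_loss and temperature are all assumptions made for illustration, not the authors' exact formulation.

```python
import numpy as np


def contrastive_alignment_loss(audio_z, tag_z, temperature=0.1):
    """Illustrative contrastive loss between paired audio and tag embeddings.

    audio_z, tag_z: arrays of shape (batch, dim); row i of each is a pair.
    temperature: assumed scaling hyperparameter (hypothetical default).
    """
    # L2-normalize so the dot product is a cosine similarity.
    a = audio_z / np.linalg.norm(audio_z, axis=1, keepdims=True)
    t = tag_z / np.linalg.norm(tag_z, axis=1, keepdims=True)

    # Pairwise similarities: entry (i, j) compares audio i with tags j.
    logits = (a @ t.T) / temperature

    def cross_entropy_on_diagonal(scores):
        # Numerically stable log-softmax over each row.
        shifted = scores - scores.max(axis=1, keepdims=True)
        log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
        # The matching pair sits on the diagonal; the rest act as negatives.
        return -np.mean(np.diag(log_prob))

    # Average the audio-to-tag and tag-to-audio directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.normal(size=(8, 16))
    tags = audio + 0.05 * rng.normal(size=(8, 16))  # nearly aligned pairs
    print("aligned pairs:   ", contrastive_alignment_loss(audio, tags))
    print("mismatched pairs:", contrastive_alignment_loss(audio, tags[::-1]))
```

In this sketch the loss is low when each audio embedding is closest to its own tag embedding and higher when the pairing is scrambled, which is the sense in which minimizing it maximizes agreement between the two latent spaces.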
