Paper Title
Contrastive Environmental Sound Representation Learning
Paper Authors
Paper Abstract
Machine hearing of environmental sounds is one of the important problems in the audio recognition domain. It gives a machine the ability to discriminate between different input sounds, which guides its decision making. In this work we exploit a self-supervised contrastive technique and a shallow 1D CNN to extract distinctive audio features (audio representations) without using any explicit annotations. We generate representations of a given audio clip from both its raw waveform and its spectrogram, and evaluate whether the proposed learner is agnostic to the type of audio input. We further use canonical correlation analysis (CCA) to fuse the representations obtained from the two input types and demonstrate that the fused global feature yields a more robust representation of the audio signal than either individual representation. The proposed technique is evaluated on both ESC-50 and UrbanSound8K. The results show that it is able to extract most features of the environmental audio and gives improvements of 12.8% and 0.9% on the ESC-50 and UrbanSound8K datasets, respectively.
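The self-supervised contrastive step described in the abstract can be illustrated with a minimal sketch. The NT-Xent (SimCLR-style) loss below is a common choice for this kind of setup; the abstract does not specify the exact loss, encoder, or temperature, so the function and its parameters here are assumptions for illustration only, written in PyTorch.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """Sketch of an NT-Xent contrastive loss (assumed loss, not the paper's exact one).

    z1, z2: (batch, dim) embeddings of two views of the same audio clip,
    e.g. the outputs of a shallow 1D CNN on two augmented waveforms.
    """
    batch_size = z1.shape[0]
    # L2-normalize and stack both views: rows 0..B-1 are view 1, rows B..2B-1 are view 2.
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)          # (2B, dim)
    sim = (z @ z.t()) / temperature                              # pairwise cosine similarities
    sim.fill_diagonal_(float('-inf'))                            # exclude self-similarity
    # The positive for sample i is its other view at index i + B (and vice versa).
    targets = torch.cat([
        torch.arange(batch_size, device=z1.device) + batch_size,
        torch.arange(batch_size, device=z1.device),
    ])
    return F.cross_entropy(sim, targets)
```

The CCA fusion step can likewise be sketched with off-the-shelf tooling. Below, `wave_emb` and `spec_emb` are hypothetical embedding matrices standing in for the waveform-branch and spectrogram-branch representations; the dimensions, the number of CCA components, and the concatenation used for fusion are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
# Hypothetical placeholder embeddings: (n_clips, embedding_dim) per branch.
wave_emb = rng.standard_normal((200, 128))
spec_emb = rng.standard_normal((200, 128))

# Project both views into a shared 32-dimensional maximally correlated subspace.
cca = CCA(n_components=32)
wave_c, spec_c = cca.fit_transform(wave_emb, spec_emb)

# One plausible fusion choice: concatenate the projected views into a single
# global feature per clip (here of shape (200, 64)); averaging them is another option.
fused = np.concatenate([wave_c, spec_c], axis=1)
```

Concatenation is only one reasonable way to form a single global feature from the two projected views; the abstract leaves the exact fusion rule unspecified.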