Paper Title

Composing General Audio Representation by Fusing Multilayer Features of a Pre-trained Model

Paper Authors

Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Kunio Kashino

Paper Abstract

Many application studies rely on audio DNN models pre-trained on a large-scale dataset as essential feature extractors, and they extract features from the last layers. In this study, we focus on our finding that the middle layer features of existing supervised pre-trained models are more effective than the late layer features for some tasks. We propose a simple approach to compose features effective for general-purpose applications, consisting of two steps: (1) calculating feature vectors along the time frame from middle/late layer outputs, and (2) fusing them. This approach improves the utility of frequency and channel information in downstream processes, and combines the effectiveness of middle and late layer features for different tasks. As a result, the feature vectors become effective for general purposes. In the experiments using VGGish, PANNs' CNN14, and AST on nine downstream tasks, we first show that each layer output of these models serves different tasks. Then, we demonstrate that the proposed approach significantly improves their performance and brings it to a level comparable to that of the state-of-the-art. In particular, the performance of the non-semantic speech (NOSS) tasks greatly improves, especially on Speech commands V2 with VGGish (+77.1 points, from 14.3% to 91.4%).
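The two-step recipe in the abstract (per-layer framewise feature vectors, then fusion) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the layer shapes, the flattening of channel and frequency axes into one framewise vector, the mean pooling over time, and fusion by concatenation are all illustrative assumptions for a generic CNN backbone.

```python
import numpy as np

def layer_to_framewise(feature_map):
    # feature_map: (channels, frequency, time) output of one layer.
    # Flatten the channel and frequency axes so each time frame keeps
    # both kinds of information in a single vector: (time, channels*frequency).
    c, f, t = feature_map.shape
    return feature_map.reshape(c * f, t).T

def fuse_layers(layer_maps):
    # Step 1: framewise vectors per layer, pooled over time (mean here,
    # as a placeholder for whatever temporal pooling is used).
    # Step 2: fuse the per-layer vectors by concatenation.
    pooled = [layer_to_framewise(m).mean(axis=0) for m in layer_maps]
    return np.concatenate(pooled)

# Toy example: hypothetical "middle" and "late" layer outputs.
mid = np.random.rand(64, 16, 100)   # 64 channels, 16 freq bins, 100 frames
late = np.random.rand(512, 2, 25)   # 512 channels, 2 freq bins, 25 frames
fused = fuse_layers([mid, late])
print(fused.shape)  # (64*16 + 512*2,) = (2048,)
```

The point of flattening channels and frequency together, rather than pooling them away, matches the abstract's claim that the approach "improves the utility of frequency and channel information in downstream processes"; concatenation then lets middle-layer and late-layer information serve different downstream tasks.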
