Paper Title

Learning Speaker Embedding with Momentum Contrast

Paper Authors

Ke Ding, Xuanji He, Guanglu Wan

Paper Abstract

Speaker verification can be formulated as a representation learning task, where speaker-discriminative embeddings are extracted from utterances of variable length. Momentum Contrast (MoCo) is a recently proposed unsupervised representation learning framework that has shown its effectiveness in learning good feature representations for downstream vision tasks. In this work, we apply MoCo to learn speaker embeddings from speech segments. We explore MoCo in both unsupervised learning and pretraining settings. In the unsupervised scenario, embeddings are learned by MoCo from audio data without using any speaker-specific information. On a large-scale dataset with $2,500$ speakers, unsupervised MoCo training achieves an EER of $4.275\%$, and the EER decreases further to $3.58\%$ if extra unlabelled data are used. In the pretraining scenario, the encoder trained by MoCo is used to initialize the downstream supervised training. With finetuning of the MoCo-trained model, the equal error rate (EER) is reduced by $13.7\%$ relative ($1.44\%$ to $1.242\%$) compared to a carefully tuned baseline trained from scratch. A comparative study confirms the effectiveness of MoCo in learning good speaker embeddings.
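
As a rough illustration of the MoCo mechanism the abstract describes, below is a minimal PyTorch sketch of MoCo-style contrastive training on speech segments. This is not the authors' implementation: the toy `Encoder`, the feature shapes, and the hyperparameter values (`dim`, `queue_size`, `momentum`, `temperature`) are assumptions for illustration; only the momentum update of the key encoder and the InfoNCE loss follow the standard MoCo recipe.

```python
import torch
import torch.nn.functional as F

dim, queue_size, momentum, temperature = 128, 4096, 0.999, 0.07

class Encoder(torch.nn.Module):
    """Toy segment encoder: mean-pool log-mel frames over time, then project.
    A stand-in for the paper's actual network, which is not specified here."""
    def __init__(self, in_dim=40, out_dim=dim):
        super().__init__()
        self.proj = torch.nn.Linear(in_dim, out_dim)

    def forward(self, x):                 # x: (B, T, F) speech segments
        return self.proj(x.mean(dim=1))   # (B, out_dim)

encoder_q = Encoder()                     # query encoder, trained by backprop
encoder_k = Encoder()                     # key encoder, momentum-updated copy
encoder_k.load_state_dict(encoder_q.state_dict())
for p in encoder_k.parameters():
    p.requires_grad = False

queue = F.normalize(torch.randn(queue_size, dim), dim=1)  # queue of negative keys

def moco_step(seg_q, seg_k):
    """One training step; seg_q and seg_k are two crops of the same utterances."""
    global queue
    q = F.normalize(encoder_q(seg_q), dim=1)
    with torch.no_grad():
        # Momentum update of the key encoder.
        for pq, pk in zip(encoder_q.parameters(), encoder_k.parameters()):
            pk.data = momentum * pk.data + (1.0 - momentum) * pq.data
        k = F.normalize(encoder_k(seg_k), dim=1)
    # InfoNCE: each query's positive is its own key; negatives come from the queue.
    l_pos = (q * k).sum(dim=1, keepdim=True)             # (B, 1)
    l_neg = q @ queue.t()                                # (B, queue_size)
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long)    # positive sits at index 0
    loss = F.cross_entropy(logits, labels)
    # Enqueue the new keys, dequeue the oldest.
    queue = torch.cat([k, queue], dim=0)[:queue_size]
    return loss

# Example: a batch of 8 two-second crops with 40-dim log-mel features.
loss = moco_step(torch.randn(8, 200, 40), torch.randn(8, 200, 40))
```

In the pretraining scenario from the abstract, `encoder_q` trained this way would then initialize the supervised speaker-verification model before finetuning.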
