Multi-Channel Speaker Verification for Single and Multi-talker Speech

Paper Title

Multi-Channel Speaker Verification for Single and Multi-talker Speech

Authors

Saurabh Kataria, Shi-Xiong Zhang, Dong Yu

Abstract

To improve speaker verification in real scenarios with interfering speakers, noise, and reverberation, we propose to bring together advancements made in multi-channel speech features. Specifically, we combine spectral, spatial, and directional features, which include inter-channel phase difference, multi-channel sinc convolutions, directional power ratio features, and angle features. To maximally leverage supervised learning, our framework is also equipped with multi-channel speech enhancement and voice activity detection. On all simulated, replayed, and real recordings, we observe large and consistent improvements at various degradation levels. On real recordings of multi-talker speech, we achieve a 36% relative reduction in equal error rate (EER) relative to the single-channel baseline. We find that the improvements from speaker-dependent directional features are more consistent in multi-talker conditions than in clean conditions. Lastly, we investigate whether the learned multi-channel speaker embedding space can be made more discriminative through contrastive loss-based fine-tuning. With a simple choice of triplet loss, we observe a further 8.3% relative reduction in EER.
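
The abstract reports its results as relative reductions in equal error rate. As a reading aid, the following is a minimal sketch of how that metric is commonly computed from verification scores. It is not the authors' evaluation code. The function names, the threshold-sweep approximation, and all numeric values are illustrative assumptions and do not come from the paper.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    Assumption: a trial is accepted when score >= threshold.
    Returns the mean of the false acceptance rate (FAR) and the false
    rejection rate (FRR) at the threshold where the two are closest.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # Target (same-speaker) trials rejected at this threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # Non-target (different-speaker) trials accepted at this threshold.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_reduction(baseline, improved):
    """Relative reduction of an error metric, as a fraction of the baseline."""
    return (baseline - improved) / baseline


# Hypothetical scores, chosen only to illustrate the computation.
toy_targets = [0.9, 0.8, 0.7, 0.4]
toy_nontargets = [0.6, 0.3, 0.2, 0.1]
toy_eer = equal_error_rate(toy_targets, toy_nontargets)

# Hypothetical EER values: moving from 0.25 to 0.16 is a 36% relative reduction.
toy_gain = relative_reduction(0.25, 0.16)
```

A relative reduction is measured against the baseline error, so a 36% relative reduction means the new EER is 0.64 times the baseline EER, whatever the baseline's absolute value is.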
