Time-domain speech super-resolution with GAN based modeling for telephony speaker verification

Paper title

Time-domain speech super-resolution with GAN based modeling for telephony speaker verification

Authors

Saurabh Kataria, Jesús Villalba, Laureano Moro-Velázquez, Piotr Żelasko, Najim Dehak

Abstract

Automatic Speaker Verification (ASV) technology has become commonplace in virtual assistants. However, its performance suffers when there is a mismatch between the train and test domains. Mixed bandwidth training, i.e., pooling training data from both domains, is a preferred choice for developing a universal model that works for both narrowband and wideband domains. We propose complementing this technique by performing neural upsampling of narrowband signals, also known as bandwidth extension. Our main goal is to discover and analyze high-performing time-domain Generative Adversarial Network (GAN) based models to improve our downstream state-of-the-art ASV system. We choose GANs since they (1) are powerful for learning conditional distributions and (2) allow flexible plug-in usage as a pre-processor during the training of the downstream task (ASV) with data augmentation. Prior works mainly focus on feature-domain bandwidth extension and limited experimental setups. We address these limitations by 1) using time-domain extension models, 2) reporting results on three real test sets, 3) extending training data, and 4) devising new test-time schemes. We compare supervised (conditional GAN) and unsupervised (CycleGAN) GANs and demonstrate average relative improvements in Equal Error Rate of 8.6% and 7.7%, respectively. For further analysis, we study changes in spectrogram visual quality, audio perceptual quality, t-SNE embeddings, and ASV score distributions. We show that our bandwidth extension leads to phenomena such as a shift of telephone (test) embeddings towards wideband (train) signals, a negative correlation of perceptual quality with downstream performance, and condition-independent score calibration.
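The headline metric in the abstract is a relative improvement in Equal Error Rate (EER). As a minimal sketch of how such a number is typically computed (not the authors' evaluation code), the snippet below estimates EER by sweeping a decision threshold over the trial scores and then compares two systems. The function names `equal_error_rate` and `relative_improvement`, and all scores and EER values in the usage note, are hypothetical and chosen only for illustration.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping thresholds over all observed scores.

    Assumption: higher scores mean "same speaker". This is a simple
    threshold sweep without interpolation, not an official scoring tool.
    """
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target, nontarget]))
    # False rejection rate: target trials scored below the threshold.
    frr = np.array([(target < t).mean() for t in thresholds])
    # False acceptance rate: non-target trials scored at or above it.
    far = np.array([(nontarget >= t).mean() for t in thresholds])
    # The EER is taken where the two error rates are closest.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0

def relative_improvement(baseline_eer, new_eer):
    """Relative EER reduction, in percent, with respect to the baseline."""
    return 100.0 * (baseline_eer - new_eer) / baseline_eer
```

For example, with made-up scores `equal_error_rate([2, 3, 4, 5], [0, 1, 2.5, 1.5])` evaluates to 0.25, and a hypothetical drop in EER from 10.0% to 9.14% corresponds to `relative_improvement(10.0, 9.14)`, about 8.6% relative.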
