Adversarial Data Augmentation Using VAE-GAN for Disordered Speech Recognition

Paper Title

Adversarial Data Augmentation Using VAE-GAN for Disordered Speech Recognition

Authors

Jin, Zengrui; Xie, Xurong; Geng, Mengzhe; Wang, Tianzi; Hu, Shujie; Deng, Jiajun; Li, Guinan; Liu, Xunying

Abstract

Automatic recognition of disordered speech remains a highly challenging task to date. The underlying neuro-motor conditions, often compounded with co-occurring physical disabilities, make it difficult to collect the large quantities of impaired speech required for ASR system development. This paper presents novel variational auto-encoder generative adversarial network (VAE-GAN) based personalized disordered speech augmentation approaches that simultaneously learn to encode, generate and discriminate synthesized impaired speech. Separate latent features are derived to learn dysarthric speech characteristics and phoneme context representations. Self-supervised pre-trained Wav2vec 2.0 embedding features are also incorporated. Experiments conducted on the UASpeech corpus suggest the proposed adversarial data augmentation approach consistently outperformed the baseline speed perturbation and non-VAE GAN augmentation methods when training hybrid TDNN and end-to-end Conformer systems. After LHUC speaker adaptation, the best system using VAE-GAN based augmentation produced an overall WER of 27.78% on the UASpeech test set of 16 dysarthric speakers, and the lowest published WER of 57.31% on the subset of speakers with "Very Low" intelligibility.
