Title
LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders
Authors
Abstract
Audio-visual speech enhancement aims to extract clean speech from a noisy environment by leveraging not only the audio itself but also the target speaker's lip movements. This approach has been shown to yield improvements over audio-only speech enhancement, particularly for the removal of interfering speech. Despite recent advances in speech synthesis, most audio-visual approaches continue to use spectral mapping/masking to reproduce the clean audio, often resulting in visual backbones added to existing speech enhancement architectures. In this work, we propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture, and then converts them into waveform audio using a neural vocoder (HiFi-GAN). We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference. Our experiments show that LA-VocE outperforms existing methods according to multiple metrics, particularly under very noisy scenarios.
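The two-stage design described in the abstract (a transformer-based enhancer predicting mel-spectrograms from noisy audio-visual input, followed by a neural vocoder synthesizing the waveform) can be sketched structurally as follows. All shapes, names, and the placeholder fusion and synthesis steps are illustrative assumptions to make the sketch runnable, not the paper's actual model:

```python
import numpy as np

N_MELS = 80   # mel bands predicted by stage 1 (assumed)
HOP = 256     # vocoder hop size in samples per frame (assumed)

def stage1_enhancer(noisy_mel, lip_feats):
    """Stage 1 stand-in: fuse the noisy-audio mel frames with frame-aligned
    visual features into a clean mel-spectrogram estimate. The real model is
    a transformer; a per-frame average keeps this sketch self-contained."""
    assert noisy_mel.shape == lip_feats.shape  # both (frames, N_MELS)
    return 0.5 * (noisy_mel + lip_feats)       # placeholder fusion

def stage2_vocoder(mel):
    """Stage 2 stand-in for a neural vocoder (HiFi-GAN in the paper):
    expand each mel frame into HOP waveform samples."""
    frames = mel.shape[0]
    energy = mel.mean(axis=1, keepdims=True)           # (frames, 1)
    # Placeholder synthesis: hold each frame's energy across its hop window.
    return np.repeat(energy, HOP, axis=1).reshape(frames * HOP)

# Tiny end-to-end run on random inputs.
rng = np.random.default_rng(0)
T = 10                                                 # frames
noisy_mel = rng.standard_normal((T, N_MELS))
lip_feats = rng.standard_normal((T, N_MELS))           # assumed pre-projected

clean_mel = stage1_enhancer(noisy_mel, lip_feats)
waveform = stage2_vocoder(clean_mel)
print(clean_mel.shape, waveform.shape)                 # (10, 80) (2560,)
```

The point of the split is that stage 1 only has to solve the enhancement problem in a compact mel-spectrogram space, while waveform fidelity is delegated to a vocoder trained for exactly that task.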