Title
LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders
Authors
Abstract
Audio-visual speech enhancement aims to extract clean speech from a noisy environment by leveraging not only the audio itself but also the target speaker's lip movements. This approach has been shown to yield improvements over audio-only speech enhancement, particularly for the removal of interfering speech. Despite recent advances in speech synthesis, most audio-visual approaches continue to use spectral mapping/masking to reproduce the clean audio, often resulting in visual backbones added to existing speech enhancement architectures. In this work, we propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture, and then converts them into waveform audio using a neural vocoder (HiFi-GAN). We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference. Our experiments show that LA-VocE outperforms existing methods according to multiple metrics, particularly under very noisy scenarios.
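The two-stage design described in the abstract (a transformer-based enhancer predicting mel-spectrograms from noisy audio-visual input, followed by a neural vocoder synthesizing the waveform) can be sketched structurally as follows. All shapes, names, and the placeholder fusion and synthesis steps are illustrative assumptions to make the sketch runnable, not the paper's actual model:

```python
import numpy as np

N_MELS = 80   # mel bands predicted by stage 1 (assumed)
HOP = 256     # vocoder hop size in samples per frame (assumed)

def stage1_enhancer(noisy_mel, lip_feats):
    """Stage 1 stand-in: fuse the noisy-audio mel frames with frame-aligned
    visual features into a clean mel-spectrogram estimate. The real model is
    a transformer; a per-frame average keeps this sketch self-contained."""
    assert noisy_mel.shape == lip_feats.shape  # both (frames, N_MELS)
    return 0.5 * (noisy_mel + lip_feats)       # placeholder fusion

def stage2_vocoder(mel):
    """Stage 2 stand-in for a neural vocoder (HiFi-GAN in the paper):
    expand each mel frame into HOP waveform samples."""
    frames = mel.shape[0]
    energy = mel.mean(axis=1, keepdims=True)           # (frames, 1)
    # Placeholder synthesis: hold each frame's energy across its hop window.
    return np.repeat(energy, HOP, axis=1).reshape(frames * HOP)

# Tiny end-to-end run on random inputs.
rng = np.random.default_rng(0)
T = 10                                                 # frames
noisy_mel = rng.standard_normal((T, N_MELS))
lip_feats = rng.standard_normal((T, N_MELS))           # assumed pre-projected

clean_mel = stage1_enhancer(noisy_mel, lip_feats)
waveform = stage2_vocoder(clean_mel)
print(clean_mel.shape, waveform.shape)                 # (10, 80) (2560,)
```

The point of the split is that stage 1 only has to solve the enhancement problem in a compact mel-spectrogram space, while waveform fidelity is delegated to a vocoder trained for exactly that task.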