Title
SpEx: Multi-Scale Time Domain Speaker Extraction Network
Authors
Abstract
Speaker extraction aims to mimic humans' selective auditory attention by extracting a target speaker's voice from a multi-talker environment. It is common to perform the extraction in the frequency domain and to reconstruct the time-domain signal from the extracted magnitude and estimated phase spectra. However, such an approach is adversely affected by the inherent difficulty of phase estimation. Inspired by Conv-TasNet, we propose a time-domain speaker extraction network (SpEx) that converts the mixture speech into multi-scale embedding coefficients instead of decomposing the speech signal into magnitude and phase spectra. In this way, we avoid phase estimation. The SpEx network consists of four components, namely a speaker encoder, a speech encoder, a speaker extractor, and a speech decoder. Specifically, the speech encoder converts the mixture speech into multi-scale embedding coefficients, while the speaker encoder learns to represent the target speaker with a speaker embedding. The speaker extractor takes the multi-scale embedding coefficients and the target speaker embedding as input and estimates a receptive mask. Finally, the speech decoder reconstructs the target speaker's speech from the masked embedding coefficients. We also propose a multi-task learning framework and a multi-scale embedding implementation. Experimental results show that the proposed SpEx achieves 37.3%, 37.7% and 15.0% relative improvements over the best baseline in terms of signal-to-distortion ratio (SDR), scale-invariant SDR (SI-SDR), and perceptual evaluation of speech quality (PESQ) under an open evaluation condition.
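The encode–mask–decode pipeline the abstract describes can be sketched end to end. The following is a minimal illustrative NumPy sketch, not the paper's implementation: the encoder bases are random stand-ins for learned 1-D convolution filters, the mask is a fixed placeholder for the speaker extractor's output, the speaker-embedding branch is omitted, and the window lengths, stride, and fusion-by-averaging are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def speech_encoder(mixture, win, stride, n_filters=8):
    # Frame the time-domain mixture and project each frame onto a basis of
    # length `win` (a stand-in for a learned 1-D conv encoder), then apply
    # ReLU to obtain non-negative embedding coefficients.
    basis = rng.standard_normal((n_filters, win))
    frames = np.stack([mixture[i:i + win]
                       for i in range(0, len(mixture) - win + 1, stride)])
    return np.maximum(frames @ basis.T, 0.0), basis

def speech_decoder(coeffs, basis, stride, length):
    # Reconstruct a waveform from (masked) embedding coefficients by
    # projecting back through the basis and overlap-adding the frames.
    win = basis.shape[1]
    out = np.zeros(length)
    for t, frame_coeffs in enumerate(coeffs):
        out[t * stride : t * stride + win] += frame_coeffs @ basis
    return out

# Multi-scale encoding: several window lengths sharing one stride,
# mirroring the short/middle/long time scales in the abstract.
mixture = rng.standard_normal(1600)   # e.g. 0.2 s of audio at 8 kHz (assumed)
stride = 20
scales = [40, 80, 160]                # hypothetical window lengths in samples

estimates = []
for win in scales:
    coeffs, basis = speech_encoder(mixture, win, stride)
    mask = np.full_like(coeffs, 0.5)  # placeholder for the extractor's mask
    estimates.append(speech_decoder(coeffs * mask, basis, stride, len(mixture)))

# Fuse the per-scale reconstructions into one target-speaker estimate.
target_estimate = np.mean(estimates, axis=0)
print(target_estimate.shape)
```

In the actual network the mask would be predicted from both the embedding coefficients and the target speaker embedding; here the fixed mask only demonstrates how masking and reconstruction compose across scales.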