Improved Source Counting and Separation for Monaural Mixture

Paper Title

Improved Source Counting and Separation for Monaural Mixture

Authors

Xiao, Yiming; Zhang, Haijian

Abstract

Single-channel speech separation in the time domain and the frequency domain has been widely studied for voice-driven applications over the past few years. However, most previous works assume that the number of speakers is known in advance, information that is not easily obtained from a monaural mixture in practice. In this paper, we propose a novel model for single-channel multi-speaker separation that jointly learns the time-frequency feature and the unknown number of speakers. Specifically, our model integrates the time-domain convolution-encoded feature map and the frequency-domain spectrogram through an attention mechanism. The integrated features are projected into high-dimensional embedding vectors, which are then clustered with a deep attractor network to modify the encoded feature. Meanwhile, the number of speakers is counted by computing the Gerschgorin disks of the embedding vectors, which are orthogonal for different speakers. Finally, the modified encoded feature is inverted to the sound waveform using a linear decoder. Experimental evaluation on the GRID dataset shows that the proposed method, with a single model, can accurately estimate the number of speakers with a 96.7% probability of success, while achieving state-of-the-art separation results on multi-speaker mixtures in terms of scale-invariant signal-to-noise ratio improvement (SI-SNRi) and signal-to-distortion ratio improvement (SDRi).
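
For readers unfamiliar with the SI-SNRi metric named in the abstract, the following is a minimal sketch of how it is commonly computed. It follows the standard definition of scale-invariant SNR and is not taken from the paper; the function names `si_snr` and `si_snr_improvement` and the `eps` stabilizer are assumptions made for illustration.

```python
import math


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length sample sequences.

    Standard definition (assumed here, not quoted from the paper):
    both signals are zero-meaned, the estimate is projected onto the
    reference, and the ratio of projected energy to residual energy
    is reported in decibels.
    """
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")
    n = len(reference)
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]

    # Projection of the estimate onto the reference signal.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    scale = dot / ref_energy
    target = [scale * r for r in ref]

    # Everything in the estimate that is not explained by the reference.
    noise = [e - t for e, t in zip(est, target)]

    target_energy = sum(t * t for t in target)
    noise_energy = sum(v * v for v in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


def si_snr_improvement(estimate, mixture, reference):
    """SI-SNRi: gain of the separated estimate over the unprocessed mixture."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)


if __name__ == "__main__":
    # Toy signals: a target sinusoid plus an interfering sinusoid.
    reference = [math.sin(0.1 * i) for i in range(200)]
    interferer = [math.sin(0.37 * i + 1.0) for i in range(200)]
    mixture = [r + v for r, v in zip(reference, interferer)]
    # Hypothetical separator output with the interferer attenuated tenfold.
    estimate = [r + 0.1 * v for r, v in zip(reference, interferer)]
    print(f"SI-SNRi: {si_snr_improvement(estimate, mixture, reference):.2f} dB")
```

Because the estimate is projected onto the reference before the energies are compared, rescaling the estimate leaves the score essentially unchanged, which is why the metric suits separation models whose output gain is arbitrary.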
