Paper title
Consistency-aware multi-channel speech enhancement using deep neural networks
Paper authors
Paper abstract
This paper proposes a deep neural network (DNN)-based multi-channel speech enhancement system in which the DNN is trained to maximize the quality of the enhanced time-domain signal. DNN-based multi-channel speech enhancement is often conducted in the time-frequency (T-F) domain because spatial filtering can be implemented efficiently there. In such cases, ordinary objective functions are computed on the estimated T-F mask or spectrogram. However, the estimated spectrogram is often inconsistent, so its amplitude and phase may change when it is converted back to the time domain. That is, the objective function does not properly evaluate the enhanced time-domain signal. To address this problem, we propose to use an objective function defined on the reconstructed time-domain signal. Specifically, speech enhancement is conducted by multi-channel Wiener filtering in the T-F domain, and the result is converted back to the time domain. We propose two objective functions computed on the reconstructed signal: the first is defined in the time domain, and the second is defined in the T-F domain. Our experiment demonstrates the effectiveness of the proposed system compared with T-F masking and mask-based beamforming.
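The inconsistency described in the abstract can be illustrated with a minimal sketch. It is not the authors' system: it uses a single channel and a random T-F mask as a stand-in for a DNN output, rather than multi-channel Wiener filtering. The helpers `stft` and `istft`, the mean-squared-error losses, and all parameter values are assumptions chosen for illustration. The sketch shows that a masked spectrogram generally changes after an inverse STFT followed by an STFT, and how a loss can instead be computed on the reconstructed signal in either the time domain or the T-F domain.

```python
import numpy as np

def stft(x, n_fft=256, hop=64):
    """STFT with a periodic Hann window; returns shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft + 1)[:-1]
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, n_fft=256, hop=64):
    """Inverse STFT: windowed overlap-add, normalized by the summed squared window."""
    window = np.hanning(n_fft + 1)[:-1]
    n_frames = spec.shape[0]
    length = n_fft + hop * (n_frames - 1)
    x = np.zeros(length)
    norm = np.zeros(length)
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    for i in range(n_frames):
        sl = slice(i * hop, i * hop + n_fft)
        x[sl] += frames[i] * window
        norm[sl] += window ** 2
    return x / np.maximum(norm, 1e-12)

# Toy data (assumption): a sine tone as "clean speech" plus white noise.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(4096) / 16000)
noisy = clean + 0.3 * rng.standard_normal(4096)

S_clean = stft(clean)
S_noisy = stft(noisy)

# Stand-in for a DNN-estimated mask (assumption): random values in [0, 1].
mask = rng.uniform(0.0, 1.0, size=S_noisy.shape)
S_est = mask * S_noisy            # estimated spectrogram, generally inconsistent

enhanced = istft(S_est)           # reconstructed time-domain signal
S_re = stft(enhanced)             # spectrogram of the reconstructed signal

# How much the spectrogram changes after the round trip through the time domain.
inconsistency = np.mean(np.abs(S_re - S_est) ** 2)

# Ordinary loss on the estimated spectrogram, which ignores the inconsistency.
loss_spec = np.mean(np.abs(S_est - S_clean) ** 2)
# Two losses on the reconstructed signal: time domain and T-F domain.
loss_time = np.mean((enhanced - clean) ** 2)
loss_tf = np.mean(np.abs(S_re - S_clean) ** 2)

print("inconsistency:", inconsistency)
print("loss on estimated spectrogram:", loss_spec)
print("time-domain loss on reconstruction:", loss_time)
print("T-F-domain loss on reconstruction:", loss_tf)
```

In this sketch, `loss_spec` scores a spectrogram that no time-domain signal actually has, whereas `loss_time` and `loss_tf` score the signal that would really be output. In a trainable system, the STFT and inverse STFT would need to be differentiable so that gradients could reach the network.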