Consistency-aware multi-channel speech enhancement using deep neural networks

Paper title

Consistency-aware multi-channel speech enhancement using deep neural networks

Authors

Yoshiki Masuyama, Masahito Togami, Tatsuya Komatsu

Abstract

This paper proposes a deep neural network (DNN)-based multi-channel speech enhancement system in which a DNN is trained to maximize the quality of the enhanced time-domain signal. DNN-based multi-channel speech enhancement is often conducted in the time-frequency (T-F) domain because spatial filtering can be efficiently implemented in the T-F domain. In such a case, ordinary objective functions are computed on the estimated T-F mask or spectrogram. However, the estimated spectrogram is often inconsistent, and its amplitude and phase may change when the spectrogram is converted back to the time domain. That is, the objective function does not properly evaluate the enhanced time-domain signal. To address this problem, we propose to use an objective function defined on the reconstructed time-domain signal. Specifically, speech enhancement is conducted by multi-channel Wiener filtering in the T-F domain, and its result is converted back to the time domain. We propose two objective functions computed on the reconstructed signal: the first is defined in the time domain, and the other is defined in the T-F domain. Our experiment demonstrates the effectiveness of the proposed system compared with T-F masking and mask-based beamforming.
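The inconsistency described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' system: it is single-channel, it uses an oracle ratio mask as a stand-in for a DNN output instead of multi-channel Wiener filtering, and the helper names (`stft`, `istft`, `rel_gap`), the test signal, and the window and hop sizes are all assumptions chosen for illustration. It shows that a masked spectrogram changes when it is converted to the time domain and analysed again, and it computes a loss on the raw estimate next to losses on the reconstructed signal in both domains.

```python
import numpy as np

def stft(x, win, hop):
    """Windowed FFT of overlapping frames; returns an array of shape (frames, bins)."""
    n = len(win)
    starts = range(0, len(x) - n + 1, hop)
    return np.array([np.fft.rfft(win * x[s:s + n]) for s in starts])

def istft(X, win, hop, length):
    """Weighted overlap-add inverse of stft (assumed helper, not a library call)."""
    n = len(win)
    y = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(np.fft.irfft(X, n=n, axis=1)):
        s = i * hop
        y[s:s + n] += win * frame
        norm[s:s + n] += win ** 2
    return y / np.maximum(norm, 1e-12)

def rel_gap(A, B):
    """Relative distance between two spectrograms."""
    return np.linalg.norm(A - B) / np.linalg.norm(B)

# Toy data (assumption): a 440 Hz tone in white noise, sampled at 16 kHz.
rng = np.random.default_rng(0)
n, hop = 256, 64
win = np.hanning(n + 1)[:-1]            # periodic Hann window
t = np.arange(4096) / 16000.0
clean = np.sin(2 * np.pi * 440.0 * t)
noisy = clean + 0.5 * rng.standard_normal(t.size)

S = stft(clean, win, hop)               # clean spectrogram (training target)
Y = stft(noisy, win, hop)               # noisy spectrogram

# Oracle ratio mask standing in for a DNN-estimated T-F mask.
mask = np.abs(S) / (np.abs(S) + np.abs(Y - S) + 1e-8)
X_hat = mask * Y                        # enhanced spectrogram, generally inconsistent

x_hat = istft(X_hat, win, hop, len(noisy))   # reconstructed time-domain signal
X_re = stft(x_hat, win, hop)                 # spectrogram of the reconstruction

# A spectrogram that comes from a real signal survives the round trip;
# the masked one does not.
gap_valid = rel_gap(stft(istft(Y, win, hop, len(noisy)), win, hop), Y)
gap_masked = rel_gap(X_re, X_hat)

# Loss on the raw estimate versus losses defined on the reconstructed signal.
loss_tf_raw = np.mean(np.abs(X_hat - S) ** 2)
loss_time = np.mean((x_hat - clean) ** 2)
loss_tf_reconstructed = np.mean(np.abs(X_re - S) ** 2)

print("round-trip gap, valid spectrogram :", gap_valid)
print("round-trip gap, masked spectrogram:", gap_masked)
print("losses (raw T-F, time, reconstructed T-F):",
      loss_tf_raw, loss_time, loss_tf_reconstructed)
```

In a training setup, the inverse transform would have to be differentiable so that gradients of the time-domain or re-analysed T-F loss reach the network; the NumPy version here only demonstrates the quantities involved.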
