Paper Title
Directional ASR: A New Paradigm for E2E Multi-Speaker Speech Recognition with Source Localization
Paper Authors
Abstract
This paper proposes a new paradigm for handling far-field multi-speaker data in an end-to-end neural network manner, called directional automatic speech recognition (D-ASR), which explicitly models source speaker locations. In D-ASR, the azimuth angle of each source with respect to the microphone array is defined as a latent variable. This angle controls the quality of separation, which in turn determines the ASR performance. All three functionalities of D-ASR (localization, separation, and recognition) are connected as a single differentiable neural network and trained solely on ASR error minimization objectives. The advantages of D-ASR over existing methods are threefold: (1) it provides explicit speaker locations, (2) it improves explainability, and (3) it achieves better ASR performance because the process is more streamlined. In addition, unlike existing data-driven localization models, D-ASR does not require explicit direction of arrival (DOA) supervision, which makes it more appropriate for realistic data. For two-source mixtures, D-ASR achieves an average DOA prediction error of less than three degrees. It also outperforms a strong far-field multi-speaker end-to-end system in both separation quality and ASR performance.
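To illustrate why an azimuth angle can control separation quality, the following is a minimal sketch of a standard far-field steering-vector model with a delay-and-sum beamformer. It is not the D-ASR network itself: the abstract does not specify the array geometry or the beamformer, so the uniform circular array, the single narrowband frequency, the speed of sound, and all function names (`circular_array`, `steering_vector`, `delay_and_sum_power`, `estimate_azimuth`) are assumptions made for illustration only.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at room temperature


def circular_array(num_mics, radius):
    """Return (num_mics, 2) xy-positions of a uniform circular array (assumed geometry)."""
    angles = 2 * np.pi * np.arange(num_mics) / num_mics
    return np.stack([radius * np.cos(angles), radius * np.sin(angles)], axis=1)


def steering_vector(azimuth_deg, mic_xy, freq_hz):
    """Far-field (plane-wave) steering vector for a source at the given azimuth."""
    theta = np.deg2rad(azimuth_deg)
    direction = np.array([np.cos(theta), np.sin(theta)])  # unit vector toward the source
    # A microphone closer to the source receives the wavefront earlier,
    # so its delay relative to the array centre is negative.
    delays = -(mic_xy @ direction) / SPEED_OF_SOUND
    return np.exp(-2j * np.pi * freq_hz * delays)


def delay_and_sum_power(azimuth_deg, observation, mic_xy, freq_hz):
    """Output power of a delay-and-sum beamformer steered at azimuth_deg."""
    weights = steering_vector(azimuth_deg, mic_xy, freq_hz) / len(mic_xy)
    # np.vdot conjugates its first argument, giving w^H x.
    return float(np.abs(np.vdot(weights, observation)) ** 2)


def estimate_azimuth(observation, mic_xy, freq_hz, grid_deg=None):
    """Grid search for the azimuth that maximises the beamformer output power."""
    if grid_deg is None:
        grid_deg = np.arange(0, 360)
    powers = [delay_and_sum_power(a, observation, mic_xy, freq_hz) for a in grid_deg]
    return float(grid_deg[int(np.argmax(powers))])


if __name__ == "__main__":
    mics = circular_array(num_mics=4, radius=0.05)
    # Noise-free narrowband observation of a single source at 70 degrees.
    x = steering_vector(70.0, mics, freq_hz=1000.0)
    print("estimated azimuth:", estimate_azimuth(x, mics, freq_hz=1000.0))
    print("power when steered at 70 deg:", delay_and_sum_power(70.0, x, mics, 1000.0))
    print("power when steered at 160 deg:", delay_and_sum_power(160.0, x, mics, 1000.0))
```

In this toy setting the beamformer output is largest when the steering angle matches the true source direction and drops as the angle moves away, which is the sense in which the angle governs how well a source is isolated. Every step is built from differentiable operations on the angle, so in principle gradients from a downstream loss could flow back to an angle estimate; how D-ASR actually realizes this is described in the paper, not in this sketch.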