Directional ASR: A New Paradigm for E2E Multi-Speaker Speech Recognition with Source Localization

Paper Title

Directional ASR: A New Paradigm for E2E Multi-Speaker Speech Recognition with Source Localization

Authors

Aswin Shanmugam Subramanian, Chao Weng, Shinji Watanabe, Meng Yu, Yong Xu, Shi-Xiong Zhang, Dong Yu

Abstract

This paper proposes a new paradigm for handling far-field multi-speaker data in an end-to-end neural network manner, called directional automatic speech recognition (D-ASR), which explicitly models source speaker locations. In D-ASR, the azimuth angle of the sources with respect to the microphone array is defined as a latent variable. This angle controls the quality of separation, which in turn determines the ASR performance. All three functionalities of D-ASR: localization, separation, and recognition are connected as a single differentiable neural network and trained solely based on ASR error minimization objectives. The advantages of D-ASR over existing methods are threefold: (1) it provides explicit speaker locations, (2) it improves the explainability factor, and (3) it achieves better ASR performance as the process is more streamlined. In addition, D-ASR does not require explicit direction of arrival (DOA) supervision like existing data-driven localization models, which makes it more appropriate for realistic data. For the case of two source mixtures, D-ASR achieves an average DOA prediction error of less than three degrees. It also outperforms a strong far-field multi-speaker end-to-end system in both separation quality and ASR performance.
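To make the role of the azimuth angle concrete, the following is a minimal sketch of how a candidate angle can determine a far-field steering vector, and how a DOA prediction error in degrees can be measured with wraparound. It is not the paper's model: D-ASR learns the angle inside a neural network trained on ASR objectives, whereas this sketch uses a plain delay-and-sum response. The circular array geometry, the 5 cm radius, the 2 kHz frequency, the speed of sound, and all function names are assumptions chosen for illustration.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed room-temperature value


def circular_array_positions(num_mics, radius):
    """(x, y) coordinates of microphones evenly spaced on a circle (assumed geometry)."""
    return [
        (radius * math.cos(2 * math.pi * m / num_mics),
         radius * math.sin(2 * math.pi * m / num_mics))
        for m in range(num_mics)
    ]


def steering_vector(azimuth_deg, mic_positions, freq_hz):
    """Far-field steering vector for a plane wave arriving from azimuth_deg."""
    theta = math.radians(azimuth_deg)
    ux, uy = math.cos(theta), math.sin(theta)
    vec = []
    for x, y in mic_positions:
        # A microphone displaced toward the source receives the wavefront
        # earlier, so its delay relative to the array center is negative.
        delay = -(x * ux + y * uy) / SPEED_OF_SOUND
        vec.append(cmath.exp(-2j * math.pi * freq_hz * delay))
    return vec


def beam_response(look_deg, source_deg, mic_positions, freq_hz):
    """Normalized delay-and-sum response when steering at look_deg
    toward a single source located at source_deg."""
    w = steering_vector(look_deg, mic_positions, freq_hz)
    a = steering_vector(source_deg, mic_positions, freq_hz)
    return abs(sum(wi.conjugate() * ai for wi, ai in zip(w, a))) / len(w)


def doa_error_deg(pred_deg, true_deg):
    """Absolute angular error in degrees, accounting for 360-degree wraparound."""
    diff = abs(pred_deg - true_deg) % 360.0
    return min(diff, 360.0 - diff)


if __name__ == "__main__":
    mics = circular_array_positions(num_mics=4, radius=0.05)
    print(beam_response(60.0, 60.0, mics, 2000.0))   # steering at the true angle
    print(beam_response(60.0, 150.0, mics, 2000.0))  # steering 90 degrees away
    print(doa_error_deg(359.0, 2.0))
```

In this sketch the response is largest when the look angle matches the source angle and drops when it does not, which illustrates why the quality of the angle estimate governs the quality of separation downstream.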
