Paper Title

Minimum Latency Training of Sequence Transducers for Streaming End-to-End Speech Recognition

Authors

Yusuke Shinohara, Shinji Watanabe

Abstract

Sequence transducers, such as the RNN-T and the Conformer-T, are among the most promising models for end-to-end speech recognition, especially in streaming scenarios where both latency and accuracy matter. Although various methods, such as alignment-restricted training and FastEmit, have been studied to reduce latency, the reduction is often accompanied by a significant degradation in accuracy. We argue that this suboptimal performance may arise because none of the prior methods explicitly model and reduce the latency. In this paper, we propose a new training method that explicitly models and reduces the latency of sequence transducer models. First, we define the expected latency at each diagonal line on the lattice, and show that its gradient can be computed efficiently within the forward-backward algorithm. Then we augment the transducer loss with this expected latency, so that an optimal trade-off between latency and accuracy is achieved. Experimental results on the WSJ dataset show that the proposed minimum latency training reduces the latency of a causal Conformer-T from 220 ms to 27 ms within a WER degradation of 0.7%, outperforming the conventional alignment-restricted training (110 ms) and FastEmit (67 ms) methods.
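The core idea described in the abstract — augmenting the transducer loss with an expected-latency term derived from lattice posteriors — can be illustrated with a minimal sketch. This is not the paper's implementation: the per-token emission posteriors `emit_post` (assumed here to come from the transducer forward-backward pass), the reference frames `ref_frames`, and the trade-off weight `lam` are all hypothetical names and values for illustration.

```python
import numpy as np

def expected_latency(emit_post, ref_frames):
    """Mean expected emission delay relative to a reference alignment.

    emit_post:  [U, T] array; row u is the posterior probability that
                token u is emitted at each frame t (rows sum to 1),
                as obtained from the forward-backward algorithm.
    ref_frames: [U] reference emission frames for each token.
    """
    frames = np.arange(emit_post.shape[1])
    exp_frames = emit_post @ frames  # E[emission frame] per token
    return float(np.mean(exp_frames - np.asarray(ref_frames)))

def minimum_latency_loss(transducer_loss, emit_post, ref_frames, lam=0.05):
    # Augment the transducer loss with the expected latency; lam is a
    # hypothetical trade-off weight, not a value from the paper.
    return transducer_loss + lam * expected_latency(emit_post, ref_frames)
```

Because `expected_latency` is a differentiable function of the lattice posteriors, its gradient can be propagated through the same forward-backward quantities used for the transducer loss itself, which is what makes the joint optimization tractable.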
