Paper Title


Minimum Latency Training Strategies for Streaming Sequence-to-Sequence ASR

Authors

Hirofumi Inaguma, Yashesh Gaur, Liang Lu, Jinyu Li, Yifan Gong

Abstract


Recently, a few novel streaming attention-based sequence-to-sequence (S2S) models have been proposed to perform online speech recognition with linear-time decoding complexity. However, in these models, the decisions to generate tokens are delayed compared to the actual acoustic boundaries since their unidirectional encoders lack future information. This leads to an inevitable latency during inference. To alleviate this issue and reduce latency, we propose several strategies during training by leveraging external hard alignments extracted from the hybrid model. We investigate to utilize the alignments in both the encoder and the decoder. On the encoder side, (1) multi-task learning and (2) pre-training with the framewise classification task are studied. On the decoder side, we (3) remove inappropriate alignment paths beyond an acceptable latency during the alignment marginalization, and (4) directly minimize the differentiable expected latency loss. Experiments on the Cortana voice search task demonstrate that our proposed methods can significantly reduce the latency, and even improve the recognition accuracy in certain cases on the decoder side. We also present some analysis to understand the behaviors of streaming S2S models.
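The differentiable expected latency loss mentioned in the abstract can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the paper's exact formulation: `boundary_probs` (a per-token distribution over encoder frames for when the token is emitted) and the ReLU-style delay penalty against the hybrid-model hard alignments are hypothetical choices made here for concreteness.

```python
import numpy as np

def expected_latency_loss(boundary_probs, ref_boundaries):
    """Sketch of a differentiable expected latency loss.

    boundary_probs:  (num_tokens, num_frames) array; boundary_probs[i, t]
                     is the model's probability of emitting token i at
                     encoder frame t (rows sum to 1).
    ref_boundaries:  (num_tokens,) array of reference boundary frames taken
                     from the external hard alignment (hybrid model).

    Returns the expected delay (in frames) beyond each reference boundary,
    averaged over tokens. Because it is a probability-weighted sum, it is
    differentiable with respect to boundary_probs.
    """
    num_tokens, num_frames = boundary_probs.shape
    frames = np.arange(num_frames, dtype=float)
    # per-frame delay past the aligned boundary; no penalty for early emission
    delay = np.maximum(frames[None, :] - ref_boundaries[:, None], 0.0)
    # expectation under the emission distribution, averaged over tokens
    return float((boundary_probs * delay).sum(axis=1).mean())
```

For example, if one token is emitted exactly at its aligned boundary and another one frame late, the loss is 0.5 frames; minimizing this term pushes the emission distributions toward (or before) the acoustic boundaries.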
