Paper Title
Improving Streaming End-to-End ASR on Transformer-based Causal Models with Encoder States Revision Strategies
Paper Authors
Paper Abstract
There is often a trade-off between performance and latency in streaming automatic speech recognition (ASR). Traditional approaches, such as look-ahead and chunk-based methods, usually require information from future frames to improve recognition accuracy, which incurs unavoidable latency even when the computation is fast enough. A causal model that computes without any future frames avoids this latency, but its performance is significantly worse than that of traditional methods. In this paper, we propose revision strategies to improve the causal model. First, we introduce a real-time encoder states revision strategy that modifies previous states: encoder forward computation starts as soon as data is received, and the previous encoder states are revised after several more frames arrive, so there is no need to wait for any right context. Furthermore, a CTC spike position alignment decoding algorithm is designed to reduce the extra time cost introduced by the revision strategy. All experiments are conducted on the LibriSpeech dataset. Fine-tuned from the CTC-based wav2vec2.0 model, our best method achieves 3.7/9.2 WERs on the test-clean/test-other sets, which is competitive with chunk-based methods and knowledge distillation methods.
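As a rough illustration of the revision idea sketched in the abstract, the following is a minimal PyTorch snippet of a streaming loop in which a causal encoder's most recent outputs are treated as provisional and re-estimated once a few more frames have arrived. This is an assumption-laden sketch, not the authors' implementation: `stream_encode`, `REVISION_DEPTH`, and the stand-in `encoder` are hypothetical names, the full received prefix is naively recomputed at every step, and the CTC spike position alignment decoding that the paper uses to cut this recomputation cost is omitted.

```python
import torch

# Hypothetical revision depth: how many of the most recent encoder states
# are treated as provisional and re-estimated when newer frames arrive.
REVISION_DEPTH = 4


def stream_encode(encoder, frames):
    """Run a causal encoder frame by frame, revising the most recent states.

    `encoder` is assumed to map a (T, feat_dim) tensor of all frames received
    so far to (T, hidden_dim) encoder states without using future context.
    Only the trailing REVISION_DEPTH states are overwritten on each step, so
    earlier outputs stay fixed once they leave the revision window.
    """
    received = []   # frames seen so far, in arrival order
    states = []     # encoder states: finalized prefix + provisional tail

    for frame in frames:           # frames arrive one at a time
        received.append(frame)
        x = torch.stack(received)  # (T, feat_dim)
        with torch.no_grad():
            out = encoder(x)       # (T, hidden_dim), no right context needed

        # Revise the provisional tail and append the newest state.
        revise_from = max(0, len(received) - 1 - REVISION_DEPTH)
        states = states[:revise_from] + list(out[revise_from:])

    return torch.stack(states)     # (T, hidden_dim)


# Toy usage with a stand-in "encoder" (a per-frame projection) and random data.
toy_encoder = torch.nn.Linear(80, 256)
frames = [torch.randn(80) for _ in range(20)]
encoded = stream_encode(toy_encoder, frames)   # shape: (20, 256)
```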