Title
Streaming Transformer ASR with Blockwise Synchronous Beam Search
Authors
Abstract
The Transformer self-attention network has shown promising performance as an alternative to recurrent neural networks in end-to-end (E2E) automatic speech recognition (ASR) systems. However, the Transformer has a drawback in that the entire input sequence is required to compute both self-attention and source-target attention. In this paper, we propose a novel blockwise synchronous beam search algorithm based on blockwise processing of the encoder to perform streaming E2E Transformer ASR. In the beam search, encoded feature blocks are synchronously aligned using a block boundary detection technique, where a reliability score for each predicted hypothesis is evaluated based on the end-of-sequence and repeated tokens in the hypothesis. Evaluations on the HKUST and AISHELL-1 Mandarin, LibriSpeech English, and CSJ Japanese tasks show that the proposed streaming Transformer algorithm outperforms conventional online approaches, including monotonic chunkwise attention (MoChA), especially when the knowledge distillation technique is used. An ablation study indicates that our streaming approach reduces the response time, and that the repetition criterion contributes significantly in certain tasks. Our streaming ASR models achieve performance comparable or superior to batch models and other streaming-based Transformer methods in all tasks considered.
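
As a rough illustration of the reliability criterion described above, the following Python sketch flags a hypothesis as unreliable when it emits an end-of-sequence token or a repeated token before the current encoded block ends, and stops the within-block search at that point. All names (Hypothesis, EOS_ID, expand_fn, decode_block) are hypothetical and not taken from the authors' implementation; the repetition check is a simplified stand-in for the paper's repetition criterion.

from dataclasses import dataclass
from typing import Callable, List

EOS_ID = 2  # assumed end-of-sequence token id (hypothetical)

@dataclass
class Hypothesis:
    token_ids: List[int]  # decoded token ids so far
    score: float          # cumulative log-probability

def is_unreliable(hyp: Hypothesis) -> bool:
    # Treat a hypothesis as unreliable if its latest token is
    # end-of-sequence (the current block likely lacks the remaining
    # context) or an immediate repetition of the previous token
    # (a simplified form of the repetition criterion).
    if not hyp.token_ids:
        return False
    last = hyp.token_ids[-1]
    if last == EOS_ID:
        return True
    return len(hyp.token_ids) >= 2 and hyp.token_ids[-2] == last

def decode_block(beam: List[Hypothesis],
                 expand_fn: Callable[[Hypothesis], List[Hypothesis]],
                 beam_size: int,
                 max_steps: int = 100) -> List[Hypothesis]:
    # Expand hypotheses within the current encoded block; stop as soon
    # as the best candidate turns unreliable, i.e., a block boundary is
    # detected, and wait for the next block before resuming the search.
    for _ in range(max_steps):
        candidates = [c for h in beam for c in expand_fn(h)]
        if not candidates:
            return beam
        candidates.sort(key=lambda h: h.score, reverse=True)
        if is_unreliable(candidates[0]):
            return beam  # boundary detected; resume with the next block
        beam = candidates[:beam_size]
    return beam

In a full streaming decoder, decode_block would be called once per encoded block, carrying the surviving beam forward so that hypotheses extend synchronously with the incoming audio.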