Paper Title

Streaming Attention-Based Models with Augmented Memory for End-to-End Speech Recognition

Authors

Ching-Feng Yeh, Yongqiang Wang, Yangyang Shi, Chunyang Wu, Frank Zhang, Julian Chan, Michael L. Seltzer

Abstract

Attention-based models have been gaining popularity recently for their strong performance in fields such as machine translation and automatic speech recognition. One major challenge of attention-based models is the need for access to the full sequence and the computational cost that grows quadratically with the sequence length. These characteristics pose challenges, especially for low-latency scenarios, where the system is often required to be streaming. In this paper, we build a compact and streaming speech recognition system on top of the end-to-end neural transducer architecture with attention-based modules augmented with convolution. The proposed system equips the end-to-end models with the streaming capability and reduces the large footprint of the streaming attention-based model using augmented memory. On the LibriSpeech dataset, our proposed system achieves word error rates of 2.7% on test-clean and 5.8% on test-other, to the best of our knowledge the lowest among streaming approaches reported so far.
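The abstract's key idea, attending over short segments plus a compact memory of past segments rather than the full sequence, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the random projection matrices stand in for learned weights, and the mean-pooled segment summary is a simplified stand-in for the learned memory embedding; it only shows how per-step attention cost stays bounded by the segment length plus the memory size instead of growing quadratically with the total sequence length.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def streaming_augmented_memory_attention(frames, segment_len=4, d=8, rng=None):
    """Process frames segment by segment. Each segment attends only to its
    own frames plus a bank of past-segment summaries (the augmented memory),
    so cost per step is O(segment_len + len(memory)), not O(T^2)."""
    rng = rng or np.random.default_rng(0)
    # Hypothetical random projections standing in for learned Q/K/V weights.
    Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
    memory = []   # one summary vector per past segment
    outputs = []
    for start in range(0, len(frames), segment_len):
        seg = frames[start:start + segment_len]             # (s, d)
        ctx = np.vstack([*memory, seg]) if memory else seg  # memory + segment
        q, k, v = seg @ Wq, ctx @ Wk, ctx @ Wv
        att = softmax(q @ k.T / np.sqrt(d)) @ v             # (s, d)
        outputs.append(att)
        # Compress the segment into a single summary vector for the memory.
        memory.append(seg.mean(axis=0, keepdims=True))
    return np.vstack(outputs)

frames = np.random.default_rng(1).standard_normal((12, 8))
out = streaming_augmented_memory_attention(frames, segment_len=4, d=8)
print(out.shape)  # (12, 8): one output per input frame, produced segment-wise
```

Because each segment only ever sees itself and the fixed-size summaries, the model can emit outputs as audio arrives, which is the streaming property the paper targets.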
