Paper Title

SAN-M: Memory Equipped Self-Attention for End-to-End Speech Recognition

Paper Authors

Zhifu Gao, Shiliang Zhang, Ming Lei, Ian McLoughlin

Abstract

End-to-end speech recognition has become popular in recent years, since it can integrate the acoustic, pronunciation and language models into a single neural network. Among end-to-end approaches, attention-based methods have emerged as being superior. One example is the Transformer, which adopts an encoder-decoder architecture. The key improvement introduced by the Transformer is the utilization of self-attention instead of recurrent mechanisms, enabling both encoder and decoder to capture long-range dependencies with lower computational complexity. In this work, we propose boosting the self-attention ability with a DFSMN memory block, forming the proposed memory equipped self-attention (SAN-M) mechanism. Theoretical and empirical comparisons have been made to demonstrate the relevancy and complementarity between self-attention and the DFSMN memory block. Furthermore, the proposed SAN-M provides an efficient mechanism to integrate these two modules. We have evaluated our approach on the public AISHELL-1 benchmark and an industrial-level 20,000-hour Mandarin speech recognition task. On both tasks, SAN-M systems achieved much better performance than the self-attention-based Transformer baseline system. Specifically, it achieves a CER of 6.46% on the AISHELL-1 task even without using any external LM, comfortably outperforming other state-of-the-art systems.
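To make the idea in the abstract concrete, below is a minimal NumPy sketch of one way self-attention could be fused with a DFSMN-style memory block (an FIR filter over the value sequence). The function names (`san_m_layer`, `memory_block`), the additive fusion point, the filter orders and all dimensions are illustrative assumptions for this sketch, not the configuration described in the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def memory_block(v, left_order=5, right_order=5):
    """FSMN-style memory: each output frame is an FIR filter (random,
    illustrative taps here) over its left/right neighbour frames."""
    T, d = v.shape
    taps = 0.1 * np.random.randn(left_order + right_order + 1, d)
    out = np.zeros_like(v)
    for t in range(T):
        for k, offset in enumerate(range(-left_order, right_order + 1)):
            if 0 <= t + offset < T:
                out[t] += taps[k] * v[t + offset]
    return out

def san_m_layer(x, d_k=64):
    """Single-head scaled dot-product self-attention whose output is summed
    with a DFSMN-style memory block applied to the values; the additive
    fusion and all sizes are assumptions of this sketch, not the paper's spec."""
    T, d = x.shape
    Wq, Wk, Wv = (0.1 * np.random.randn(d, d_k) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(d_k), axis=-1)  # (T, T) global weights
    context = attn @ v                               # long-range dependencies
    memory = memory_block(v)                         # local FIR memory over values
    return context + memory

x = np.random.randn(100, 64)        # 100 acoustic frames, 64-dim features
print(san_m_layer(x).shape)         # -> (100, 64)
```

The sketch is only meant to show the complementary roles the abstract alludes to: the attention term models global dependencies across the whole utterance, while the FIR memory term captures local context at low cost.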
