Title
Deep Shallow Fusion for RNN-T Personalization
Authors
Abstract
End-to-end models in general, and the Recurrent Neural Network Transducer (RNN-T) in particular, have gained significant traction in the automatic speech recognition community in the last few years due to their simplicity, compactness, and excellent performance on generic transcription tasks. However, these models are more challenging to personalize than traditional hybrid systems due to the lack of external language models and difficulties in recognizing rare long-tail words, specifically entity names. In this work, we present novel techniques to improve RNN-T's ability to model rare WordPieces, infuse extra information into the encoder, enable the use of alternative graphemic pronunciations, and perform deep fusion with personalized language models for more robust biasing. We show that these combined techniques result in a 15.4%-34.5% relative Word Error Rate improvement compared to a strong RNN-T baseline that uses shallow fusion and text-to-speech augmentation. Our work helps push the boundary of RNN-T personalization and close the gap with hybrid systems on use cases where biasing and entity recognition are crucial.
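The shallow-fusion biasing that the baseline relies on combines the RNN-T score with an external (here, personalized) language model score log-linearly at each beam-search step. A minimal sketch of that interpolation, assuming hypothetical function names, an illustrative LM weight, and made-up probabilities (not the paper's actual values):

```python
import math

def shallow_fusion_score(rnnt_log_prob, lm_log_prob, lm_weight=0.5):
    """Standard shallow fusion: log-linear interpolation of the RNN-T
    score with an external LM score; lm_weight is a tuned hyperparameter."""
    return rnnt_log_prob + lm_weight * lm_log_prob

# During beam search, each hypothesis extension is rescored with the
# combined score. Probabilities below are purely illustrative: the
# personalized LM boosts the entity-like continuation "Kaavya".
candidates = {
    "copy":   shallow_fusion_score(math.log(0.6), math.log(0.05)),
    "Kaavya": shallow_fusion_score(math.log(0.3), math.log(0.4)),
}
best = max(candidates, key=candidates.get)
print(best)
```

Deep fusion, by contrast, integrates the biasing LM's hidden state into the decoder rather than only interpolating output scores, which is what gives the stronger biasing reported above.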