Paper Title

Speech-text based multi-modal training with bidirectional attention for improved speech recognition

Paper Authors

Yuhang Yang, Haihua Xu, Hao Huang, Eng Siong Chng, Sheng Li

Paper Abstract

For a state-of-the-art end-to-end ASR model to benefit from data-efficient multi-modal training, as well as from much larger amounts of unpaired text data, two problems must be addressed: 1) the synchronicity of feature sampling rates between speech and language (i.e., text data); 2) the homogeneity of the representations learned by the two encoders. In this paper we propose a novel bidirectional attention mechanism (BiAM) that jointly learns the ASR encoder (bottom layers) and a text encoder with a multi-modal learning method. The BiAM facilitates feature sampling rate exchange, so that the quality of the transformed features of one modality can be measured in the other modality's space, under diversified objective functions. As a result, the speech representations are enriched with more linguistic information, while the representations generated by the text encoder become more similar to the corresponding speech ones; the shared ASR model is therefore more amenable to pretraining on unpaired text data. To validate the efficacy of the proposed method, we perform two categories of experiments, with and without extra unpaired text data. Experimental results on the Librispeech corpus show that the method achieves up to 6.15% word error rate reduction (WERR) with paired-data learning only, and 9.23% WERR when additional unpaired text data is employed.
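The abstract describes the BiAM idea but gives no implementation details. Below is a minimal, hypothetical PyTorch sketch of how bidirectional cross-attention between a speech encoder and a text encoder could exchange sampling rates: text embeddings query the speech frames to obtain a text-length view of the speech, and speech frames query the text tokens to obtain a frame-length view of the text. All names and hyperparameters here (BidirectionalAttention, d_model, n_heads) are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class BidirectionalAttention(nn.Module):
    """Hypothetical sketch of a BiAM-style module: cross-attention in both
    directions between speech-encoder and text-encoder outputs, so each
    modality is re-expressed at the other's sequence length (sampling rate)."""

    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        # Text queries attend over speech frames -> text-length view of speech.
        self.speech_to_text = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Speech queries attend over text tokens -> frame-length view of text.
        self.text_to_speech = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, speech: torch.Tensor, text: torch.Tensor):
        # speech: (B, T_speech, d_model) acoustic frames from the ASR encoder
        # text:   (B, T_text,   d_model) token embeddings from the text encoder
        speech_at_text_rate, _ = self.speech_to_text(query=text, key=speech, value=speech)
        text_at_speech_rate, _ = self.text_to_speech(query=speech, key=text, value=text)
        # Each output now lives at the other modality's sequence length and can
        # be compared against that modality's native representation.
        return speech_at_text_rate, text_at_speech_rate

if __name__ == "__main__":
    biam = BidirectionalAttention(d_model=256)
    speech = torch.randn(2, 300, 256)  # 2 utterances, 300 acoustic frames
    text = torch.randn(2, 40, 256)     # 2 transcripts, 40 tokens
    s2t, t2s = biam(speech, text)
    print(s2t.shape, t2s.shape)  # torch.Size([2, 40, 256]) torch.Size([2, 300, 256])
```

In such a setup, the "diversified objective functions" mentioned in the abstract would presumably be alignment losses applied between speech_at_text_rate and the text representations, and between text_at_speech_rate and the speech representations, encouraging the homogeneity of the two encoders' outputs.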
