Title

Improving the fusion of acoustic and text representations in RNN-T

Authors

Chao Zhang, Bo Li, Zhiyun Lu, Tara N. Sainath, Shuo-yiin Chang

Abstract

The recurrent neural network transducer (RNN-T) has recently become the mainstream end-to-end approach for streaming automatic speech recognition (ASR). To estimate the output distributions over subword units, RNN-T uses a fully connected layer as the joint network to fuse the acoustic representations extracted using the acoustic encoder with the text representations obtained using the prediction network based on the previous subword units. In this paper, we propose to use gating, bilinear pooling, and a combination of them in the joint network to produce more expressive representations to feed into the output layer. A regularisation method is also proposed to enable better acoustic encoder training by reducing the gradients back-propagated into the prediction network at the beginning of RNN-T training. Experimental results on a multilingual ASR setting for voice search over nine languages show that the joint use of the proposed methods can result in 4%--5% relative word error rate reductions with only a few million extra parameters.
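The abstract names two fusion schemes for the joint network: gating and bilinear pooling. As a minimal NumPy sketch of these two operations (the dimensions, weight names, and the low-rank factorisation of the bilinear form are illustrative assumptions, not the paper's exact layers):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8       # shared toy dimension for acoustic/text vectors (assumed)
rank = 16   # rank of the factorised bilinear form (assumed)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Gating: an element-wise gate, computed from both inputs, interpolates
# between the acoustic and text representations.
W_ga = rng.normal(scale=0.1, size=(d, d))
W_gt = rng.normal(scale=0.1, size=(d, d))

def gated_fusion(a, t):
    g = sigmoid(W_ga @ a + W_gt @ t)  # gate values in (0, 1)
    return g * a + (1.0 - g) * t

# Bilinear pooling: h_k = a^T W_k t captures multiplicative interactions
# between the two modalities; factorised here as P (Ua * Vt) to keep the
# extra parameter count to a few small matrices.
U = rng.normal(scale=0.1, size=(rank, d))
V = rng.normal(scale=0.1, size=(rank, d))
P = rng.normal(scale=0.1, size=(d, rank))

def bilinear_pooling(a, t):
    return P @ ((U @ a) * (V @ t))

a = rng.normal(size=d)  # acoustic encoder output for one frame
t = rng.normal(size=d)  # prediction network output for one label step
fused = gated_fusion(a, t) + bilinear_pooling(a, t)
```

Either fused vector (or a combination of the two, as the paper proposes) would then feed the output layer in place of the plain additive fusion of a standard RNN-T joint network.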
