Paper Title
Efficient Knowledge Distillation for RNN-Transducer Models
Paper Authors
Paper Abstract
Knowledge distillation is an effective method of transferring knowledge from a large model to a smaller one. Distillation can be viewed as a form of model compression and has played an important role in on-device ASR applications. In this paper, we develop a distillation method for RNN-Transducer (RNN-T) models, a popular end-to-end neural network architecture for streaming speech recognition. Our proposed distillation loss is simple and efficient, and uses only the "y" and "blank" posterior probabilities from the RNN-T output probability lattice. We study the effectiveness of the proposed approach in improving the accuracy of sparse RNN-T models obtained by gradually pruning a larger uncompressed model, which also serves as the teacher during distillation. With distillation of 60% and 90% sparse multi-domain RNN-T models, we obtain WER reductions of 4.3% and 12.1%, respectively, on a noisy FarField eval set. We also present results of experiments on LibriSpeech, where the introduction of the distillation loss yields a 4.8% relative WER reduction on the test-other dataset for a small Conformer model.
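The abstract describes the distillation loss only at a high level. The sketch below is one possible reading of it in PyTorch: each node (t, u) of the teacher and student output lattices is collapsed into a 3-way distribution over the next label "y", "blank", and the remaining probability mass, and a KL divergence is taken between the two collapsed distributions. All tensor names, shapes, and padding choices (teacher_logits, student_logits, labels, blank_id) are illustrative assumptions, not taken from the authors' implementation.

```python
# Minimal sketch of a lattice-collapsed RNN-T distillation loss (assumed,
# not the paper's code). Hypothetical inputs:
#   teacher_logits, student_logits: [B, T, U, V] RNN-T output lattices
#   labels: [B, U-1] ground-truth token ids (y_{u+1} at lattice row u)
import torch
import torch.nn.functional as F

def collapsed_rnnt_distillation_loss(teacher_logits, student_logits,
                                     labels, blank_id=0, eps=1e-8):
    B, T, U, V = student_logits.shape

    def collapse(logits):
        # Full posteriors over the vocabulary at every lattice node.
        probs = F.softmax(logits, dim=-1)                    # [B, T, U, V]
        p_blank = probs[..., blank_id]                       # [B, T, U]
        # Probability of the next ground-truth label y_{u+1} at row u.
        # Pad labels to length U so gather is well defined at the last row.
        padded = F.pad(labels, (0, 1), value=blank_id)       # [B, U]
        idx = padded.unsqueeze(1).expand(B, T, U).unsqueeze(-1)
        p_y = probs.gather(-1, idx).squeeze(-1)              # [B, T, U]
        # No "next label" exists at the final grid row; zero it out there.
        mask = torch.ones(U, device=probs.device)
        mask[-1] = 0.0
        p_y = p_y * mask
        # Everything that is neither the next label nor blank.
        p_rest = (1.0 - p_y - p_blank).clamp_min(eps)
        return torch.stack([p_y, p_blank, p_rest], dim=-1)   # [B, T, U, 3]

    p_teacher = collapse(teacher_logits.detach())  # teacher is not trained
    p_student = collapse(student_logits)
    # KL(teacher || student) over the collapsed 3-way distributions,
    # averaged over all lattice nodes.
    kl = (p_teacher * (torch.log(p_teacher + eps) -
                       torch.log(p_student + eps))).sum(-1)
    return kl.mean()
```

Collapsing to three probabilities per lattice node keeps the distillation term cheap: the KL divergence is computed over 3 classes instead of the full output vocabulary at each of the T x U lattice positions, which is consistent with the abstract's claim that the loss is simple and efficient.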