Paper Title
Efficient Knowledge Distillation for RNN-Transducer Models
Paper Authors
Paper Abstract
Knowledge distillation is an effective method of transferring knowledge from a large model to a smaller one. Distillation can be viewed as a form of model compression and has played an important role in on-device ASR applications. In this paper, we develop a distillation method for RNN-Transducer (RNN-T) models, a popular end-to-end neural network architecture for streaming speech recognition. Our proposed distillation loss is simple and efficient, and uses only the "y" and "blank" posterior probabilities from the RNN-T output probability lattice. We study the effectiveness of the proposed approach in improving the accuracy of sparse RNN-T models obtained by gradually pruning a larger uncompressed model, which also serves as the teacher during distillation. With distillation of 60% and 90% sparse multi-domain RNN-T models, we obtain WER reductions of 4.3% and 12.1%, respectively, on a noisy FarField eval set. We also present results of experiments on LibriSpeech, where the introduction of the distillation loss yields a 4.8% relative WER reduction on the test-other dataset for a small Conformer model.
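The abstract describes the distillation loss only at a high level. The sketch below is one possible reading of it in PyTorch: each node (t, u) of the teacher and student output lattices is collapsed into a 3-way distribution over the next label "y", "blank", and the remaining probability mass, and a KL divergence is taken between the two collapsed distributions. All tensor names, shapes, and padding choices (teacher_logits, student_logits, labels, blank_id) are illustrative assumptions, not taken from the authors' implementation.

```python
# Minimal sketch of a lattice-collapsed RNN-T distillation loss (assumed,
# not the paper's code). Hypothetical inputs:
#   teacher_logits, student_logits: [B, T, U, V] RNN-T output lattices
#   labels: [B, U-1] ground-truth token ids (y_{u+1} at lattice row u)
import torch
import torch.nn.functional as F

def collapsed_rnnt_distillation_loss(teacher_logits, student_logits,
                                     labels, blank_id=0, eps=1e-8):
    B, T, U, V = student_logits.shape

    def collapse(logits):
        # Full posteriors over the vocabulary at every lattice node.
        probs = F.softmax(logits, dim=-1)                    # [B, T, U, V]
        p_blank = probs[..., blank_id]                       # [B, T, U]
        # Probability of the next ground-truth label y_{u+1} at row u.
        # Pad labels to length U so gather is well defined at the last row.
        padded = F.pad(labels, (0, 1), value=blank_id)       # [B, U]
        idx = padded.unsqueeze(1).expand(B, T, U).unsqueeze(-1)
        p_y = probs.gather(-1, idx).squeeze(-1)              # [B, T, U]
        # No "next label" exists at the final grid row; zero it out there.
        mask = torch.ones(U, device=probs.device)
        mask[-1] = 0.0
        p_y = p_y * mask
        # Everything that is neither the next label nor blank.
        p_rest = (1.0 - p_y - p_blank).clamp_min(eps)
        return torch.stack([p_y, p_blank, p_rest], dim=-1)   # [B, T, U, 3]

    p_teacher = collapse(teacher_logits.detach())  # teacher is not trained
    p_student = collapse(student_logits)
    # KL(teacher || student) over the collapsed 3-way distributions,
    # averaged over all lattice nodes.
    kl = (p_teacher * (torch.log(p_teacher + eps) -
                       torch.log(p_student + eps))).sum(-1)
    return kl.mean()
```

Collapsing to three probabilities per lattice node keeps the distillation term cheap: the KL divergence is computed over 3 classes instead of the full output vocabulary at each of the T x U lattice positions, which is consistent with the abstract's claim that the loss is simple and efficient.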