Paper Title

Bayesian Learning of LF-MMI Trained Time Delay Neural Networks for Speech Recognition

Authors

Shoukang Hu, Xurong Xie, Shansong Liu, Jianwei Yu, Zi Ye, Mengzhe Geng, Xunying Liu, Helen Meng

Abstract


Discriminative training techniques define state-of-the-art performance for automatic speech recognition systems. However, they are inherently prone to overfitting, leading to poor generalization performance when using limited training data. In order to address this issue, this paper presents a full Bayesian framework to account for model uncertainty in sequence discriminative training of factored TDNN acoustic models. Several Bayesian learning based TDNN variant systems are proposed to model the uncertainty over weight parameters and choices of hidden activation functions, or the hidden layer outputs. Efficient variational inference approaches using as few as one single parameter sample ensure that their computational cost in both training and evaluation time remains comparable to that of the baseline TDNN systems. Statistically significant word error rate (WER) reductions of 0.4%-1.8% absolute (5%-11% relative) were obtained over a state-of-the-art 900 hour speed perturbed Switchboard corpus trained baseline LF-MMI factored TDNN system using multiple regularization methods including F-smoothing, L2 norm penalty, natural gradient, model averaging and dropout, in addition to i-Vector plus learning hidden unit contribution (LHUC) based speaker adaptation and RNNLM rescoring. Consistent performance improvements were also obtained on a 450 hour HKUST conversational Mandarin telephone speech recognition task. On a third cross domain adaptation task requiring rapidly porting a 1000 hour LibriSpeech data trained system to a small DementiaBank elderly speech corpus, the proposed Bayesian TDNN LF-MMI systems outperformed the baseline system using direct weight fine-tuning by up to 2.5% absolute WER reduction.
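The abstract's efficient variational inference, drawing as few as one weight sample per forward pass, can be illustrated with a minimal sketch. The snippet below is an illustrative reconstruction, not the paper's implementation: a linear layer with a Gaussian variational posterior over its weights, sampled once via the reparameterization trick, plus the closed-form KL term to a Gaussian prior. All class and variable names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

class BayesianLinear:
    """Linear layer with a factorized Gaussian variational posterior
    over its weights. One reparameterized sample (w = mu + sigma * eps)
    per forward pass keeps the cost close to a deterministic layer."""

    def __init__(self, n_in, n_out, prior_std=1.0):
        self.mu = rng.normal(0.0, 0.1, size=(n_in, n_out))
        self.rho = np.full((n_in, n_out), -3.0)  # sigma = softplus(rho) > 0
        self.prior_std = prior_std

    def sigma(self):
        return np.log1p(np.exp(self.rho))  # softplus keeps sigma positive

    def forward(self, x):
        eps = rng.normal(size=self.mu.shape)
        w = self.mu + self.sigma() * eps   # single-sample reparameterization
        return x @ w

    def kl(self):
        # KL( N(mu, sigma^2) || N(0, prior_std^2) ), summed over all weights;
        # added to the training loss as the variational regularizer.
        s2, p2 = self.sigma() ** 2, self.prior_std ** 2
        return 0.5 * np.sum(s2 / p2 + self.mu ** 2 / p2 - 1.0 - np.log(s2 / p2))

layer = BayesianLinear(4, 2)
y = layer.forward(np.ones((3, 4)))   # stochastic output, shape (3, 2)
```

At evaluation time the sampled weight can be replaced by its posterior mean `mu`, giving a deterministic forward pass at exactly the baseline system's cost.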
