Paper Title

Attentional Speech Recognition Models Misbehave on Out-of-domain Utterances

Authors

Phillip Keung, Wei Niu, Yichao Lu, Julian Salazar, Vikas Bhardwaj

Abstract

We discuss the problem of echographic transcription in autoregressive sequence-to-sequence attentional architectures for automatic speech recognition, where a model produces very long sequences of repetitive outputs when presented with out-of-domain utterances. We decode audio from the British National Corpus with an attentional encoder-decoder model trained solely on the LibriSpeech corpus. We observe that there are many 5-second recordings that produce more than 500 characters of decoding output (i.e. more than 100 characters per second). A frame-synchronous hybrid (DNN-HMM) model trained on the same data does not produce these unusually long transcripts. These decoding issues are reproducible in a speech transformer model from ESPnet, and to a lesser extent in a self-attention CTC model, suggesting that these issues are intrinsic to the use of the attention mechanism. We create a separate length prediction model to predict the correct number of wordpieces in the output, which allows us to identify and truncate problematic decoding results without increasing word error rates on the LibriSpeech task.
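As a minimal illustration of the safeguard the abstract describes, the sketch below flags hypotheses whose character rate is implausibly high for the audio duration (the abstract's >100 characters-per-second observation) and truncates a flagged hypothesis to a predicted wordpiece count. This is not the authors' code; the function names, the threshold default, and the truncation strategy are assumptions for illustration only (the paper's actual length predictor is a separate trained model).

```python
# Hypothetical sketch, not the paper's implementation: detect and truncate
# anomalously long ASR decoding outputs.

def is_anomalous(transcript: str, duration_s: float,
                 max_chars_per_sec: float = 100.0) -> bool:
    """Flag a transcript that is implausibly long for the audio duration.

    The 100 chars/sec default mirrors the abstract's observation that
    5-second clips sometimes decoded to 500+ characters.
    """
    return len(transcript) > max_chars_per_sec * duration_s


def truncate_to_length(wordpieces: list[str], predicted_len: int) -> list[str]:
    """Keep only the first `predicted_len` wordpieces of a flagged hypothesis.

    In the paper, `predicted_len` would come from a separate trained
    length prediction model; here it is just a parameter.
    """
    return wordpieces[:predicted_len]


# A 5-second clip that decoded to 600 repetitive characters is flagged.
hyp = "the the the " * 50                    # 600 characters
print(is_anomalous(hyp, 5.0))                # True: 600 > 100 * 5
print(truncate_to_length(["▁hel", "lo", "▁wor", "ld", "ld", "ld"], 4))
```

Only flagged hypotheses would be truncated, which is why the abstract reports no word error rate increase on in-domain LibriSpeech data.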
