Paper Title

Whole-Word Segmental Speech Recognition with Acoustic Word Embeddings

Paper Authors

Bowen Shi, Shane Settle, Karen Livescu

Paper Abstract

Segmental models are sequence prediction models in which scores of hypotheses are based on entire variable-length segments of frames. We consider segmental models for whole-word ("acoustic-to-word") speech recognition, with the feature vectors defined using vector embeddings of segments. Such models are computationally challenging as the number of paths is proportional to the vocabulary size, which can be orders of magnitude larger than when using subword units like phones. We describe an efficient approach for end-to-end whole-word segmental models, with forward-backward and Viterbi decoding performed on a GPU and a simple segment scoring function that reduces space complexity. In addition, we investigate the use of pre-training via jointly trained acoustic word embeddings (AWEs) and acoustically grounded word embeddings (AGWEs) of written word labels. We find that word error rate can be reduced by a large margin by pre-training the acoustic segment representation with AWEs, and additional (smaller) gains can be obtained by pre-training the word prediction layer with AGWEs. Our final models improve over prior A2W models.
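To make the segmental decoding concrete, below is a minimal sketch, not the authors' implementation, of Viterbi decoding over whole-word segments with a dot-product segment score of the kind the abstract describes (an acoustic segment embedding scored against a word embedding). The `segment_embed` mean-pooling function, the helper names, and the toy dimensions are all illustrative assumptions; the paper uses learned embedding functions and runs the recursions on a GPU.

```python
# Sketch of segmental Viterbi decoding with an embedding-based segment score:
# s(t1, t2, w) = <g(x[t1:t2]), e_w>, where g embeds a variable-length segment
# and e_w is the embedding of word w. Hypothetical example, not the paper's code.
import numpy as np

def segment_embed(frames: np.ndarray) -> np.ndarray:
    """Toy segment embedding: mean-pool the segment's frames.
    (The paper learns this acoustic embedding function instead.)"""
    return frames.mean(axis=0)

def viterbi_decode(frames, word_embs, max_dur):
    """Best segmentation and labeling under additive segment scores.

    frames:    (T, d) array of acoustic frame features
    word_embs: (V, d) array of word embedding vectors
    max_dur:   maximum segment duration in frames
    Returns the best path score and a list of (start, end, word_id) segments.
    """
    T = frames.shape[0]
    alpha = np.full(T + 1, -np.inf)   # alpha[t] = best score of a path ending at frame t
    alpha[0] = 0.0
    back = [None] * (T + 1)           # backpointers: (previous boundary, word_id)
    for t in range(1, T + 1):
        for dur in range(1, min(max_dur, t) + 1):
            seg = segment_embed(frames[t - dur:t])   # (d,) segment embedding
            scores = word_embs @ seg                 # (V,) one score per word
            w = int(scores.argmax())                 # best word for this segment
            cand = alpha[t - dur] + scores[w]
            if cand > alpha[t]:
                alpha[t] = cand
                back[t] = (t - dur, w)
    # Trace back the best path from the final frame.
    segs, t = [], T
    while t > 0:
        prev_t, w = back[t]
        segs.append((prev_t, t, w))
        t = prev_t
    return alpha[T], segs[::-1]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(20, 8))       # 20 frames, 8-dim features
    word_embs = rng.normal(size=(100, 8))   # toy vocabulary of 100 words
    score, segs = viterbi_decode(frames, word_embs, max_dur=6)
    print(score, segs)
```

The inner loop makes the computational challenge visible: the score tensor spans T positions, max_dur durations, and V words, and with whole-word vocabularies V is orders of magnitude larger than a phone set. Factoring the score as a dot product between a segment embedding and a word embedding matrix, as above, is what lets the paper batch the scoring as matrix multiplications and keep the space cost manageable.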
