Semi-Supervised Speech Recognition via Graph-based Temporal Classification

Paper Title

Semi-Supervised Speech Recognition via Graph-based Temporal Classification

Authors

Niko Moritz, Takaaki Hori, Jonathan Le Roux

Abstract

Semi-supervised learning has demonstrated promising results in automatic speech recognition (ASR) by self-training using a seed ASR model with pseudo-labels generated for unlabeled data. The effectiveness of this approach largely relies on the pseudo-label accuracy, for which typically only the 1-best ASR hypothesis is used. However, alternative ASR hypotheses of an N-best list can provide more accurate labels for an unlabeled speech utterance and also reflect uncertainties of the seed ASR model. In this paper, we propose a generalized form of the connectionist temporal classification (CTC) objective that accepts a graph representation of the training labels. The newly proposed graph-based temporal classification (GTC) objective is applied for self-training with WFST-based supervision, which is generated from an N-best list of pseudo-labels. In this setup, GTC is used to learn not only a temporal alignment, similarly to CTC, but also a label alignment to obtain the optimal pseudo-label sequence from the weighted graph. Results show that this approach can effectively exploit an N-best list of pseudo-labels with associated scores, considerably outperforming standard pseudo-labeling, with ASR results approaching an oracle experiment in which the best hypotheses of the N-best lists are selected manually.
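The idea of a CTC-like objective over a graph of labels can be illustrated with a toy sketch. It is not the authors' implementation: the abstract does not specify how the WFST is built, so this sketch assumes the simplest possible graph, a union of one CTC-style chain per N-best hypothesis, with each hypothesis score applied once at the chain's entry. All names (`ctc_chain`, `build_nbest_graph`, `graph_forward`, `brute_force_ctc`) and all numbers are hypothetical, and probabilities are kept in the linear domain for readability rather than the log domain a real implementation would likely use.

```python
import math
from itertools import product

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_chain(labels, weight, offset):
    """Expand one hypothesis into a CTC-style chain: blank, l1, blank, ..., blank."""
    states = [BLANK]
    for label in labels:
        states += [label, BLANK]
    n = len(states)
    preds = {}
    for i in range(n):
        incoming = [i]  # self-loop: stay on the same state
        if i >= 1:
            incoming.append(i - 1)  # advance by one state
        if i >= 2 and states[i] != BLANK and states[i] != states[i - 2]:
            incoming.append(i - 2)  # skip the blank between two different labels
        preds[offset + i] = [offset + j for j in incoming]
    # A path may start on the leading blank or on the first label.
    starts = {offset + i: weight for i in range(min(2, n))}
    # A path may end on the last label or on the trailing blank.
    finals = {offset + i for i in range(max(0, n - 2), n)}
    return states, preds, starts, finals


def build_nbest_graph(nbest):
    """Union of CTC chains, one per (labels, weight) hypothesis (a simplification)."""
    node_labels, preds, starts, finals = [], {}, {}, set()
    for labels, weight in nbest:
        s, p, st, fi = ctc_chain(labels, weight, offset=len(node_labels))
        node_labels += s
        preds.update(p)
        starts.update(st)
        finals |= fi
    return node_labels, preds, starts, finals


def graph_forward(probs, node_labels, preds, starts, finals):
    """Sum the weighted probability of all length-T paths through the graph."""
    nodes = range(len(node_labels))
    alpha = [starts.get(v, 0.0) * probs[0][node_labels[v]] for v in nodes]
    for frame in probs[1:]:
        alpha = [frame[node_labels[v]] * sum(alpha[u] for u in preds[v])
                 for v in nodes]
    return sum(alpha[v] for v in finals)


def collapse(path):
    """Standard CTC collapse: merge repeated symbols, then drop blanks."""
    out, prev = [], None
    for s in path:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return tuple(out)


def brute_force_ctc(probs, labels):
    """Reference CTC probability by enumerating every frame-level path."""
    total = 0.0
    for path in product(range(len(probs[0])), repeat=len(probs)):
        if collapse(path) == tuple(labels):
            total += math.prod(probs[t][s] for t, s in enumerate(path))
    return total


# Made-up per-frame posteriors (4 frames, 3 symbols) and a made-up N-best list.
probs = [[0.5, 0.3, 0.2], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7], [0.6, 0.1, 0.3]]
nbest = [([1, 2], 0.6), ([1], 0.3), ([2, 2], 0.1)]
total = graph_forward(probs, *build_nbest_graph(nbest))
loss = -math.log(total)  # negative log of the summed path probability
```

Under these assumptions the graph path-sum equals the score-weighted sum of the ordinary CTC probabilities of the individual hypotheses, which `brute_force_ctc` can confirm on small inputs. A shared graph that merges common sub-paths across hypotheses would change the structure but not the forward recursion.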
