Paper title
Semi-Supervised Speech Recognition via Graph-based Temporal Classification
Authors
Abstract
Semi-supervised learning has demonstrated promising results in automatic speech recognition (ASR) by self-training using a seed ASR model with pseudo-labels generated for unlabeled data. The effectiveness of this approach largely relies on the pseudo-label accuracy, for which typically only the 1-best ASR hypothesis is used. However, alternative ASR hypotheses of an N-best list can provide more accurate labels for an unlabeled speech utterance and also reflect uncertainties of the seed ASR model. In this paper, we propose a generalized form of the connectionist temporal classification (CTC) objective that accepts a graph representation of the training labels. The newly proposed graph-based temporal classification (GTC) objective is applied for self-training with WFST-based supervision, which is generated from an N-best list of pseudo-labels. In this setup, GTC is used to learn not only a temporal alignment, similarly to CTC, but also a label alignment to obtain the optimal pseudo-label sequence from the weighted graph. Results show that this approach can effectively exploit an N-best list of pseudo-labels with associated scores, considerably outperforming standard pseudo-labeling, with ASR results approaching an oracle experiment in which the best hypotheses of the N-best lists are selected manually.
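The idea of summing over both temporal alignments and alternative label sequences can be illustrated with a small forward pass over a supervision graph. The sketch below is not the authors' implementation. It assumes the simplest possible graph, with one parallel CTC-style chain per N-best hypothesis and the hypothesis score applied as an entry weight. The names `build_nbest_graph` and `graph_forward`, the blank index 0, and the toy posteriors are all hypothetical. Probabilities are kept in the linear domain for readability; practical implementations normally work in the log domain to avoid underflow.

```python
BLANK = 0  # assumed index of the CTC blank symbol


def build_nbest_graph(nbest):
    """Build a toy supervision graph from (label_sequence, weight) pairs.

    Each hypothesis becomes its own CTC-style chain of nodes:
    blank, l1, blank, l2, ..., blank. Chains are not merged.
    """
    node_labels, edges, starts, finals = [], {}, {}, set()
    for labels, weight in nbest:
        states = [BLANK]
        for lab in labels:
            states += [lab, BLANK]
        off, n = len(node_labels), len(states)
        node_labels.extend(states)
        for i in range(n):
            edges[(off + i, off + i)] = 1.0          # stay on the same node
            if i + 1 < n:
                edges[(off + i, off + i + 1)] = 1.0  # move to the next node
            # Skip the blank between two different non-blank labels.
            if i + 2 < n and states[i] != BLANK and states[i] != states[i + 2]:
                edges[(off + i, off + i + 2)] = 1.0
        # A path may start on the leading blank or on the first label.
        for g in {off, off + min(1, n - 1)}:
            starts[g] = weight
        # A path may end on the last label or on the trailing blank.
        finals.update({off + n - 1, off + max(n - 2, 0)})
    return node_labels, edges, starts, finals


def graph_forward(posteriors, node_labels, edges, starts, finals):
    """Sum the weighted probability of all graph paths of length T.

    posteriors[t][k] is the network output probability of symbol k at frame t.
    """
    alpha = [0.0] * len(node_labels)
    for g, w in starts.items():
        alpha[g] = w * posteriors[0][node_labels[g]]
    for t in range(1, len(posteriors)):
        new_alpha = [0.0] * len(node_labels)
        for (src, dst), w in edges.items():
            new_alpha[dst] += alpha[src] * w * posteriors[t][node_labels[dst]]
        alpha = new_alpha
    return sum(alpha[g] for g in finals)


# Toy example: 4 frames, 3 symbols (blank, 1, 2), two weighted hypotheses.
toy_posteriors = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.2, 0.7],
    [0.5, 0.25, 0.25],
]
toy_graph = build_nbest_graph([([1, 2], 0.7), ([1, 1], 0.3)])
print(graph_forward(toy_posteriors, *toy_graph))
```

With a single hypothesis of weight 1.0 this reduces to the ordinary CTC forward probability. With several hypotheses it yields the score-weighted sum of their CTC probabilities, which is one simple way a graph of pseudo-labels can enter the training objective.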