Paper Title

Continuous Soft Pseudo-Labeling in ASR

Authors

Tatiana Likhomanenko, Ronan Collobert, Navdeep Jaitly, Samy Bengio

Abstract

Continuous pseudo-labeling (PL) algorithms such as slimIPL have recently emerged as a powerful strategy for semi-supervised learning in speech recognition. In contrast with earlier strategies that alternated between training a model and generating pseudo-labels (PLs) with it, here PLs are generated in an end-to-end manner as training proceeds, improving both training speed and the accuracy of the final model. PL shares a common theme with teacher-student approaches such as distillation, in that a teacher model generates targets that the student model being trained must mimic. Interestingly, however, PL strategies generally use hard labels, whereas distillation uses the distribution over labels as the target to mimic. Inspired by distillation, we expect that specifying the whole distribution over sequences (soft labels) as the target for unlabeled data, instead of a single best-pass pseudo-labeled transcript (hard labels), should improve PL performance and convergence. Surprisingly, we find that soft-label targets can lead to training divergence, with the model collapsing to a degenerate token distribution per frame. We hypothesize that this does not happen with hard labels because the training loss on hard labels imposes sequence-level consistency that keeps the model from collapsing to the degenerate solution. In this paper, we present several experiments that support this hypothesis and evaluate several regularization approaches that can ameliorate the degenerate collapse when using soft labels. These approaches bring the accuracy of soft labels closer to that of hard labels; while they do not yet outperform hard labels, they provide a useful framework for further improvements.
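The hard-label vs. soft-label contrast can be sketched frame-wise. This is a simplified per-frame illustration, not the paper's method: the actual slimIPL-style losses are sequence-level (CTC), and all names, shapes, and the toy data here are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
# Hypothetical teacher/student logits: 4 frames, 5-token vocabulary.
teacher_logits = rng.normal(size=(4, 5))
student_logits = rng.normal(size=(4, 5))

teacher_probs = softmax(teacher_logits)
student_log_probs = np.log(softmax(student_logits))

# Hard pseudo-labels: keep only the teacher's argmax token per frame,
# then train the student with ordinary cross-entropy on that transcript.
hard_labels = teacher_probs.argmax(axis=-1)
hard_loss = -student_log_probs[np.arange(4), hard_labels].mean()

# Soft pseudo-labels: keep the teacher's full per-frame distribution
# and use it directly as the cross-entropy target (as in distillation).
soft_loss = -(teacher_probs * student_log_probs).sum(axis=-1).mean()

print(f"hard-label loss: {hard_loss:.4f}, soft-label loss: {soft_loss:.4f}")
```

The abstract's observation is that the soft variant, despite carrying strictly more information per frame, can let the model drift toward a degenerate per-frame token distribution, whereas the discrete transcript used by the hard variant enforces a sequence-level constraint.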
