Paper Title
CTC-Segmentation of Large Corpora for German End-to-end Speech Recognition
Paper Authors
Paper Abstract
Recent end-to-end Automatic Speech Recognition (ASR) systems have demonstrated the ability to outperform conventional hybrid DNN/HMM ASR. Aside from architectural improvements, these models have grown in depth, parameter count, and model capacity. However, they also require more training data to achieve comparable performance. In this work, we combine freely available corpora for German speech recognition, including as-yet unlabeled speech data, into a large dataset of over $1700$ h of speech. For data preparation, we propose a two-stage approach that uses an ASR model pre-trained with Connectionist Temporal Classification (CTC) to bootstrap more training data from unsegmented or unlabeled training data. Utterances are then extracted from the label probabilities obtained from the CTC-trained network to determine segment alignments. With this training data, we train a hybrid CTC/attention Transformer model that achieves $12.8\%$ WER on the Tuda-DE test set, surpassing the previous baseline of $14.4\%$ set by a conventional hybrid DNN/HMM ASR system.
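The segment-alignment step described in the abstract lends itself to a compact illustration. Below is a minimal NumPy sketch of how segment boundaries can be recovered from CTC label probabilities: a known transcript is Viterbi-aligned against frame-wise log-probabilities, yielding a frame span and a confidence score per token. The function name, the toy inputs, and the scoring choice (mean per-frame log-probability) are illustrative assumptions, not the paper's exact algorithm; the ctc_segmentation package released alongside the paper implements the full method.

```python
# Sketch: align a known transcript to frame-wise CTC log-probabilities and
# recover per-token frame spans plus a confidence score. Assumes a non-empty
# transcript and label index 0 as the CTC blank.
import numpy as np


def ctc_align(log_probs: np.ndarray, tokens: list, blank: int = 0):
    """Viterbi alignment of `tokens` over CTC frame log-probabilities.

    log_probs: (T, V) array of per-frame log label probabilities.
    tokens:    transcript as a list of vocabulary indices (no blanks).
    Returns a list of (token, start_frame, end_frame, score) tuples.
    """
    T = log_probs.shape[0]
    # Expand the transcript into CTC states: blank, t1, blank, t2, ..., blank
    states = [blank]
    for tok in tokens:
        states += [tok, blank]
    S = len(states)

    NEG = -1e30
    delta = np.full((T, S), NEG)        # best log-prob ending in state s at frame t
    back = np.zeros((T, S), dtype=int)  # backpointers for path recovery
    delta[0, 0] = log_probs[0, states[0]]
    delta[0, 1] = log_probs[0, states[1]]

    for t in range(1, T):
        for s in range(S):
            # Allowed predecessors: stay, advance by one, or skip over a blank
            cands = [(delta[t - 1, s], s)]
            if s >= 1:
                cands.append((delta[t - 1, s - 1], s - 1))
            if s >= 2 and states[s] != blank and states[s] != states[s - 2]:
                cands.append((delta[t - 1, s - 2], s - 2))
            best, arg = max(cands)
            delta[t, s] = best + log_probs[t, states[s]]
            back[t, s] = arg

    # Backtrack from the better of the two valid final states
    s = S - 1 if delta[T - 1, S - 1] >= delta[T - 1, S - 2] else S - 2
    path = [s]
    for t in range(T - 1, 0, -1):
        s = back[t, s]
        path.append(s)
    path.reverse()

    # Collapse the state path into one contiguous frame span per token
    segments = []
    for i, tok in enumerate(tokens):
        state = 2 * i + 1  # position of token i in `states`
        frames = [t for t, st in enumerate(path) if st == state]
        if frames:
            score = float(np.mean([log_probs[t, tok] for t in frames]))
            segments.append((tok, frames[0], frames[-1], score))
    return segments


# Toy usage: 10 frames, a 4-label vocabulary (0 = blank), transcript [2, 3].
rng = np.random.default_rng(0)
lp = np.log(rng.dirichlet(np.ones(4), size=10))
for tok, start, end, score in ctc_align(lp, [2, 3]):
    print(f"token {tok}: frames {start}-{end}, mean log-prob {score:.2f}")
```

The per-token confidence score is what makes the bootstrapping in the two-stage approach practical: aligned segments with low scores can be filtered out before the resulting labels are used to train the CTC/attention model.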