Paper Title

A Wav2vec2-Based Experimental Study on Self-Supervised Learning Methods to Improve Child Speech Recognition

Paper Authors

Jain, Rishabh, Barcovschi, Andrei, Yiwere, Mariam, Bigioi, Dan, Corcoran, Peter, Cucu, Horia

Paper Abstract

Despite recent advances in deep learning, child speech recognition remains a challenging task. Current Automatic Speech Recognition (ASR) models require substantial amounts of annotated training data, which is scarce for child speech. In this work, we explore using the wav2vec2 ASR model with different pretraining and finetuning configurations for self-supervised learning (SSL) to improve automatic child speech recognition. The pretrained wav2vec2 models were finetuned using different amounts of child speech training data, adult speech data, and a combination of both, to discover the optimum amount of data required to finetune the model for the child ASR task. Our trained model achieves a Word Error Rate (WER) of 7.42 on the MyST child speech dataset, 2.99 on the PFSTAR dataset, and 12.47 on the CMU KIDS dataset, outperforming all previous methods. Our models outperform wav2vec2 BASE 960, considered a state-of-the-art ASR model on adult speech, on child speech using just 10 hours of child speech data in finetuning. An analysis of the different types of training data and their effect on inference is also provided, using combinations of datasets in pretraining, finetuning, and inference.
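The WER figures quoted above follow the standard definition: the word-level edit distance (substitutions, insertions, deletions) between the reference transcript and the model hypothesis, divided by the number of reference words. A minimal sketch of this metric (an illustration of the standard formula, not the authors' evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution out of four reference words -> WER = 0.25
print(wer("the cat sat down", "the cat sit down"))  # prints 0.25
```

A reported WER of 7.42 on MyST thus means roughly 7.42 word errors per 100 reference words; in practice, libraries such as jiwer compute the same quantity.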
