Paper Title
Pretraining Approaches for Spoken Language Recognition: TalTech Submission to the OLR 2021 Challenge
Paper Authors
Paper Abstract
This paper investigates different pretraining approaches to spoken language identification. The paper is based on our submission to the Oriental Language Recognition 2021 Challenge. We participated in two tracks of the challenge: constrained and unconstrained language recognition. For the constrained track, we first trained a Conformer-based encoder-decoder model for multilingual automatic speech recognition (ASR), using the provided training data that had transcripts available. The shared encoder of the multilingual ASR model was then finetuned for the language identification task. For the unconstrained task, we relied on both externally available pretrained models and external data: the multilingual XLSR-53 wav2vec2.0 model was finetuned on the VoxLingua107 corpus for the language recognition task, and finally finetuned on the provided target language training data, augmented with CommonVoice data. Our primary metric $C_{\rm avg}$ values on the Test set are 0.0079 for the constrained task and 0.0119 for the unconstrained task, which resulted in second place in both rankings. In post-evaluation experiments, we study the amount of target language data needed for training an accurate backend model, the importance of multilingual pretraining data, and compare different models as finetuning starting points.
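For the unconstrained track, the abstract describes a two-stage finetuning pipeline starting from the XLSR-53 wav2vec2.0 checkpoint. The sketch below illustrates, under stated assumptions, what one such finetuning step for language identification could look like using the Hugging Face `transformers` library; the paper does not specify this toolkit, and the model name, label count, and dummy data here are illustrative assumptions, not the authors' exact recipe.

```python
# Minimal sketch (assumptions, not the authors' setup): finetuning the multilingual
# XLSR-53 wav2vec2.0 checkpoint for utterance-level language identification.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification

NUM_LANGUAGES = 13  # placeholder; set to the actual number of target languages

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53"
)
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    num_labels=NUM_LANGUAGES,  # adds a randomly initialized classification head
)

# Dummy 16 kHz waveforms standing in for real training utterances.
waveforms = [torch.randn(16000 * 3).numpy(), torch.randn(16000 * 5).numpy()]
inputs = feature_extractor(
    waveforms, sampling_rate=16000, return_tensors="pt", padding=True
)
labels = torch.tensor([0, 1])  # language indices for the two utterances

# One forward/backward step; in practice this would run inside a full training loop,
# first on VoxLingua107 and then on the target-language data, as the abstract describes.
outputs = model(**inputs, labels=labels)
outputs.loss.backward()
```

In this kind of setup, the second finetuning stage would simply repeat the same loop on the target-language (and CommonVoice-augmented) data, starting from the VoxLingua107-finetuned weights.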