Title
ASR2K: Speech Recognition for Around 2000 Languages without Audio
Authors
Abstract
Most recent speech recognition models rely on large supervised datasets, which are unavailable for many low-resource languages. In this work, we present a speech recognition pipeline that does not require any audio for the target language. The only assumption is that we have access to raw text datasets or a set of n-gram statistics. Our speech pipeline consists of three components: acoustic, pronunciation, and language models. Unlike the standard pipeline, our acoustic and pronunciation models use multilingual models without any supervision. The language model is built using n-gram statistics or the raw text dataset. We build speech recognition for 1909 languages by combining our pipeline with Crubadan, a large n-gram database of endangered languages. Furthermore, we test our approach on 129 languages across two datasets: Common Voice and the CMU Wilderness dataset. We achieve 50% CER and 74% WER on the Wilderness dataset with Crubadan statistics only, and improve these to 45% CER and 69% WER when using 10,000 raw text utterances.
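The abstract notes that the language model is built from either raw text or n-gram statistics. As a minimal sketch of that idea (not the paper's actual implementation, whose smoothing and tokenization details are not given here), the following Python snippet builds add-alpha smoothed bigram probabilities from a handful of raw text utterances:

```python
from collections import Counter

def build_bigram_counts(utterances):
    """Count unigrams and bigrams over whitespace-tokenized utterances,
    with sentence-boundary markers <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for utt in utterances:
        tokens = ["<s>"] + utt.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(unigrams, bigrams, prev, word, alpha=1.0):
    """Add-alpha smoothed conditional probability P(word | prev).
    alpha=1.0 gives plain Laplace smoothing; the vocabulary size is
    taken as the number of observed unigram types."""
    vocab = len(unigrams)
    return (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * vocab)

# Toy usage: two "raw text utterances" stand in for a text corpus.
unigrams, bigrams = build_bigram_counts(["a b", "a c"])
p = bigram_prob(unigrams, bigrams, "a", "b")
```

The same conditional probabilities could equally be derived from precomputed n-gram counts (as in a Crubadan-style database) instead of raw text, which is the distinction the abstract draws between its two data sources.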