Title
ASR2K: Speech Recognition for Around 2000 Languages without Audio
Authors
Abstract
Most recent speech recognition models rely on large supervised datasets, which are unavailable for many low-resource languages. In this work, we present a speech recognition pipeline that does not require any audio for the target language. The only assumption is that we have access to raw text datasets or a set of n-gram statistics. Our speech pipeline consists of three components: acoustic, pronunciation, and language models. Unlike the standard pipeline, our acoustic and pronunciation models use multilingual models without any supervision. The language model is built using n-gram statistics or the raw text dataset. We build speech recognition for 1909 languages by combining our pipeline with Crubadan, a large n-gram database of endangered languages. Furthermore, we test our approach on 129 languages across two datasets: Common Voice and the CMU Wilderness dataset. We achieve 50% CER and 74% WER on the Wilderness dataset with Crubadan statistics only, and improve these to 45% CER and 69% WER when using 10,000 raw text utterances.
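The abstract notes that the language model is built from either raw text or n-gram statistics. As a minimal sketch of that idea (not the paper's actual implementation, whose smoothing and tokenization details are not given here), the following Python snippet builds add-alpha smoothed bigram probabilities from a handful of raw text utterances:

```python
from collections import Counter

def build_bigram_counts(utterances):
    """Count unigrams and bigrams over whitespace-tokenized utterances,
    with sentence-boundary markers <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for utt in utterances:
        tokens = ["<s>"] + utt.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(unigrams, bigrams, prev, word, alpha=1.0):
    """Add-alpha smoothed conditional probability P(word | prev).
    alpha=1.0 gives plain Laplace smoothing; the vocabulary size is
    taken as the number of observed unigram types."""
    vocab = len(unigrams)
    return (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * vocab)

# Toy usage: two "raw text utterances" stand in for a text corpus.
unigrams, bigrams = build_bigram_counts(["a b", "a c"])
p = bigram_prob(unigrams, bigrams, "a", "b")
```

The same conditional probabilities could equally be derived from precomputed n-gram counts (as in a Crubadan-style database) instead of raw text, which is the distinction the abstract draws between its two data sources.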