Paper Title

CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus

Authors

Changhan Wang, Juan Pino, Anne Wu, Jiatao Gu

Abstract

Spoken language translation has recently witnessed a resurgence in popularity, thanks to the development of end-to-end models and the creation of new corpora, such as Augmented LibriSpeech and MuST-C. Existing datasets either have English as the source language, cover very specific domains, or are low-resource. We introduce CoVoST, a multilingual speech-to-text translation corpus from 11 languages into English, diversified with over 11,000 speakers and over 60 accents. We describe the dataset creation methodology and provide empirical evidence of the quality of the data. We also provide initial benchmarks, including, to our knowledge, the first end-to-end many-to-one multilingual models for spoken language translation. CoVoST is released under a CC0 license and is free to use. We also provide additional evaluation data derived from Tatoeba under CC licenses.
