Paper Title
Improving Cross-Lingual Transfer Learning for End-to-End Speech Recognition with Speech Translation
Paper Authors
Paper Abstract
Transfer learning from high-resource languages is known to be an efficient way to improve end-to-end automatic speech recognition (ASR) for low-resource languages. Pre-trained or jointly trained encoder-decoder models, however, do not share the language modeling (decoder) for the same language, which is likely to be inefficient for distant target languages. We introduce speech-to-text translation (ST) as an auxiliary task to incorporate additional knowledge of the target language and enable transferring from that target language. Specifically, we first translate high-resource ASR transcripts into a target low-resource language, with which an ST model is trained. Both ST and target ASR share the same attention-based encoder-decoder architecture and vocabulary. The former task then provides a fully pre-trained model for the latter, bringing up to a 24.6% word error rate (WER) reduction over the baseline (direct transfer from high-resource ASR). We show that training ST with human translations is not necessary: ST trained with machine translation (MT) pseudo-labels brings consistent gains. With only 500K MT examples, it can even outperform ST trained with human labels when transferred to target ASR. Even with pseudo-labels from low-resource MT (200K examples), ST-enhanced transfer brings up to an 8.9% WER reduction over direct transfer.
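The key structural point of the abstract is why ST pre-training transfers more than direct ASR transfer: both ST and target ASR decode text in the same target language and vocabulary, so the decoder transfers along with the encoder. The schematic sketch below illustrates this; all names and the dict-of-components model representation are hypothetical, not the authors' implementation.

```python
# Schematic sketch of the transfer setups described in the abstract.
# Models are represented as dicts of components; values stand in for
# learned parameters. All names here are illustrative assumptions.

def build_model(vocab):
    # Attention-based encoder-decoder; the decoder is tied to an
    # output vocabulary (i.e., to the language it generates).
    return {"encoder": "enc_params", "decoder": f"dec_params[{vocab}]"}

def transferable(src_model, tgt_model):
    # A component can initialize the target model only if the source
    # uses it identically (for the decoder: same output vocabulary).
    return [k for k in tgt_model if src_model.get(k) == tgt_model[k]]

# Direct transfer: pre-train ASR on a high-resource language, then
# fine-tune on the low-resource target. The decoders model different
# languages, so only the encoder carries over.
hi_res_asr = build_model(vocab="high")
tgt_asr = build_model(vocab="target")
print(transferable(hi_res_asr, tgt_asr))   # encoder only

# ST-enhanced transfer: translate high-resource transcripts into the
# target language (human labels or MT pseudo-labels) and train an ST
# model that decodes target-language text. Its decoder shares the
# target vocabulary, so the full model pre-trains the target ASR.
st_model = build_model(vocab="target")
print(transferable(st_model, tgt_asr))     # encoder and decoder
```

This mirrors the abstract's claim that ST provides a "fully pre-trained model" for target ASR, whereas direct transfer from high-resource ASR leaves the decoder untransferred.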