Paper Title

Simple and Effective Multi-sentence TTS with Expressive and Coherent Prosody

Paper Authors

Peter Makarov, Ammar Abbas, Mateusz Łajszczak, Arnaud Joly, Sri Karlapati, Alexis Moinet, Thomas Drugman, Penny Karanasou

Paper Abstract

Generating expressive and contextually appropriate prosody remains a challenge for modern text-to-speech (TTS) systems. This is particularly evident for long, multi-sentence inputs. In this paper, we examine simple extensions to a Transformer-based FastSpeech-like system, with the goal of improving prosody for multi-sentence TTS. We find that long context, powerful text features, and training on multi-speaker data all improve prosody. More interestingly, they result in synergies. Long context disambiguates prosody, improves coherence, and plays to the strengths of Transformers. Fine-tuning word-level features from a powerful language model, such as BERT, appears to profit from more training data, readily available in a multi-speaker setting. We look into objective metrics on pausing and pacing and perform thorough subjective evaluations for speech naturalness. Our main system, which incorporates all the extensions, achieves consistently strong results, including statistically significant improvements in speech naturalness over all its competitors.
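
The abstract mentions fine-tuning word-level features from a language model such as BERT. As a minimal sketch of what word-level feature extraction can look like (not the paper's exact setup: the model checkpoint, mean-pooling of subword embeddings, and function name are assumptions for illustration):

```python
# Hypothetical sketch: word-level BERT features via mean-pooling of subword embeddings.
# The checkpoint and pooling strategy are assumptions, not the paper's configuration.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")  # could be fine-tuned jointly with a TTS acoustic model


def word_level_features(words: list[str]) -> torch.Tensor:
    """Return one feature vector per input word by averaging its subword embeddings."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    hidden = bert(**enc).last_hidden_state[0]  # (num_subwords, hidden_dim)
    word_ids = enc.word_ids(0)                 # maps each subword position to its source word (or None)
    feats = []
    for w in range(len(words)):
        idx = [i for i, wid in enumerate(word_ids) if wid == w]
        feats.append(hidden[idx].mean(dim=0))  # average the subword vectors belonging to this word
    return torch.stack(feats)                  # (num_words, hidden_dim)


print(word_level_features(["Generating", "expressive", "prosody", "is", "hard"]).shape)
```

Such per-word vectors could then be upsampled to phoneme level and concatenated with the encoder inputs of a FastSpeech-like model, which is one common way to inject contextual text features.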
