Exact Prosody Cloning in Zero-Shot Multispeaker Text-to-Speech

Paper title

Exact Prosody Cloning in Zero-Shot Multispeaker Text-to-Speech

Authors

Florian Lux, Julia Koch, Ngoc Thang Vu

Abstract

The cloning of a speaker's voice using an untranscribed reference sample is one of the great advances of modern neural text-to-speech (TTS) methods. Approaches for mimicking the prosody of a transcribed reference audio have also been proposed recently. In this work, we bring these two tasks together for the first time through utterance-level normalization in conjunction with an utterance-level speaker embedding. We further introduce a lightweight aligner for extracting fine-grained prosodic features; the aligner can be finetuned on individual samples within seconds. Our objective evaluation and human study show that it is possible to clone the voice of a speaker and the prosody of a spoken reference independently, without any degradation in quality and with high similarity to both the original voice and prosody. All of our code and trained models are available, alongside static and interactive demos.

Paper title