Latent linguistic embedding for cross-lingual text-to-speech and voice conversion

Paper title

Latent linguistic embedding for cross-lingual text-to-speech and voice conversion

Authors

Hieu-Thi Luong, Junichi Yamagishi

Abstract

As the recently proposed voice cloning system, NAUTILUS, is capable of cloning unseen voices using untranscribed speech, we investigate the feasibility of using it to develop a unified cross-lingual TTS/VC system. Cross-lingual speech generation is the scenario in which speech utterances are generated with the voices of target speakers in a language not spoken by them originally. This type of system is not simply cloning the voice of the target speaker, but essentially creating a new voice that can be considered better than the original under a specific framing. By using a well-trained English latent linguistic embedding to create a cross-lingual TTS and VC system for several German, Finnish, and Mandarin speakers included in the Voice Conversion Challenge 2020, we show that our method not only creates cross-lingual VC with high speaker similarity but also can be seamlessly used for cross-lingual TTS without having to perform any extra steps. However, the subjective evaluations of perceived naturalness seemed to vary between target speakers, which is one aspect for future improvement.
