Paper title
Latent linguistic embedding for cross-lingual text-to-speech and voice conversion
Paper authors
Paper abstract
As the recently proposed voice cloning system, NAUTILUS, is capable of cloning unseen voices using untranscribed speech, we investigate the feasibility of using it to develop a unified cross-lingual TTS/VC system. Cross-lingual speech generation is the scenario in which speech utterances are generated with the voices of target speakers in a language not spoken by them originally. This type of system is not simply cloning the voice of the target speaker, but essentially creating a new voice that can be considered better than the original under a specific framing. By using a well-trained English latent linguistic embedding to create a cross-lingual TTS and VC system for several German, Finnish, and Mandarin speakers included in the Voice Conversion Challenge 2020, we show that our method not only creates cross-lingual VC with high speaker similarity but also can be seamlessly used for cross-lingual TTS without having to perform any extra steps. However, the subjective evaluations of perceived naturalness seemed to vary between target speakers, which is one aspect for future improvement.