GlowVC: Mel-spectrogram space disentangling model for language-independent text-free voice conversion

Paper title

GlowVC: Mel-spectrogram space disentangling model for language-independent text-free voice conversion

Authors

Magdalena Proszewska, Grzegorz Beringer, Daniel Sáez-Trigueros, Thomas Merritt, Abdelhamid Ezzerg, Roberto Barra-Chicote

Abstract

In this paper, we propose GlowVC: a multilingual multi-speaker flow-based model for language-independent text-free voice conversion (VC). We build on Glow-TTS, which provides an architecture that enables the use of linguistic features during training without requiring them for VC inference. We consider two versions of our model: GlowVC-conditional and GlowVC-explicit. GlowVC-conditional models the distribution of mel-spectrograms with speaker-conditioned flow and disentangles the mel-spectrogram space into content- and pitch-relevant dimensions, while GlowVC-explicit models the explicit distribution with unconditioned flow and disentangles said space into content-, pitch- and speaker-relevant dimensions. We evaluate our models in terms of intelligibility, speaker similarity and naturalness for intra- and cross-lingual conversion in seen and unseen languages. GlowVC models greatly outperform the AutoVC baseline in terms of intelligibility, while achieving just as high speaker similarity in intra-lingual VC and slightly worse speaker similarity in the cross-lingual setting. Moreover, we demonstrate that GlowVC-explicit surpasses both GlowVC-conditional and AutoVC in terms of naturalness.
