Paper title
Predicting phoneme-level prosody latents using AR and flow-based Prior Networks for expressive speech synthesis
Paper authors
Paper abstract
A large part of the expressive speech synthesis literature focuses on learning prosodic representations of the speech signal, which are then modeled by a prior distribution during inference. In this paper, we compare different prior architectures on the task of predicting phoneme-level prosodic representations extracted with an unsupervised FVAE model. We use both subjective and objective metrics to show that normalizing-flow-based prior networks can produce more expressive speech at the cost of a slight drop in quality. Furthermore, we show that, for a given text, the synthesized speech has higher variability due to the nature of normalizing flows. We also propose a Dynamical VAE model that can generate higher-quality speech, although with decreased expressiveness and variability compared to the flow-based models.
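The variability claim can be illustrated with a toy sketch. This is not the paper's architecture: it assumes a single conditional affine transform (the simplest possible normalizing flow, equivalent to a diagonal Gaussian), and the values of `mu` and `log_sigma` are hypothetical stand-ins for what a text encoder might output for one phoneme. Flow-based priors used in practice would stack many such invertible layers. The point is only that a sampled prior yields a different prosody latent on each draw for the same text, whereas a point predictor always returns the same value.

```python
import math
import random


def affine_flow_sample(mu, log_sigma, rng):
    """Draw z = mu + exp(log_sigma) * eps with eps ~ N(0, 1), per dimension."""
    return [m + math.exp(s) * rng.gauss(0.0, 1.0) for m, s in zip(mu, log_sigma)]


def affine_flow_log_prob(z, mu, log_sigma):
    """Log-density of z via change of variables: log N(eps; 0, 1) - log_sigma."""
    total = 0.0
    for zi, m, s in zip(z, mu, log_sigma):
        eps = (zi - m) * math.exp(-s)  # inverse transform back to the base space
        total += -0.5 * (eps * eps + math.log(2.0 * math.pi)) - s
    return total


# Hypothetical conditioning values for one phoneme (assumption, not from the paper).
mu = [0.2, -0.5]
log_sigma = [-1.0, -0.3]

rng = random.Random(0)
# Sampling the prior three times gives three different latents for the same text.
samples = [affine_flow_sample(mu, log_sigma, rng) for _ in range(3)]
# A deterministic point predictor returns the same latent every time.
deterministic = [list(mu) for _ in range(3)]

for z in samples:
    print([round(v, 3) for v in z], round(affine_flow_log_prob(z, mu, log_sigma), 3))
```

The log-probability function is included because exact likelihood evaluation through the inverse transform is what allows a flow prior to be trained by maximum likelihood on the extracted latents.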