Paper Title
AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios
Paper Authors
Paper Abstract
Adaptive text to speech (TTS) can synthesize new voices efficiently in zero-shot scenarios, by using a well-trained source TTS model without adapting it on the speech data of new speakers. Considering that seen and unseen speakers have diverse characteristics, zero-shot adaptive TTS requires strong generalization ability on speaker characteristics, which brings modeling challenges. In this paper, we develop AdaSpeech 4, a zero-shot adaptive TTS system for high-quality speech synthesis. We model the speaker characteristics systematically to improve generalization to new speakers. Generally, the modeling of speaker characteristics can be categorized into three steps: extracting a speaker representation, taking this speaker representation as a condition, and synthesizing the speech/mel-spectrogram given this speaker representation. Accordingly, we improve the modeling in three steps: 1) To extract speaker representations with better generalization, we factorize the speaker characteristics into basis vectors and extract the speaker representation by combining these basis vectors with attention-derived weights. 2) We leverage conditional layer normalization to integrate the extracted speaker representation into the TTS model. 3) We propose a novel supervision loss based on the distribution of basis vectors to maintain the corresponding speaker characteristics in generated mel-spectrograms. Without any fine-tuning, AdaSpeech 4 achieves better voice quality and similarity than baselines on multiple datasets.
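Steps 1 and 2 of the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names, tensor shapes, the mean-pooled query, and the plain dot-product attention are all illustrative assumptions (the actual system uses learned neural modules), but the sketch shows the two mechanisms named in the abstract — attention-weighted combination of speaker basis vectors, and conditional layer normalization whose scale and bias are predicted from the speaker representation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def extract_speaker_representation(ref_frames, basis, w_q):
    """Step 1 (sketch): attention over speaker basis vectors.

    ref_frames: (T, d_mel) reference mel frames of the target speaker.
    basis:      (n_basis, d) learned speaker basis vectors (assumed given).
    w_q:        (d_mel, d) projection from pooled frames to an attention query.
    Returns the speaker representation (d,) and the attention weights (n_basis,).
    """
    query = ref_frames.mean(axis=0) @ w_q                 # pooled query (d,)
    scores = basis @ query / np.sqrt(basis.shape[1])      # scaled dot-product
    weights = softmax(scores)                             # weights sum to 1
    return weights @ basis, weights                       # weighted combination

def conditional_layer_norm(x, spk, w_scale, w_bias, eps=1e-5):
    """Step 2 (sketch): conditional layer normalization.

    x:   (T, h) hidden states of a TTS model layer.
    spk: (d,)   extracted speaker representation.
    Scale and bias are predicted from the speaker representation instead of
    being fixed learned parameters, injecting speaker identity into the model.
    """
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_norm = (x - mean) / np.sqrt(var + eps)
    scale = spk @ w_scale                                 # speaker-dependent (h,)
    bias = spk @ w_bias                                   # speaker-dependent (h,)
    return x_norm * scale + bias
```

In the zero-shot setting, only `extract_speaker_representation` touches the new speaker's reference audio; no parameters are updated, which is why generalization rests on how well the basis vectors span unseen speaker characteristics.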