Paper Title
Speaker Conditional WaveRNN: Towards Universal Neural Vocoder for Unseen Speaker and Recording Conditions
Paper Authors
Paper Abstract
Recent advancements in deep learning have led to human-level performance in single-speaker speech synthesis. However, speech quality still degrades when such systems are generalized into multi-speaker models, especially for unseen speakers and unseen recording qualities. For instance, conventional neural vocoders are tuned to the training speaker and generalize poorly to unseen speakers. In this work, we propose a variant of WaveRNN, referred to as speaker conditional WaveRNN (SC-WaveRNN). We target the development of an efficient universal vocoder that works even for unseen speakers and recording conditions. In contrast to standard WaveRNN, SC-WaveRNN exploits additional information given in the form of speaker embeddings. Trained on publicly available data, SC-WaveRNN achieves significantly better performance than baseline WaveRNN on both subjective and objective metrics. In MOS, SC-WaveRNN achieves an improvement of about 23% for seen speakers under seen recording conditions and up to 95% for unseen speakers under unseen conditions. Finally, we extend our work by implementing multi-speaker text-to-speech (TTS) synthesis similar to zero-shot speaker adaptation. In terms of listener preference, our system is favored over the baseline TTS system by 60% to 15.5% for seen speakers and by 60.9% to 32.6% for unseen speakers.
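The abstract states that SC-WaveRNN "exploits additional information given in the form of speaker embeddings" but gives no implementation detail here. A common way to realize such conditioning is to tile a fixed utterance-level speaker embedding across time and concatenate it with the local acoustic features fed to the vocoder RNN. The sketch below assumes this mechanism; the function name, feature shapes, and dimensions are illustrative, not taken from the paper.

```python
import numpy as np

def condition_features(mel, speaker_emb):
    """Build vocoder conditioning input by attaching a speaker embedding.

    mel:          (T, n_mels) local conditioning features (e.g. mel-spectrogram)
    speaker_emb:  (d,) utterance-level speaker embedding (assumed fixed per utterance)
    returns:      (T, n_mels + d) frame-level conditioning for the vocoder RNN
    """
    T = mel.shape[0]
    # Tile the time-invariant speaker embedding so every frame carries it.
    tiled = np.tile(speaker_emb, (T, 1))
    return np.concatenate([mel, tiled], axis=1)

# Hypothetical shapes: 100 mel frames, 80 mel bins, 256-dim speaker embedding.
mel = np.random.randn(100, 80)
emb = np.random.randn(256)
cond = condition_features(mel, emb)
print(cond.shape)  # (100, 336)
```

In a zero-shot setting such as the one described, `speaker_emb` would come from a separately trained speaker encoder applied to a reference utterance of the (possibly unseen) target speaker, so the vocoder weights need no per-speaker adaptation.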