Paper Title

Cross-Lingual Text-to-Speech Using Multi-Task Learning and Speaker Classifier Joint Training

Paper Authors

Yang, J., He, L.

Paper Abstract

In cross-lingual speech synthesis, the speech in various languages can be synthesized for a monoglot speaker. Normally, only the data of monoglot speakers are available for model training, thus the speaker similarity is relatively low between the synthesized cross-lingual speech and the native language recordings. Based on the multilingual transformer text-to-speech model, this paper studies a multi-task learning framework to improve the cross-lingual speaker similarity. To further improve the speaker similarity, joint training with a speaker classifier is proposed. Here, a scheme similar to parallel scheduled sampling is proposed to train the transformer model efficiently to avoid breaking the parallel training mechanism when introducing joint training. By using multi-task learning and speaker classifier joint training, in subjective and objective evaluations, the cross-lingual speaker similarity can be consistently improved for both the seen and unseen speakers in the training set.
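
To make the training scheme described in the abstract more concrete, below is a minimal PyTorch sketch of joint training with a speaker classifier combined with a two-pass, parallel-scheduled-sampling-style decoder input. It is an illustrative sketch under assumed interfaces, not the authors' implementation: the `tts_decoder` call signature, `SpeakerClassifier` design, `mix_prob`, and the 0.1 loss weight are all assumptions for demonstration.

```python
# Hypothetical sketch (not the paper's code): joint training of a Transformer-TTS
# decoder with a speaker classifier, keeping decoding parallel during training
# via a two-pass scheme similar in spirit to parallel scheduled sampling.
import torch
import torch.nn as nn

class SpeakerClassifier(nn.Module):
    """Frame-level speaker classifier applied to predicted mel frames (illustrative)."""
    def __init__(self, n_mels=80, hidden=256, n_speakers=50):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mels, hidden), nn.ReLU(),
            nn.Linear(hidden, n_speakers),
        )

    def forward(self, mel):          # mel: (B, T, n_mels)
        return self.net(mel)         # logits: (B, T, n_speakers)

def joint_training_step(tts_decoder, spk_clf, text_memory, mel_target,
                        speaker_ids, mix_prob=0.5):
    """One training step: TTS reconstruction loss plus a speaker-classification
    loss on the predicted mel spectrogram (assumed decoder interface)."""
    B, T, n_mels = mel_target.shape

    # Pass 1: ordinary teacher forcing -- decoder input is the ground-truth
    # mel shifted right, so all T frames are predicted in parallel.
    go_frame = torch.zeros_like(mel_target[:, :1])
    shifted = torch.cat([go_frame, mel_target[:, :-1]], dim=1)
    mel_pred_tf = tts_decoder(shifted, text_memory)

    # Pass 2: mix ground-truth frames with detached first-pass predictions as
    # decoder input, so prediction-aware training still runs in parallel.
    mix_mask = (torch.rand(B, T, 1, device=mel_target.device) < mix_prob).float()
    mixed_in = mix_mask * mel_pred_tf.detach() + (1.0 - mix_mask) * mel_target
    shifted2 = torch.cat([go_frame, mixed_in[:, :-1]], dim=1)
    mel_pred = tts_decoder(shifted2, text_memory)

    # Reconstruction loss plus speaker loss; the speaker gradient flows back
    # into the decoder through mel_pred, which is the point of joint training.
    recon_loss = nn.functional.l1_loss(mel_pred, mel_target)
    spk_logits = spk_clf(mel_pred).mean(dim=1)        # average logits over time
    spk_loss = nn.functional.cross_entropy(spk_logits, speaker_ids)
    return recon_loss + 0.1 * spk_loss                # weight is illustrative
```

The two-pass structure is the key design choice suggested by the abstract: feeding back the model's own (detached) predictions exposes the classifier and decoder to synthesis-like inputs without reverting to slow, step-by-step autoregressive decoding during training.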
