Sequence-to-sequence Singing Voice Synthesis with Perceptual Entropy Loss

Paper title

Sequence-to-sequence Singing Voice Synthesis with Perceptual Entropy Loss

Authors

Jiatong Shi, Shuai Guo, Nan Huo, Yuekai Zhang, Qin Jin

Abstract

Neural network (NN) based singing voice synthesis (SVS) systems require sufficient data to train well and are prone to over-fitting when data is scarce. However, we often encounter data limitation problems in building SVS systems because of high data acquisition and annotation costs. In this work, we propose a Perceptual Entropy (PE) loss derived from a psycho-acoustic hearing model to regularize the network. With a one-hour open-source singing voice database, we explore the impact of the PE loss on various mainstream sequence-to-sequence models, including RNN-based, transformer-based, and conformer-based models. Our experiments show that the PE loss can mitigate the over-fitting problem and significantly improve the synthesized singing quality, as reflected in both objective and subjective evaluations.
