Paper Title

Sequence-to-sequence Singing Voice Synthesis with Perceptual Entropy Loss

Authors

Jiatong Shi, Shuai Guo, Nan Huo, Yuekai Zhang, Qin Jin

Abstract

Neural network (NN) based singing voice synthesis (SVS) systems require sufficient data to train well and are prone to over-fitting when data is scarce. However, building SVS systems often runs into data limitations because of high data acquisition and annotation costs. In this work, we propose a Perceptual Entropy (PE) loss, derived from a psycho-acoustic hearing model, to regularize the network. Using a one-hour open-source singing voice database, we explore the impact of the PE loss on various mainstream sequence-to-sequence models, including RNN-based, transformer-based, and conformer-based models. Our experiments show that the PE loss can mitigate the over-fitting problem and significantly improve the synthesized singing quality, as reflected in both objective and subjective evaluations.
