Paper Title
Adversarial Multi-Task Learning for Disentangling Timbre and Pitch in Singing Voice Synthesis
Paper Authors
Paper Abstract
Recently, deep learning-based generative models have been introduced to generate singing voices. One approach is to predict the parametric vocoder features consisting of explicit speech parameters. This approach has the advantage that the meaning of each feature is explicitly distinguished. Another approach is to predict mel-spectrograms for a neural vocoder. However, parametric vocoders have limitations in voice quality, and mel-spectrogram features are difficult to model because the timbre and pitch information are entangled. In this study, we propose a singing voice synthesis model with multi-task learning that uses both approaches -- acoustic features for a parametric vocoder and mel-spectrograms for a neural vocoder. By using the parametric vocoder features as auxiliary features, the proposed model can efficiently disentangle and control the timbre and pitch components of the mel-spectrogram. Moreover, a generative adversarial network framework is applied to improve the quality of singing voices in a multi-singer model. Experimental results demonstrate that our proposed model can generate more natural singing voices than the single-task models, while performing better than the conventional parametric vocoder-based model.
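The multi-task setup described above can be sketched as a shared encoder feeding two output heads, with the total loss combining both reconstruction objectives. The sketch below is a minimal toy illustration, not the paper's actual architecture: the layer sizes, the linear encoder, the L1 losses, and the weighting factor `lam` are all assumptions made for illustration (the paper's network, loss terms, and adversarial component are more involved).

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_encoder(x, W):
    # Hypothetical shared encoder: a single linear layer with tanh.
    # The actual model would be a deep sequence network.
    return np.tanh(x @ W)

def multi_task_loss(mel_pred, mel_true, voc_pred, voc_true, lam=0.5):
    # Weighted sum of the two reconstruction objectives. L1 is used
    # here for illustration; the paper's exact losses (including the
    # adversarial term) may differ. `lam` balances the auxiliary
    # parametric-vocoder task against the main mel-spectrogram task.
    l_mel = np.abs(mel_pred - mel_true).mean()
    l_voc = np.abs(voc_pred - voc_true).mean()
    return l_mel + lam * l_voc

# Toy dimensions: 10 frames, 16-dim score/linguistic input,
# 80 mel bins, 5 parametric vocoder features.
x = rng.standard_normal((10, 16))
W_enc = rng.standard_normal((16, 32))
W_mel = rng.standard_normal((32, 80))
W_voc = rng.standard_normal((32, 5))

h = shared_encoder(x, W_enc)
mel_pred = h @ W_mel   # head 1: mel-spectrogram for a neural vocoder
voc_pred = h @ W_voc   # head 2: auxiliary parametric vocoder features

mel_true = rng.standard_normal((10, 80))
voc_true = rng.standard_normal((10, 5))
loss = multi_task_loss(mel_pred, mel_true, voc_pred, voc_true)
print(float(loss))
```

The key design point the abstract makes is that the auxiliary head forces the shared representation to separate explicitly parameterized quantities (e.g. pitch) from timbre, which in turn makes the mel-spectrogram head easier to control.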