Paper title
Expressive-VC: Highly Expressive Voice Conversion with Attention Fusion of Bottleneck and Perturbation Features
Paper authors
Paper abstract
Voice conversion for highly expressive speech is challenging. Current approaches struggle to balance speaker similarity, intelligibility, and expressiveness. To address this problem, we propose Expressive-VC, a novel end-to-end voice conversion framework that leverages the advantages of both the neural bottleneck feature (BNF) approach and the information perturbation approach. Specifically, we use a BNF encoder and a Perturbed-Wav encoder to form a content extractor that learns linguistic and para-linguistic features, respectively, where the BNFs come from a robust pre-trained ASR model and the perturbed waveform becomes speaker-irrelevant after signal perturbation. We further fuse the linguistic and para-linguistic features through an attention mechanism in which speaker-dependent prosody features serve as the attention query. These prosody features are produced by a prosody encoder that takes the target speaker embedding and the normalized pitch and energy of the source speech as input. Finally, the decoder consumes the integrated features and the speaker-dependent prosody features to generate the converted speech. Experiments demonstrate that Expressive-VC is superior to several state-of-the-art systems, achieving both high expressiveness captured from the source speech and high speaker similarity with the target speaker, while intelligibility is well maintained.
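The attention fusion step described in the abstract can be illustrated with a toy sketch. The abstract does not specify the attention form, so everything below is an assumption made for illustration: the function name `attention_fuse` is hypothetical, scaled dot-product scoring is a swapped-in choice, and the three streams are assumed to be frame-aligned with equal dimension. It is not the authors' implementation; it only shows how a prosody query could weight linguistic (BNF) features against para-linguistic (perturbed-waveform) features frame by frame.

```python
import math


def attention_fuse(query, linguistic, paralinguistic):
    """Fuse two frame-aligned feature streams using per-frame attention.

    Hypothetical helper, not taken from the paper. Each argument is a list
    of frames, and each frame is a list of floats of the same dimension.
    Returns (fused_frames, weights_per_frame).
    """
    fused, weights = [], []
    for q, lin, para in zip(query, linguistic, paralinguistic):
        scale = math.sqrt(len(q))
        # Assumed scoring: scaled dot product of each stream with the query.
        scores = [
            sum(a * b for a, b in zip(q, feat)) / scale
            for feat in (lin, para)
        ]
        # Numerically stable softmax over the two streams.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        w_lin, w_para = (e / total for e in exps)
        # Convex combination of the two streams for this frame.
        fused.append([w_lin * a + w_para * b for a, b in zip(lin, para)])
        weights.append((w_lin, w_para))
    return fused, weights


# Toy input: two frames of 2-dimensional features (made-up values).
prosody_query = [[1.0, 0.0], [0.0, 1.0]]
bnf_features = [[2.0, 0.0], [0.0, 0.0]]
perturbed_features = [[0.0, 0.0], [0.0, 2.0]]
fused, weights = attention_fuse(prosody_query, bnf_features, perturbed_features)
```

In this toy input the first frame's query aligns with the BNF stream and the second with the perturbed-waveform stream, so the weights shift accordingly. A trained system would presumably use learned projections for the query and the two streams rather than raw dot products.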