Voice Conversion by Cascading Automatic Speech Recognition and Text-to-Speech Synthesis with Prosody Transfer

Paper title

Voice Conversion by Cascading Automatic Speech Recognition and Text-to-Speech Synthesis with Prosody Transfer

Authors

Zhang, Jing-Xuan; Liu, Li-Juan; Chen, Yan-Nian; Hu, Ya-Jun; Jiang, Yuan; Ling, Zhen-Hua; Dai, Li-Rong

Abstract

With the development of automatic speech recognition (ASR) and text-to-speech synthesis (TTS) techniques, it is intuitive to construct a voice conversion system by cascading an ASR system and a TTS system. In this paper, we present an ASR-TTS method for voice conversion, which uses the iFLYTEK ASR engine to transcribe the source speech into text and a Transformer TTS model with a WaveNet vocoder to synthesize the converted speech from the decoded text. For the TTS model, we propose to use a prosody code to describe the prosody information contained in speech beyond the text and speaker information. A prosody encoder is used to extract the prosody code. During conversion, the source prosody is transferred to the converted speech by conditioning the Transformer TTS model on the prosody code. Experiments were conducted to demonstrate the effectiveness of the proposed method. Our system also obtained the best naturalness and similarity in the mono-lingual task of the Voice Conversion Challenge 2020.
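
The data flow described in the abstract can be sketched roughly as follows. This is only an illustrative wiring, not the authors' implementation: the class name `CascadeVC`, the callable signatures, the `target_speaker` argument, and the toy stand-in functions are all assumptions made for this example. The abstract does not say how speaker identity is supplied to the TTS model, so it is passed here as a plain label.

```python
from dataclasses import dataclass
from typing import Callable, List

Waveform = List[float]         # raw audio samples
Frames = List[List[float]]     # acoustic frames, e.g. mel-spectrogram rows
ProsodyCode = List[float]      # fixed-size vector summarizing prosody


@dataclass
class CascadeVC:
    """Hypothetical wiring of the four stages named in the abstract."""
    asr: Callable[[Waveform], str]
    prosody_encoder: Callable[[Waveform], ProsodyCode]
    tts: Callable[[str, ProsodyCode, str], Frames]
    vocoder: Callable[[Frames], Waveform]

    def convert(self, source: Waveform, target_speaker: str) -> Waveform:
        text = self.asr(source)                      # 1. transcribe the source speech
        code = self.prosody_encoder(source)          # 2. extract prosody from the same source
        frames = self.tts(text, code, target_speaker)  # 3. synthesize, conditioned on the code
        return self.vocoder(frames)                  # 4. turn acoustic frames into a waveform


# Toy stand-ins so the sketch runs; real components would be trained models.
def toy_asr(wav: Waveform) -> str:
    return "hello"

def toy_prosody_encoder(wav: Waveform) -> ProsodyCode:
    return [sum(wav) / len(wav)]                     # mean amplitude as a fake "prosody code"

def toy_tts(text: str, code: ProsodyCode, speaker: str) -> Frames:
    return [[code[0]] for _ in text]                 # one frame per character

def toy_vocoder(frames: Frames) -> Waveform:
    return [frame[0] for frame in frames]            # one sample per frame


vc = CascadeVC(toy_asr, toy_prosody_encoder, toy_tts, toy_vocoder)
converted = vc.convert([0.1, 0.3], target_speaker="target")
```

The point of the sketch is that the source utterance is used twice: once by the ASR stage to obtain text, and once by the prosody encoder, so that prosody survives the text bottleneck.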
