Non-parallel voice conversion based on source-to-target direct mapping

Paper title

Non-parallel voice conversion based on source-to-target direct mapping

Authors

Sunghee Jung, Youngjoo Suh, Yeunju Choi, Hoirin Kim

Abstract

Recent work using phonetic posteriorgrams (PPGs) for non-parallel voice conversion has significantly increased the usability of voice conversion, since the source and target databases no longer need to have matching content. In this approach, the PPGs serve as the linguistic bridge between source and target speaker features. However, PPG-based non-parallel voice conversion has a limitation: it needs two cascaded networks at conversion time, which makes it less suitable for real-time applications and vulnerable to the intelligibility of the source speaker at the conversion stage. To address this limitation, we propose a new non-parallel voice conversion technique that employs a single neural network for direct source-to-target voice parameter mapping. With this single-network structure, the proposed approach reduces both the conversion time and the number of network parameters, which can be especially important factors in embedded or real-time environments. Additionally, it improves the quality of voice conversion by skipping the phone recognizer at the conversion stage, which effectively prevents the possible loss of phonetic information that the PPG-based indirect method suffers from. Experiments show that our approach reduces the number of network parameters and the conversion time by 41.9% and 44.5%, respectively, with improved voice similarity over the original PPG-based method.
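To make the reported reductions concrete, the short sketch below converts the two percentages from the abstract into remaining fractions and an equivalent speedup factor. Only the 41.9% and 44.5% figures come from the abstract; the function names and the baseline parameter count are hypothetical and chosen purely for illustration.

```python
def remaining_fraction(reduction_percent: float) -> float:
    """Fraction of the baseline that remains after a percentage reduction."""
    return 1.0 - reduction_percent / 100.0


def speedup_factor(reduction_percent: float) -> float:
    """Equivalent speedup for a given percentage reduction in running time."""
    return 1.0 / remaining_fraction(reduction_percent)


# Figures reported in the abstract.
PARAM_REDUCTION = 41.9  # percent fewer network parameters
TIME_REDUCTION = 44.5   # percent less conversion time

param_fraction = remaining_fraction(PARAM_REDUCTION)  # about 0.581 of baseline
time_fraction = remaining_fraction(TIME_REDUCTION)    # about 0.555 of baseline
speedup = speedup_factor(TIME_REDUCTION)              # about 1.8x faster

# Hypothetical baseline size (NOT from the paper), for illustration only.
hypothetical_baseline_params = 10_000_000
hypothetical_proposed_params = hypothetical_baseline_params * param_fraction

print(f"parameters remaining: {param_fraction:.3f} of baseline")
print(f"conversion time remaining: {time_fraction:.3f} of baseline")
print(f"equivalent speedup: {speedup:.2f}x")
```

In other words, a 44.5% cut in conversion time corresponds to roughly a 1.8x speedup relative to the cascaded PPG-based baseline, under the assumption that both systems are timed on the same input.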
