Paper Title


Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling

Paper Authors

Songxiang Liu, Yuewen Cao, Disong Wang, Xixin Wu, Xunying Liu, Helen Meng

Paper Abstract


This paper proposes an any-to-many, location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach, which utilizes text supervision during training. In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module. During the training stage, an encoder-decoder-based hybrid connectionist-temporal-classification-attention (CTC-attention) phoneme recognizer is trained, whose encoder has a bottle-neck layer. A BNE is obtained from the phoneme recognizer and is utilized to extract speaker-independent, dense, and rich spoken content representations from spectral features. Then a multi-speaker, location-relative-attention-based seq2seq synthesis model is trained to reconstruct spectral features from the bottle-neck features, conditioned on speaker representations for speaker identity control in the generated speech. To mitigate the difficulty of aligning long sequences with seq2seq models, we down-sample the input spectral features along the temporal dimension and equip the synthesis model with a discretized mixture-of-logistics (MoL) attention mechanism. Since the phoneme recognizer is trained on a large speech recognition corpus, the proposed approach can conduct any-to-many voice conversion. Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity. Ablation studies are conducted to confirm the effectiveness of the feature selection and model design strategies in the proposed approach. The proposed VC approach can readily be extended to support any-to-any VC (also known as one/few-shot VC), and achieves high performance according to objective and subjective evaluations.
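The discretized mixture-of-logistics attention mentioned in the abstract assigns each encoder position the probability mass that a mixture of logistic distributions places on that position's unit interval, i.e. the difference of the logistic CDF at the interval boundaries. The sketch below is a minimal numpy illustration of that idea only; the function name, parameterization, and shapes are assumptions, not the authors' implementation (which would predict the mixture parameters per decoder step from the attention RNN state).

```python
import numpy as np

def sigmoid(x):
    # Logistic CDF with location 0 and scale 1.
    return 1.0 / (1.0 + np.exp(-x))

def mol_attention_weights(mu, s, w, num_positions):
    """Discretized mixture-of-logistics alignment probabilities.

    mu : (K,) component locations along the encoder axis
    s  : (K,) component scales (> 0)
    w  : (K,) mixture weights, summing to 1
    Returns a (num_positions,) vector of attention weights.
    """
    j = np.arange(num_positions)[:, None]        # (J, 1), broadcasts against (K,)
    # Mass each logistic component assigns to position j: CDF(j+0.5) - CDF(j-0.5).
    cdf_hi = sigmoid((j + 0.5 - mu) / s)
    cdf_lo = sigmoid((j - 0.5 - mu) / s)
    return ((cdf_hi - cdf_lo) * w).sum(axis=1)   # (J,)

# Single component centered at encoder position 10.
weights = mol_attention_weights(np.array([10.0]), np.array([1.0]),
                                np.array([1.0]), num_positions=50)
```

Because the weights are CDF differences, they are non-negative, sum to (almost) 1 when the mixture mass lies inside the encoder range, and peak at the component location, which is what makes this parameterization attractive for monotonic, location-relative alignment.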
