A Hierarchical Speaker Representation Framework for One-shot Singing Voice Conversion

Paper title

A Hierarchical Speaker Representation Framework for One-shot Singing Voice Conversion

Authors

Li, Xu; Liu, Shansong; Shan, Ying

Abstract

Typically, singing voice conversion (SVC) depends on an embedding vector, extracted from either a speaker lookup table (LUT) or a speaker recognition network (SRN), to model speaker identity. However, singing contains more expressive speaker characteristics than conversational speech. It is suspected that a single embedding vector may only capture averaged and coarse-grained speaker characteristics, which is insufficient for the SVC task. To this end, this work proposes a novel hierarchical speaker representation framework for SVC, which can capture fine-grained speaker characteristics at different granularities. It consists of an up-sampling stream and three down-sampling streams. The up-sampling stream transforms the linguistic features into audio samples, while one of the three down-sampling streams operates in the reverse direction. It is expected that the temporal statistics of each down-sampling block can represent speaker characteristics at a different granularity, and these statistics are fed into the up-sampling blocks to enhance speaker modeling. Experimental results verify that the proposed method outperforms both LUT-based and SRN-based SVC systems. Moreover, the proposed system supports one-shot SVC with only a few seconds of reference audio.
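The idea of taking temporal statistics at several granularities can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify the down-sampling blocks or which statistics are used, so the code below assumes average pooling as a stand-in for a learned down-sampling block and assumes per-channel mean and standard deviation as the statistics. The function names `temporal_stats`, `downsample`, and `hierarchical_speaker_stats`, and the parameters `num_levels` and `factor`, are hypothetical.

```python
import math


def temporal_stats(frames):
    """Per-channel mean and standard deviation over the time axis.

    Assumption: mean and std are the temporal statistics; the abstract
    does not say which statistics the paper uses.
    """
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dims)
    ]
    return means + stds


def downsample(frames, factor=2):
    """Average-pool consecutive frames.

    Hypothetical stand-in for a learned down-sampling block.
    """
    dims = len(frames[0])
    pooled = []
    for start in range(0, len(frames) - factor + 1, factor):
        chunk = frames[start:start + factor]
        pooled.append([sum(f[d] for f in chunk) / factor for d in range(dims)])
    return pooled


def hierarchical_speaker_stats(frames, num_levels=3, factor=2):
    """Collect one statistics vector per down-sampling level.

    Each level sees a coarser time resolution, so its statistics
    summarize the reference audio at a different granularity.
    """
    levels = []
    current = frames
    for _ in range(num_levels):
        current = downsample(current, factor)
        levels.append(temporal_stats(current))
    return levels


# Toy "reference audio" features: 64 frames with 2 channels each.
reference = [[math.sin(0.3 * t), math.cos(0.1 * t)] for t in range(64)]
stats = hierarchical_speaker_stats(reference)
```

Here `stats` holds one fixed-length vector per level regardless of the reference length, which is what would make such statistics usable from a short reference clip. In the framework described above, vectors of this kind would condition the corresponding up-sampling blocks; that conditioning step is not sketched here.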
