Paper title
A Hierarchical Speaker Representation Framework for One-shot Singing Voice Conversion
Paper authors
Paper abstract
Typically, singing voice conversion (SVC) depends on an embedding vector, extracted from either a speaker lookup table (LUT) or a speaker recognition network (SRN), to model speaker identity. However, singing contains more expressive speaker characteristics than conversational speech. It is suspected that a single embedding vector may capture only averaged, coarse-grained speaker characteristics, which is insufficient for the SVC task. To this end, this work proposes a novel hierarchical speaker representation framework for SVC that captures fine-grained speaker characteristics at different granularities. It consists of an up-sampling stream and three down-sampling streams. The up-sampling stream transforms linguistic features into audio samples, while one of the three down-sampling streams operates in the reverse direction. The temporal statistics of each down-sampling block are expected to represent speaker characteristics at a different granularity, and they are fed into the up-sampling blocks to enhance speaker modeling. Experimental results verify that the proposed method outperforms both LUT-based and SRN-based SVC systems. Moreover, the proposed system supports one-shot SVC with only a few seconds of reference audio.
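The idea of taking temporal statistics at several down-sampling levels can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's implementation: the abstract does not say which statistics are used (per-channel mean and standard deviation are assumed), average pooling stands in for the learned down-sampling blocks, and the names `avg_pool_time`, `temporal_stats`, and `hierarchical_speaker_stats` are hypothetical.

```python
import math
from typing import List, Tuple

Frames = List[List[float]]  # layout: [time][channels]


def avg_pool_time(frames: Frames, factor: int = 2) -> Frames:
    """Assumed stand-in for a learned down-sampling block:
    average every `factor` consecutive frames, dropping any remainder."""
    pooled = []
    for start in range(0, len(frames) - factor + 1, factor):
        window = frames[start:start + factor]
        pooled.append([sum(col) / factor for col in zip(*window)])
    return pooled


def temporal_stats(frames: Frames) -> Tuple[List[float], List[float]]:
    """Per-channel mean and population standard deviation over time."""
    n = len(frames)
    channels = list(zip(*frames))
    means = [sum(col) / n for col in channels]
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / n)
        for col, m in zip(channels, means)
    ]
    return means, stds


def hierarchical_speaker_stats(
    frames: Frames, num_levels: int = 3
) -> List[Tuple[List[float], List[float]]]:
    """Collect one (mean, std) pair per down-sampling level.
    Each level sees a coarser time resolution than the previous one."""
    stats = []
    current = frames
    for _ in range(num_levels):
        current = avg_pool_time(current)
        stats.append(temporal_stats(current))
    return stats


# Toy input: 8 frames, 2 channels (a ramp and a constant).
reference = [[float(t), 1.0] for t in range(8)]
levels = hierarchical_speaker_stats(reference, num_levels=3)
```

In a full system, each level's statistics would condition the up-sampling block of matching resolution. Because the statistics are pooled over time, they have a fixed size regardless of the reference length, which fits the one-shot setting where only a few seconds of reference audio are available.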