Quasi-Periodic WaveNet: An Autoregressive Raw Waveform Generative Model with Pitch-dependent Dilated Convolution Neural Network

Paper Title

Quasi-Periodic WaveNet: An Autoregressive Raw Waveform Generative Model with Pitch-dependent Dilated Convolution Neural Network

Authors

Wu, Yi-Chiao; Hayashi, Tomoki; Tobing, Patrick Lumban; Kobayashi, Kazuhiro; Toda, Tomoki

Abstract

In this paper, a pitch-adaptive waveform generative model named Quasi-Periodic WaveNet (QPNet) is proposed to improve the limited pitch controllability of vanilla WaveNet (WN) using pitch-dependent dilated convolution neural networks (PDCNNs). Specifically, as a probabilistic autoregressive generation model with stacked dilated convolution layers, WN achieves high-fidelity audio waveform generation. However, the pure-data-driven nature and the lack of prior knowledge of audio signals degrade the pitch controllability of WN. For instance, it is difficult for WN to precisely generate the periodic components of audio signals when the given auxiliary fundamental frequency ($F_{0}$) features are outside the $F_{0}$ range observed in the training data. To address this problem, QPNet with two novel designs is proposed. First, the PDCNN component is applied to dynamically change the network architecture of WN according to the given auxiliary $F_{0}$ features. Second, a cascaded network structure is utilized to simultaneously model the long- and short-term dependencies of quasi-periodic signals such as speech. The performances of single-tone sinusoid and speech generations are evaluated. The experimental results show the effectiveness of the PDCNNs for unseen auxiliary $F_{0}$ features and the effectiveness of the cascaded structure for speech generation.
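
The abstract says the PDCNN changes the network architecture according to the auxiliary $F_{0}$ features, but it does not give the rule. The following is a minimal sketch of one way a pitch-dependent dilation could work, assuming the dilation is scaled by the pitch period in samples. The scaling rule, the `dense_factor` parameter, and both function names are illustrative assumptions, not details taken from the abstract.

```python
import numpy as np

def pitch_dependent_dilations(f0, fs, base_dilation=1, dense_factor=4):
    """Per-sample dilation sizes.

    Assumed rule (not from the abstract): scale the base dilation by the
    pitch period in samples divided by `dense_factor`, a hypothetical
    number of taps per pitch cycle. Assumes every f0 value is > 0 (voiced).
    """
    f0 = np.asarray(f0, dtype=float)
    scale = np.maximum(1, np.round(fs / (f0 * dense_factor))).astype(int)
    return base_dilation * scale

def pd_causal_conv(x, dilations, w_past, w_now):
    """Two-tap causal convolution whose look-back distance varies per sample."""
    x = np.asarray(x, dtype=float)
    idx = np.arange(len(x)) - np.asarray(dilations)
    # Samples that would be read from before the start of the signal are zero.
    past = np.where(idx >= 0, x[np.clip(idx, 0, None)], 0.0)
    return w_past * past + w_now * x

# Example: at 16 kHz, a higher F0 gives a shorter look-back distance.
dilations = pitch_dependent_dilations([200.0, 400.0], fs=16000)
```

Under this assumed rule, the receptive field stretches or shrinks with the pitch period, so a fixed set of weights sees roughly the same number of pitch cycles regardless of $F_{0}$. That is one plausible reading of how such a layer could handle $F_{0}$ values outside the training range.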
