Paper Title
RetrieverTTS: Modeling Decomposed Factors for Text-Based Speech Insertion
Authors
Abstract
This paper proposes a new "decompose-and-edit" paradigm for text-based speech insertion that supports insertions of arbitrary length and even full-sentence generation. In the proposed paradigm, global and local factors in speech are explicitly decomposed and separately manipulated to achieve high speaker similarity and continuous prosody. Specifically, we propose to represent the global factors by multiple tokens, which are extracted by a cross-attention operation and then injected back by a link-attention operation. Owing to this rich representation of global factors, we achieve high speaker similarity in a zero-shot manner. In addition, we introduce a prosody smoothing task that makes the local prosody factor context-aware and thereby achieves satisfactory prosody continuity. We further achieve high voice quality with an adversarial training stage. In subjective tests, our method achieves state-of-the-art performance in both naturalness and similarity. Audio samples can be found at https://ydcustc.github.io/retrieverTTS-demo/.
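The idea of representing global factors by multiple tokens extracted through cross-attention can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single-head attention, the function name `extract_global_tokens`, the random projection matrices, and all dimensions are assumptions chosen for illustration, and the link-attention injection step is not modeled here. The sketch only shows how a fixed set of query tokens can summarize an utterance of any length into a fixed number of global tokens.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def extract_global_tokens(frames, queries, w_k, w_v):
    """Hypothetical single-head cross-attention pooling.

    frames:  (T, d) frame-level speech features; T varies per utterance
    queries: (M, d) query tokens (learned in a real model, random here)
    w_k/w_v: (d, d) key and value projections (assumed, not from the paper)
    Returns (M, d) global tokens and the (M, T) attention weights.
    """
    keys = frames @ w_k
    values = frames @ w_v
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])  # (M, T)
    weights = softmax(scores, axis=-1)  # each query attends over all frames
    return weights @ values, weights


rng = np.random.default_rng(0)
d, num_tokens = 16, 4  # illustrative sizes only
queries = rng.normal(size=(num_tokens, d))
w_k = rng.normal(size=(d, d)) / np.sqrt(d)
w_v = rng.normal(size=(d, d)) / np.sqrt(d)

# Two reference utterances of different lengths (random stand-in features).
short_utt = rng.normal(size=(50, d))
long_utt = rng.normal(size=(300, d))

tokens_short, attn_short = extract_global_tokens(short_utt, queries, w_k, w_v)
tokens_long, attn_long = extract_global_tokens(long_utt, queries, w_k, w_v)

# Both utterances yield the same number of global tokens.
print(tokens_short.shape, tokens_long.shape)
```

Because the number of query tokens is fixed, the global representation has the same size for a short reference clip and a long one, which is one reason such a design is convenient for zero-shot use with unseen speakers.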