Title
Towards Disentangled Speech Representations
Authors
Abstract
The careful construction of audio representations has become a dominant feature in the design of approaches to many speech tasks. Increasingly, such approaches have emphasized "disentanglement", where a representation contains only parts of the speech signal relevant to transcription while discarding irrelevant information. In this paper, we construct a representation learning task based on joint modeling of ASR and TTS, and seek to learn a representation of audio that disentangles that part of the speech signal that is relevant to transcription from that part which is not. We present empirical evidence that successfully finding such a representation is tied to the randomness inherent in training. We then make the observation that these desired, disentangled solutions to the optimization problem possess unique statistical properties. Finally, we show that enforcing these properties during training improves WER by 24.5% relative on average for our joint modeling task. These observations motivate a novel approach to learning effective audio representations.