Title

Stutter-TTS: Controlled Synthesis and Improved Recognition of Stuttered Speech

Authors

Xin Zhang, Iván Vallés-Pérez, Andreas Stolcke, Chengzhu Yu, Jasha Droppo, Olabanji Shonibare, Roberto Barra-Chicote, Venkatesh Ravichandran

Abstract

Stuttering is a speech disorder where the natural flow of speech is interrupted by blocks, repetitions or prolongations of syllables, words and phrases. The majority of existing automatic speech recognition (ASR) interfaces perform poorly on utterances with stutter, mainly due to lack of matched training data. Synthesis of speech with stutter thus presents an opportunity to improve ASR for this type of speech. We describe Stutter-TTS, an end-to-end neural text-to-speech model capable of synthesizing diverse types of stuttering utterances. We develop a simple, yet effective prosody-control strategy whereby additional tokens are introduced into source text during training to represent specific stuttering characteristics. By choosing the position of the stutter tokens, Stutter-TTS allows word-level control of where stuttering occurs in the synthesized utterance. We are able to synthesize stutter events with high accuracy (F1-scores between 0.63 and 0.84, depending on stutter type). By fine-tuning an ASR model on synthetic stuttered speech we are able to reduce word error by 5.7% relative on stuttered utterances, with only minor (<0.2% relative) degradation for fluent utterances.
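The prosody-control strategy described above can be sketched as a simple text-preprocessing step: a token marking the stutter type is inserted before the word where the disfluency should occur, and the augmented text is used as the model's input. The token names and the helper below are illustrative assumptions, not the paper's actual vocabulary or implementation.

```python
# Hedged sketch of word-level stutter control via source-text tokens.
# Token names ("[REP]", "[PRL]", etc.) are hypothetical placeholders;
# the paper only states that tokens represent stuttering characteristics.

def insert_stutter_tokens(words, stutter_positions):
    """Insert a stutter-type token before each chosen word index,
    giving word-level control over where stuttering is synthesized.

    words: list of words in the source text
    stutter_positions: dict mapping word index -> stutter token
    """
    out = []
    for i, word in enumerate(words):
        if i in stutter_positions:
            out.append(stutter_positions[i])  # e.g. "[REP]" for a repetition
        out.append(word)
    return " ".join(out)

text = "please call stella and ask her"
# Request a repetition before word 2 and a prolongation before word 4.
source = insert_stutter_tokens(text.split(), {2: "[REP]", 4: "[PRL]"})
# -> "please call [REP] stella and [PRL] ask her"
```

The augmented string would then be fed to the TTS front end in place of the plain text, so the model learns (and at inference time obeys) the association between each token and the corresponding stutter event.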
