An ASR Guided Speech Intelligibility Measure for TTS Model Selection

Paper title

An ASR Guided Speech Intelligibility Measure for TTS Model Selection

Authors

Arun Baby, Saranya Vinnaitherthan, Nagaraj Adiga, Pranav Jawale, Sumukh Badam, Sharath Adavanne, Srikanth Konjeti

Abstract

The perceptual quality of neural text-to-speech (TTS) is highly dependent on the choice of the model during training. Selecting the model using a training-objective metric such as the least mean squared error does not always correlate with human perception. In this paper, we propose an objective metric based on the phone error rate (PER) to select the TTS model with the best speech intelligibility. The PER is computed between the input text to the TTS model and the text decoded from the synthesized speech using an automatic speech recognition (ASR) model, which is trained on the same data as the TTS model. With the help of subjective studies, we show that the TTS model chosen with the least PER on the validation split has significantly higher speech intelligibility compared to the model with the least training-objective metric loss. Finally, using the proposed PER and subjective evaluation, we show that the choice of the best TTS model depends on the genre of the target domain text. All our experiments are conducted on a Hindi language dataset. However, the proposed model selection method is language independent.
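
The selection procedure described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not give the exact PER formula, so the sketch assumes the conventional definition (substitutions + deletions + insertions, divided by the number of reference phones, pooled over the validation set). The function names `edit_distance`, `phone_error_rate`, and `select_checkpoint` are hypothetical, and the TTS synthesis and ASR decoding steps are left out; the code starts from phone sequences that are assumed to be already available.

```python
from typing import Dict, List, Sequence, Tuple

PhonePair = Tuple[Sequence[str], Sequence[str]]  # (reference phones, ASR-decoded phones)


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phone sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def phone_error_rate(pairs: List[PhonePair]) -> float:
    """Assumed PER definition: total edit errors / total reference phones."""
    total_ref = sum(len(ref) for ref, _ in pairs)
    if total_ref == 0:
        raise ValueError("reference phone sequences are empty")
    total_errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    return total_errors / total_ref


def select_checkpoint(per_by_checkpoint: Dict[str, float]) -> str:
    """Pick the checkpoint with the lowest validation PER."""
    return min(per_by_checkpoint, key=per_by_checkpoint.get)


if __name__ == "__main__":
    # Toy data: phones from the input text vs. phones decoded from synthesized speech.
    validation = [
        (["n", "a", "m", "a", "s", "t", "e"], ["n", "a", "m", "s", "t", "e"]),
        (["k", "a"], ["g", "a"]),
    ]
    print(phone_error_rate(validation))
    print(select_checkpoint({"ckpt_a": 0.12, "ckpt_b": 0.08}))
```

In practice, `phone_error_rate` would be evaluated once per saved TTS checkpoint on the validation split, and the resulting dictionary passed to `select_checkpoint`, in place of choosing the checkpoint with the lowest training-objective loss.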
