End-to-End Speech Recognition and Disfluency Removal

Paper Title

End-to-End Speech Recognition and Disfluency Removal

Authors

Paria Jamshid Lou, Mark Johnson

Abstract

Disfluency detection is usually an intermediate step between an automatic speech recognition (ASR) system and a downstream task. By contrast, this paper aims to investigate the task of end-to-end speech recognition and disfluency removal. We specifically explore whether it is possible to train an ASR model to directly map disfluent speech into fluent transcripts, without relying on a separate disfluency detection model. We show that end-to-end models do learn to directly generate fluent transcripts; however, their performance is slightly worse than a baseline pipeline approach consisting of an ASR system and a disfluency detection model. We also propose two new metrics that can be used for evaluating integrated ASR and disfluency models. The findings of this paper can serve as a benchmark for further research on the task of end-to-end speech recognition and disfluency removal in the future.

Paper Title