On Word Error Rate Definitions and their Efficient Computation for Multi-Speaker Speech Recognition Systems

Paper Title

On Word Error Rate Definitions and their Efficient Computation for Multi-Speaker Speech Recognition Systems

Authors

Thilo von Neumann, Christoph Boeddeker, Keisuke Kinoshita, Marc Delcroix, Reinhold Haeb-Umbach

Abstract

We propose a general framework to compute the word error rate (WER) of ASR systems that process recordings containing multiple speakers at their input and that produce multiple output word sequences (MIMO). Such ASR systems are typically required, e.g., for meeting transcription. We provide an efficient implementation based on a dynamic programming search in a multi-dimensional Levenshtein distance tensor under the constraint that a reference utterance must be matched consistently with one hypothesis output. This also results in an efficient implementation of the ORC WER which previously suffered from exponential complexity. We give an overview of commonly used WER definitions for multi-speaker scenarios and show that they are specializations of the above MIMO WER tuned to particular application scenarios. We conclude with a discussion of the pros and cons of the various WER definitions and a recommendation when to use which.
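To make the complexity remark in the abstract concrete, here is a minimal Python sketch of a brute-force ORC WER. It assumes the usual reading of ORC WER: each reference utterance is assigned to exactly one hypothesis output channel, utterance order is kept within a channel, and the assignment with the fewest word errors is chosen. This is not the paper's efficient algorithm. It enumerates every assignment, which is the exponential behaviour the abstract says the dynamic programming search avoids. The function names `edit_distance` and `orc_wer_bruteforce` are illustrative and do not come from the paper.

```python
from itertools import product


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))  # distance from an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def orc_wer_bruteforce(ref_utterances, hyp_channels):
    """Try every utterance-to-channel assignment (exponential cost).

    Assumption: utterances are given in temporal order, and that order is
    preserved when several utterances land on the same channel.
    """
    refs = [u.split() for u in ref_utterances]
    hyps = [h.split() for h in hyp_channels]
    total_ref_words = sum(len(u) for u in refs)
    if total_ref_words == 0 or not hyps:
        raise ValueError("need at least one reference word and one channel")

    best_errors = None
    # len(hyps) ** len(refs) candidate assignments
    for assignment in product(range(len(hyps)), repeat=len(refs)):
        streams = [[] for _ in hyps]
        for utterance, channel in zip(refs, assignment):
            streams[channel].extend(utterance)
        errors = sum(edit_distance(s, h) for s, h in zip(streams, hyps))
        if best_errors is None or errors < best_errors:
            best_errors = errors
    return best_errors / total_ref_words


refs = ["hello world", "good morning", "how are you"]
hyps = ["hello world how are", "good evening"]
print(orc_wer_bruteforce(refs, hyps))
```

With three utterances and two channels the loop visits only 2^3 = 8 assignments, but the count grows exponentially with the number of utterances, so this sketch is practical only for very short recordings.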
