Paper title
Self-supervised learning with bi-label masked speech prediction for streaming multi-talker speech recognition
Paper authors
Abstract
Self-supervised learning (SSL), which utilizes the input data itself for representation learning, has achieved state-of-the-art results for various downstream speech tasks. However, most of the previous studies focused on offline single-talker applications, with limited investigations in multi-talker cases, especially for streaming scenarios. In this paper, we investigate SSL for streaming multi-talker speech recognition, which generates transcriptions of overlapping speakers in a streaming fashion. We first observe that conventional SSL techniques do not work well on this task due to the poor representation of overlapping speech. We then propose a novel SSL training objective, referred to as bi-label masked speech prediction, which explicitly preserves representations of all speakers in overlapping speech. We investigate various aspects of the proposed system including data configuration and quantizer selection. The proposed SSL setup achieves substantially better word error rates on the LibriSpeechMix dataset.
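The abstract does not spell out the training objective, but the core idea of bi-label masked speech prediction can be illustrated with a minimal sketch: instead of predicting one quantized label per masked frame (as in conventional HuBERT-style masked prediction), the model has two prediction heads, each supervised by a separate label stream, so that both speakers in an overlapped region contribute a target. The function names, tensor shapes, and the simple sum-of-cross-entropies form below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def cross_entropy(logits, labels):
    # Mean softmax cross-entropy over frames.
    # logits: (T, V) unnormalized scores; labels: (T,) target indices.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def bi_label_msp_loss(head_a, head_b, labels_a, labels_b, mask):
    # Illustrative bi-label masked speech prediction loss (assumed form).
    # head_a, head_b: (T, V) logits from two prediction heads.
    # labels_a, labels_b: (T,) quantized labels, one stream per speaker
    #   (e.g. k-means codes of each speaker's non-overlapped signal).
    # mask: boolean (T,) marking the masked frames; as in masked
    #   prediction, the loss is computed only on masked positions.
    idx = np.where(mask)[0]
    return (cross_entropy(head_a[idx], labels_a[idx])
            + cross_entropy(head_b[idx], labels_b[idx]))
```

In single-talker regions the two label streams could simply coincide (or one could be a "silence" code), so the objective reduces to ordinary masked prediction there; how the streams are derived and assigned to heads is a design choice of the actual method.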