Paper Title
Self-Supervised Speech Representation Learning: A Review
Paper Authors
Abstract
Although supervised deep learning has revolutionized speech and audio processing, it requires building specialist models for individual tasks and application scenarios. It is likewise difficult to apply to dialects and languages for which only limited labeled data is available. Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains. Such methods have shown success in natural language processing and computer vision, achieving new levels of performance while reducing the number of labels required for many downstream scenarios. Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods. Other approaches rely on multi-modal data for pre-training, mixing text or visual data streams with speech. Although self-supervised speech representation learning is still a nascent research area, it is closely related to acoustic word embeddings and to learning with zero lexical resources, both of which have seen active research for many years. This review presents approaches for self-supervised speech representation learning and their connections to other research areas. Since many current methods focus solely on automatic speech recognition as a downstream task, we also review recent efforts to benchmark learned representations so as to extend their application beyond speech recognition.