Paper Title

Robust Self-Supervised Audio-Visual Speech Recognition

Paper Authors

Bowen Shi, Wei-Ning Hsu, Abdelrahman Mohamed

Paper Abstract

Audio-based automatic speech recognition (ASR) degrades significantly in noisy environments and is particularly vulnerable to interfering speech, as the model cannot determine which speaker to transcribe. Audio-visual speech recognition (AVSR) systems improve robustness by complementing the audio stream with the visual information that is invariant to noise and helps the model focus on the desired speaker. However, previous AVSR work focused solely on the supervised learning setup; hence the progress was hindered by the amount of labeled data available. In this work, we present a self-supervised AVSR framework built upon Audio-Visual HuBERT (AV-HuBERT), a state-of-the-art audio-visual speech representation learning model. On the largest available AVSR benchmark dataset LRS3, our approach outperforms prior state-of-the-art by ~50% (28.0% vs. 14.1%) using less than 10% of labeled data (433hr vs. 30hr) in the presence of babble noise, while reducing the WER of an audio-based model by over 75% (25.8% vs. 5.8%) on average.
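The babble-noise test condition mentioned in the abstract is typically simulated by scaling a noise recording to a chosen signal-to-noise ratio (SNR) and adding it to the clean speech waveform. The Python sketch below illustrates that generic setup only; the mix_at_snr helper and the synthetic signals are hypothetical and are not the paper's actual evaluation pipeline.

import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    # Repeat or trim the noise so it covers the whole utterance.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]
    # Scale the noise so that 10 * log10(P_speech / P_noise) equals snr_db.
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

# Example: mix 1 second of synthetic 16 kHz "speech" with "babble" at 0 dB SNR.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000).astype(np.float32)
babble = rng.standard_normal(16000).astype(np.float32)
noisy = mix_at_snr(speech, babble, snr_db=0.0)

At 0 dB SNR the noise carries as much energy as the speech, which is the kind of heavily degraded condition in which the visual stream is reported to help most.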
