Paper Title

Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision

Authors

Abhinav Shukla, Stavros Petridis, Maja Pantic

Abstract

The intuitive interaction between the audio and visual modalities is valuable for cross-modal self-supervised learning. This concept has been demonstrated for generic audiovisual tasks like video action recognition and acoustic scene classification. However, self-supervision remains under-explored for audiovisual speech. We propose a method to learn self-supervised speech representations from the raw audio waveform. We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from audio). The visual pretext task drives the audio representations to capture information related to lip movements. This enriches the audio encoder with visual information and the encoder can be used for evaluation without the visual modality. Our method attains competitive performance with respect to existing self-supervised audio features on established isolated word classification benchmarks, and significantly outperforms other methods at learning from fewer labels. Notably, our method also outperforms fully supervised training, thus providing a strong initialization for speech related tasks. Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
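The following PyTorch sketch illustrates the training scheme the abstract describes: a shared raw-audio encoder is optimized with two pretext heads, one regressing informative audio attributes and one generating talking-face frames, with the two losses summed. All concrete choices here (layer sizes, the four placeholder attributes, 32x32 face frames, equal loss weighting, and the module names) are illustrative assumptions rather than details taken from the paper.

# A minimal sketch (assumed architecture, not the authors' code) of the joint
# pretext training described in the abstract: one raw-audio encoder shared by
# an audio-attribute regression head and a talking-face generation head.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RawAudioEncoder(nn.Module):
    """1D-convolutional encoder over raw waveforms (layer sizes are assumptions)."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=80, stride=4), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv1d(128, feat_dim, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
    def forward(self, wav):                  # wav: (batch, 1, samples)
        return self.net(wav).squeeze(-1)     # -> (batch, feat_dim)

class AudioAttributeHead(nn.Module):
    """Audio-only pretext: regress informative audio attributes."""
    def __init__(self, feat_dim=256, n_attrs=4):
        super().__init__()
        self.fc = nn.Linear(feat_dim, n_attrs)
    def forward(self, z):
        return self.fc(z)

class TalkingFaceDecoder(nn.Module):
    """Visual pretext: generate a face frame from the audio embedding.
    (A full talking-face model would also condition on an identity image.)"""
    def __init__(self, feat_dim=256, img_size=32):
        super().__init__()
        self.img_size = img_size
        self.fc = nn.Linear(feat_dim, 3 * img_size * img_size)
    def forward(self, z):
        img = torch.sigmoid(self.fc(z))
        return img.view(-1, 3, self.img_size, self.img_size)

encoder, attr_head, face_dec = RawAudioEncoder(), AudioAttributeHead(), TalkingFaceDecoder()
params = list(encoder.parameters()) + list(attr_head.parameters()) + list(face_dec.parameters())
opt = torch.optim.Adam(params, lr=1e-4)

# Dummy batch: 1-second 16 kHz waveforms with placeholder pretext targets.
wav = torch.randn(8, 1, 16000)
attr_targets = torch.randn(8, 4)            # e.g. energy/pitch-like attributes (illustrative)
face_targets = torch.rand(8, 3, 32, 32)     # target talking-face frames (illustrative)

z = encoder(wav)
loss = F.mse_loss(attr_head(z), attr_targets) + F.l1_loss(face_dec(z), face_targets)
opt.zero_grad()
loss.backward()
opt.step()

# After pretext training, only `encoder` is kept and fine-tuned or evaluated
# on downstream speech tasks; the visual branch is discarded.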
