Paper Title
AVLnet: Learning Audio-Visual Language Representations from Instructional Videos
Paper Authors
Paper Abstract
Current methods for learning visually grounded language from videos often rely on text annotation, such as human-generated captions or machine-generated automatic speech recognition (ASR) transcripts. In this work, we introduce the Audio-Video Language Network (AVLnet), a self-supervised network that learns a shared audio-visual embedding space directly from raw video inputs. To circumvent the need for text annotation, we learn audio-visual representations from randomly segmented video clips and their raw audio waveforms. We train AVLnet on HowTo100M, a large corpus of publicly available instructional videos, and evaluate on image retrieval and video retrieval tasks, achieving state-of-the-art performance. We perform an analysis of AVLnet's learned representations, showing our model utilizes speech and natural sounds to learn audio-visual concepts. Further, we propose a tri-modal model that jointly processes raw audio, video, and text captions from videos to learn a multi-modal semantic embedding space useful for text-video retrieval. Our code, data, and trained models will be released at avlnet.csail.mit.edu.
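To make the idea of a shared audio-visual embedding space concrete, the sketch below shows a minimal dual-encoder that projects pooled audio and video features into a common space and trains them with a symmetric contrastive objective, where co-occurring audio and video from the same randomly segmented clip form the positive pair. This is an illustrative sketch only, not the authors' released code: the module names (DualEncoder, audio_proj, video_proj), feature dimensions, and the InfoNCE-style loss are assumptions for exposition, and AVLnet's actual towers and training objective may differ.

```python
# Illustrative sketch (assumed architecture and loss, not AVLnet's exact code):
# a dual-encoder mapping audio and video features into a shared embedding space,
# trained so that audio/video from the same clip are close and mismatched pairs are far.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    def __init__(self, audio_dim=4096, video_dim=4096, embed_dim=1024):
        super().__init__()
        # Placeholder projection heads; the real model uses learned audio/visual towers.
        self.audio_proj = nn.Sequential(nn.Linear(audio_dim, embed_dim), nn.ReLU(),
                                        nn.Linear(embed_dim, embed_dim))
        self.video_proj = nn.Sequential(nn.Linear(video_dim, embed_dim), nn.ReLU(),
                                        nn.Linear(embed_dim, embed_dim))

    def forward(self, audio_feats, video_feats):
        a = F.normalize(self.audio_proj(audio_feats), dim=-1)
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        return a, v

def contrastive_loss(a, v, temperature=0.07):
    # Audio clip i and video clip i come from the same randomly segmented video
    # segment, so the diagonal of the similarity matrix holds the positive pairs.
    logits = a @ v.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Usage with random stand-in features for a batch of 8 clips.
model = DualEncoder()
audio = torch.randn(8, 4096)   # pooled audio-tower features (placeholder)
video = torch.randn(8, 4096)   # pooled visual features (placeholder)
a, v = model(audio, video)
loss = contrastive_loss(a, v)
loss.backward()
```

Under the same assumptions, the tri-modal variant described in the abstract could add a third projection head for text captions and apply the same pairwise contrastive objective across all modality pairs; retrieval then reduces to nearest-neighbor search in the shared embedding space.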