Paper Title
SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding
Paper Authors
Paper Abstract
Spoken language understanding (SLU) requires a model to analyze an input acoustic signal to understand its linguistic content and make predictions. To boost the models' performance, various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text. However, the inherent disparities between the two modalities necessitate a mutual analysis. In this paper, we propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules. Besides conducting a self-supervised masked language modeling task on the two individual modules using unpaired speech and text, SPLAT aligns representations from the two modules in a shared latent space using a small amount of paired speech and text. Thus, during fine-tuning, the speech module alone can produce representations carrying both acoustic information and contextual semantic knowledge of an input acoustic signal. Experimental results verify the effectiveness of our approach on various SLU tasks. For example, SPLAT improves the previous state-of-the-art performance on the Spoken SQuAD dataset by more than 10%.
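To make the described setup concrete, below is a minimal PyTorch sketch of a two-module arrangement like the one the abstract outlines: a speech encoder and a text encoder, each carrying its own masked-prediction head, plus an alignment loss computed on paired speech-text examples in the shared latent space. All class names, dimensions, and the specific alignment objective (mean-pooled L2 distance) are illustrative assumptions; the paper's actual architecture and loss details are not given in this abstract.

```python
import torch
import torch.nn as nn

class JointPreTrainingSketch(nn.Module):
    """Hypothetical skeleton of the two-module setup the abstract describes."""

    def __init__(self, d_model=768, vocab_size=30522, n_fbank=80):
        super().__init__()
        # Speech module: project filterbank frames to model dim, then encode.
        self.speech_proj = nn.Linear(n_fbank, d_model)
        speech_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.speech_encoder = nn.TransformerEncoder(speech_layer, num_layers=4)
        # Language module: embed tokens, then encode.
        self.text_embed = nn.Embedding(vocab_size, d_model)
        text_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(text_layer, num_layers=4)
        # Heads for the self-supervised masked tasks on each module.
        self.mlm_head = nn.Linear(d_model, vocab_size)  # predict masked tokens (text)
        self.frame_head = nn.Linear(d_model, n_fbank)   # reconstruct masked frames (speech)

    def forward(self, fbank, token_ids):
        # fbank: (batch, frames, n_fbank); token_ids: (batch, tokens)
        speech_repr = self.speech_encoder(self.speech_proj(fbank))
        text_repr = self.text_encoder(self.text_embed(token_ids))
        return speech_repr, text_repr

def alignment_loss(speech_repr, text_repr):
    # One simple alignment choice (an assumption, not necessarily SPLAT's
    # exact objective): pull the mean-pooled representations of a paired
    # utterance and its transcript together in the shared latent space.
    return torch.mean((speech_repr.mean(dim=1) - text_repr.mean(dim=1)) ** 2)

# Usage on a toy paired batch: 2 utterances of 200 frames with 32-token transcripts.
model = JointPreTrainingSketch()
speech_repr, text_repr = model(torch.randn(2, 200, 80),
                               torch.randint(0, 30522, (2, 32)))
loss = alignment_loss(speech_repr, text_repr)
```

In this sketch the masked-prediction heads would be trained on large unpaired corpora, while `alignment_loss` is applied only to the small paired subset, matching the semi-supervised split the abstract describes.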