Title
WaBERT: A Low-resource End-to-end Model for Spoken Language Understanding and Speech-to-BERT Alignment
Authors
Abstract
Historically, lower-level tasks such as automatic speech recognition (ASR) and speaker identification have been the main focus of the speech field. Recently, interest has been growing in higher-level spoken language understanding (SLU) tasks, such as sentiment analysis (SA). However, improving performance on SLU tasks remains a major challenge. Broadly, there are two main approaches to SLU tasks: (1) the two-stage method, which uses a speech model to transcribe speech into text and then applies a language model to obtain results for downstream tasks; and (2) the one-stage method, which simply fine-tunes a pre-trained speech model to fit the downstream tasks. The first approach loses emotional cues such as intonation and introduces recognition errors during the ASR process, while the second lacks the necessary language knowledge. In this paper, we propose Wave BERT (WaBERT), a novel end-to-end model that combines a speech model and a language model for SLU tasks. WaBERT is built on pre-trained speech and language models, so training from scratch is not needed. We also freeze most parameters of WaBERT during training. With WaBERT, audio-specific information and language knowledge are integrated in a short, low-resource training process, improving results on the dev set of the SLUE SA task by 1.15% in recall and 0.82% in F1 score. Additionally, we modify the serial Continuous Integrate-and-Fire (CIF) mechanism to achieve monotonic alignment between the speech and text modalities.
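To make the CIF idea mentioned above concrete, the following is a minimal, hedged sketch of the core integrate-and-fire loop: per-frame weights are accumulated until they cross a firing threshold of 1.0, at which point a token boundary is emitted, yielding a monotonic mapping from speech frames to token positions. The weights below are illustrative stand-ins, not the model's learned values, and the sketch is simplified (the full mechanism splits a firing frame's surplus weight proportionally between adjacent tokens; here the whole frame is assigned to the current token and only the surplus weight is carried forward).

```python
def cif_align(frame_weights, threshold=1.0):
    """Return, for each fired token, the list of frame indices it covers.

    Simplified Continuous Integrate-and-Fire (CIF): accumulate per-frame
    weights; when the accumulator reaches the threshold, "fire" a token
    boundary and carry the surplus weight into the next token.
    """
    segments = []   # one entry per emitted token
    current = []    # frame indices accumulated for the current token
    acc = 0.0       # integrated weight so far
    for t, w in enumerate(frame_weights):
        acc += w
        current.append(t)
        if acc >= threshold:
            segments.append(current)
            current = []
            acc -= threshold  # surplus weight carries over to the next token
    return segments

# Illustrative weights: roughly three tokens' worth of mass over 8 frames.
weights = [0.4, 0.4, 0.3, 0.5, 0.6, 0.2, 0.3, 0.5]
print(cif_align(weights))  # → [[0, 1, 2], [3, 4], [5, 6, 7]]
```

Because each frame is consumed exactly once and in order, the resulting segments are strictly monotonic, which is the property the paper relies on to align the speech encoder's frame sequence with BERT's token sequence.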