End-to-end model for named entity recognition from speech without paired training data

Paper title

End-to-end model for named entity recognition from speech without paired training data

Authors

Salima Mdhaffar, Jarod Duret, Titouan Parcollet, Yannick Estève

Abstract

Recent works showed that end-to-end neural approaches tend to become very popular for spoken language understanding (SLU). Through the term end-to-end, one considers the use of a single model optimized to extract semantic information directly from the speech signal. A major issue for such models is the lack of paired audio and textual data with semantic annotation. In this paper, we propose an approach to build an end-to-end neural model to extract semantic information in a scenario in which zero paired audio data is available. Our approach is based on the use of an external model trained to generate a sequence of vectorial representations from text. These representations mimic the hidden representations that could be generated inside an end-to-end automatic speech recognition (ASR) model by processing a speech signal. An SLU neural module is then trained using these representations as input and the annotated text as output. Last, the SLU module replaces the top layers of the ASR model to achieve the construction of the end-to-end model. Our experiments on named entity recognition, carried out on the QUAERO corpus, show that this approach is very promising, getting better results than a comparable cascade approach or than the use of synthetic voices.
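
The three steps described in the abstract (simulate ASR hidden representations from text, train an SLU module on them, then swap that module in for the top layers of the ASR model) can be pictured with a toy sketch. Everything below is a hypothetical stand-in, not the authors' implementation: `fake_audio`, `asr_lower_layers`, `text_to_hidden`, and `SluModule` are invented names, the "speech signal" is just a list of character codes, and the SLU module is a 1-nearest-neighbour tagger rather than a neural network. In this toy the text-side model reproduces the ASR hidden vectors exactly; a real system would presumably only approximate them.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[float, float]


def fake_audio(word: str) -> List[int]:
    """Toy 'speech signal' (assumption): one integer sample per character."""
    return [ord(c) for c in word.lower()]


def asr_lower_layers(signal: Sequence[int]) -> Vector:
    """Toy lower ASR layers: speech signal -> hidden representation."""
    return (sum(signal) / len(signal), float(len(signal)))


def text_to_hidden(word: str) -> Vector:
    """Toy external model: text -> vector imitating the ASR hidden state."""
    codes = [ord(c) for c in word.lower()]
    return (sum(codes) / len(codes), float(len(codes)))


class SluModule:
    """Toy SLU module: 1-nearest-neighbour tagger over hidden vectors."""

    def __init__(self) -> None:
        self.memory: List[Tuple[Vector, str]] = []

    def train(self, vectors: Sequence[Vector], labels: Sequence[str]) -> None:
        # "Training" here just memorises (vector, label) pairs.
        self.memory = list(zip(vectors, labels))

    def tag(self, vector: Vector) -> str:
        def squared_distance(item: Tuple[Vector, str]) -> float:
            stored, _ = item
            return (stored[0] - vector[0]) ** 2 + (stored[1] - vector[1]) ** 2

        return min(self.memory, key=squared_distance)[1]


# Step 1: only annotated text is available, no paired audio.
annotated_text = [("paris", "LOC"), ("visited", "O"), ("macron", "PER")]

# Step 2: simulate hidden representations from text and train the SLU module.
slu = SluModule()
slu.train(
    [text_to_hidden(word) for word, _ in annotated_text],
    [label for _, label in annotated_text],
)


# Step 3: the SLU module takes the place of the ASR top layers,
# so the combined model maps speech directly to entity labels.
def end_to_end(signals: Sequence[Sequence[int]]) -> List[str]:
    return [slu.tag(asr_lower_layers(signal)) for signal in signals]


predicted = end_to_end([fake_audio(w) for w in ["macron", "visited", "paris"]])
print(predicted)
```

The point of the sketch is the composition in step 3: the SLU module never sees audio during training, yet it can be stacked on the speech-side encoder because both sides produce vectors in the same space.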
