Paper Title
Data Augmentation for Spoken Language Understanding via Pretrained Language Models
Paper Authors
Abstract
The training of spoken language understanding (SLU) models often faces the problem of data scarcity. In this paper, we put forward a data augmentation method using pretrained language models to boost the variability and accuracy of generated utterances. Furthermore, we investigate and propose solutions to two previously overlooked semi-supervised learning scenarios of data scarcity in SLU: i) Rich-in-Ontology: ontology information with numerous valid dialogue acts is given; ii) Rich-in-Utterance: a large number of unlabelled utterances are available. Empirical results show that our method can produce synthetic training data that boosts the performance of language understanding models in various scenarios.
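The core augmentation idea in the abstract, conditioning a pretrained language model on a dialogue act so it generates a synthetic training utterance, can be sketched as follows. This is an illustrative outline only, not the authors' implementation: the act serialization format and the `generate_utterance` stub are assumptions, and a real system would replace the stub with a fine-tuned pretrained LM (e.g. a GPT-style decoder).

```python
# Illustrative sketch of LM-based data augmentation for SLU (not the paper's code).
# A dialogue act (intent + slot-value pairs) is serialized into a text prompt;
# a pretrained LM conditioned on that prompt would then produce a natural
# utterance, which is paired with the act to form a new synthetic example.

def serialize_act(intent, slots):
    """Flatten a dialogue act into a prompt string for a conditional LM."""
    slot_str = " ; ".join(f"{k} = {v}" for k, v in slots.items())
    return f"intent: {intent} | slots: {slot_str} | utterance:"

def generate_utterance(prompt, lm=None):
    """Placeholder for a pretrained-LM call (hypothetical `lm` callable).
    The offline fallback below is for demonstration only."""
    if lm is not None:
        return lm(prompt)  # assumed callable wrapping the real model
    intent = prompt.split("|")[0].split(":")[1].strip()
    return f"(synthetic utterance for intent '{intent}')"

# Build one synthetic (utterance, act) pair for the labelled training set.
act = ("book_restaurant", {"cuisine": "italian", "time": "7pm"})
prompt = serialize_act(*act)
print(prompt)
print(generate_utterance(prompt))
```

In the Rich-in-Ontology scenario described above, prompts like this could be built from every valid dialogue act in the ontology; in the Rich-in-Utterance scenario, the direction would instead be reversed (labelling unlabelled utterances).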