Paper Title


Gated Multimodal Fusion with Contrastive Learning for Turn-taking Prediction in Human-robot Dialogue

Authors

Jiudong Yang, Peiying Wang, Yi Zhu, Mingchao Feng, Meng Chen, Xiaodong He

Abstract


Turn-taking, aiming to decide when the next speaker can start talking, is an essential component in building human-robot spoken dialogue systems. Previous studies indicate that multimodal cues can facilitate this challenging task. However, due to the paucity of public multimodal datasets, current methods are mostly limited to either utilizing unimodal features or simplistic multimodal ensemble models. Besides, the inherent class imbalance in real scenarios, e.g., sentences ending with a short pause are mostly regarded as the end of a turn, also poses a great challenge to the turn-taking decision. In this paper, we first collect a large-scale annotated corpus for turn-taking with over 5,000 real human-robot dialogues in speech and text modalities. Then, a novel gated multimodal fusion mechanism is devised to utilize various information seamlessly for turn-taking prediction. More importantly, to tackle the data imbalance issue, we design a simple yet effective data augmentation method to construct negative instances without supervision and apply contrastive learning to obtain better feature representations. Extensive experiments are conducted and the results demonstrate the superiority and competitiveness of our model over several state-of-the-art baselines.
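The abstract does not spell out the fusion architecture, but the general idea of gated multimodal fusion can be sketched as follows: a learned sigmoid gate, conditioned on both modalities, weighs how much the text and speech representations each contribute to the fused turn-taking feature. This is a minimal illustrative sketch, not the paper's actual model; all layer names, dimensions, and the single-gate design are assumptions for illustration.

```python
import torch
import torch.nn as nn

class GatedMultimodalFusion(nn.Module):
    """Illustrative gated fusion of text and speech features.

    Hypothetical sketch: the paper's exact architecture is not given in
    the abstract. A gate z in (0, 1), computed from both modalities,
    interpolates between the projected text and audio representations.
    """

    def __init__(self, text_dim: int, audio_dim: int, hidden_dim: int):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        # Gate conditioned on the concatenation of both raw feature vectors.
        self.gate = nn.Linear(text_dim + audio_dim, hidden_dim)

    def forward(self, text_feat: torch.Tensor, audio_feat: torch.Tensor) -> torch.Tensor:
        h_text = torch.tanh(self.text_proj(text_feat))
        h_audio = torch.tanh(self.audio_proj(audio_feat))
        z = torch.sigmoid(self.gate(torch.cat([text_feat, audio_feat], dim=-1)))
        # Element-wise convex combination of the two modality representations.
        return z * h_text + (1 - z) * h_audio

# Usage with made-up dimensions (e.g., BERT-sized text, frame-level audio stats).
fusion = GatedMultimodalFusion(text_dim=768, audio_dim=128, hidden_dim=256)
fused = fusion(torch.randn(4, 768), torch.randn(4, 128))
print(tuple(fused.shape))
```

The fused vector would then feed a binary classifier (end-of-turn vs. not), and the contrastive objective described in the abstract would pull representations of augmented negative instances apart from genuine end-of-turn examples.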
