Paper Title
Whose Emotion Matters? Speaking Activity Localisation without Prior Knowledge
Paper Authors
Paper Abstract
The task of emotion recognition in conversations (ERC) benefits from the availability of multiple modalities, as provided, for example, in the video-based Multimodal EmotionLines Dataset (MELD). However, only a few research approaches use both the acoustic and visual information from the MELD videos. There are two reasons for this: first, the label-to-video alignments in MELD are noisy, making those videos an unreliable source of emotional speech data; second, conversations can involve several people in the same scene, which requires localising the source of the utterance. In this paper, we introduce MELD with Fixed Audiovisual Information via Realignment (MELD-FAIR). By using recent active speaker detection and automatic speech recognition models, we are able to realign the videos of MELD and capture the facial expressions of the speakers in 96.92% of the utterances provided in MELD. Experiments with a self-supervised voice recognition model indicate that the realigned MELD-FAIR videos match the transcribed utterances given in the MELD dataset more closely. Finally, we devise a model for emotion recognition in conversations trained on the realigned MELD-FAIR videos, which outperforms state-of-the-art models for ERC based on vision alone. This indicates that localising the source of speaking activity is indeed effective for extracting facial expressions of the uttering speakers, and that faces provide more informative visual cues than the visual features state-of-the-art models have used so far. The MELD-FAIR realignment data, together with the code for the realignment procedure and for emotion recognition, are available at https://github.com/knowledgetechnologyuhh/MELD-FAIR.
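The abstract reports that a self-supervised speech recognition model was used to check how well the realigned audio matches the MELD transcripts. The snippet below is a minimal sketch of that kind of check, not the authors' exact pipeline: it assumes a pretrained wav2vec 2.0 CTC model from Hugging Face, the `jiwer` package for word error rate, and illustrative file paths and transcript text that are not part of the released MELD-FAIR data.

```python
# Sketch: compare an ASR transcription of a realigned utterance clip against
# the MELD transcript via word error rate. Model choice, paths, and the
# reference sentence are assumptions for illustration only.
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
from jiwer import wer

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
model.eval()


def transcribe(wav_path: str) -> str:
    """Transcribe a mono 16 kHz clip with a self-supervised ASR model."""
    waveform, sr = torchaudio.load(wav_path)
    if sr != 16_000:
        waveform = torchaudio.functional.resample(waveform, sr, 16_000)
    inputs = processor(waveform.squeeze(0), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]


# Hypothetical realigned clip and its MELD transcript (illustrative values).
hypothesis = transcribe("meld_fair/dev/dia0_utt3.wav")
reference = "I remember when I first came to this city."
print("WER:", wer(reference.lower(), hypothesis.lower()))
```

A lower word error rate on the realigned clips than on the original MELD segments would be the kind of evidence the abstract refers to when it says the MELD-FAIR videos match the transcribed utterances more closely.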