Paper Title

Attention Driven Fusion for Multi-Modal Emotion Recognition

Authors

Darshana Priyasad, Tharindu Fernando, Simon Denman, Clinton Fookes, Sridha Sridharan

Abstract

Deep learning has emerged as a powerful alternative to hand-crafted methods for emotion recognition on combined acoustic and text modalities. Baseline systems model emotion information in the text and acoustic modes independently using Deep Convolutional Neural Networks (DCNN) and Recurrent Neural Networks (RNN), followed by attention, fusion, and classification. In this paper, we present a deep learning-based approach to exploit and fuse text and acoustic data for emotion classification. We utilize a SincNet layer, based on parameterized sinc functions with band-pass filters, to extract acoustic features from raw audio, followed by a DCNN. This approach learns filter banks tuned for emotion recognition and provides more effective features compared to directly applying convolutions over the raw speech signal. For text processing, we use two branches in parallel (a DCNN, and a Bi-directional RNN followed by a DCNN), where cross attention is introduced to infer N-gram-level correlations from the hidden representations produced by the Bi-RNN. Following existing state-of-the-art work, we evaluate the performance of the proposed system on the IEMOCAP dataset. Experimental results indicate that the proposed system outperforms existing methods, achieving a 3.5% improvement in weighted accuracy.
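The acoustic front end described in the abstract is a SincNet layer whose convolution kernels are constrained to parameterized band-pass sinc filters, so only each filter's low cut-off frequency and bandwidth are learned. The following is a minimal PyTorch sketch of that idea, not the authors' implementation; the filter count, kernel size, sample rate, and initialisation values are illustrative assumptions rather than settings reported in the paper.

```python
import torch
import torch.nn as nn


class SincConv1d(nn.Module):
    """1-D convolution whose kernels are parameterized band-pass sinc filters.

    Only each filter's low cut-off frequency and bandwidth are learned, so the
    layer learns a filter bank rather than free-form kernels.
    """

    def __init__(self, out_channels=64, kernel_size=101, sample_rate=16000):
        super().__init__()
        if kernel_size % 2 == 0:
            kernel_size += 1  # keep the kernel symmetric around its centre
        self.kernel_size = kernel_size
        self.sample_rate = sample_rate

        # Initialise cut-offs spread over the usable band (illustrative choice).
        low_hz = torch.linspace(30.0, sample_rate / 2 - 200.0, out_channels)
        band_hz = torch.full((out_channels,), 100.0)
        self.low_hz = nn.Parameter(low_hz.unsqueeze(1))    # learnable f1 (Hz)
        self.band_hz = nn.Parameter(band_hz.unsqueeze(1))  # learnable f2 - f1 (Hz)

        # Fixed pieces: time axis in seconds and a Hamming window.
        n = torch.arange(-(kernel_size // 2), kernel_size // 2 + 1).float()
        self.register_buffer("t", n / sample_rate)
        self.register_buffer("window", torch.hamming_window(kernel_size))

    def forward(self, x):  # x: (batch, 1, time)
        low = torch.abs(self.low_hz)
        high = torch.clamp(low + torch.abs(self.band_hz), max=self.sample_rate / 2)
        # Band-pass impulse response as the difference of two sinc low-pass
        # filters (the SincNet parameterisation), then windowed and normalised.
        t = self.t.unsqueeze(0)  # (1, kernel_size)
        band_pass = 2 * high * torch.sinc(2 * high * t) - 2 * low * torch.sinc(2 * low * t)
        band_pass = band_pass * self.window / (2 * (high - low))
        filters = band_pass.view(-1, 1, self.kernel_size)
        return nn.functional.conv1d(x, filters, padding=self.kernel_size // 2)


if __name__ == "__main__":
    # One second of raw 16 kHz audio -> 64 learned band-pass responses,
    # ready to be fed into a DCNN as the abstract describes.
    layer = SincConv1d()
    audio = torch.randn(2, 1, 16000)
    print(layer(audio).shape)  # torch.Size([2, 64, 16000])
```

Because each kernel is defined by only two scalars, the layer learns an interpretable filter bank tuned by the downstream emotion loss, which is the property the abstract credits for producing more effective features than free-form convolutions applied directly to the raw speech signal.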
