Temporal Modeling Matters: A Novel Temporal Emotional Modeling Approach for Speech Emotion Recognition

Paper Title

Temporal Modeling Matters: A Novel Temporal Emotional Modeling Approach for Speech Emotion Recognition

Authors

Jiaxin Ye, Xin-cheng Wen, Yujie Wei, Yong Xu, Kunhong Liu, Hongming Shan

Abstract

Speech emotion recognition (SER) plays a vital role in improving the interactions between humans and machines by inferring human emotion and affective states from speech signals. Whereas recent works primarily focus on mining spatiotemporal information from hand-crafted features, we explore how to model the temporal patterns of speech emotions from dynamic temporal scales. Towards that goal, we introduce a novel temporal emotional modeling approach for SER, termed Temporal-aware bI-direction Multi-scale Network (TIM-Net), which learns multi-scale contextual affective representations from various time scales. Specifically, TIM-Net first employs temporal-aware blocks to learn temporal affective representation, then integrates complementary information from the past and the future to enrich contextual representations, and finally, fuses multiple time scale features for better adaptation to the emotional variation. Extensive experimental results on six benchmark SER datasets demonstrate the superior performance of TIM-Net, gaining 2.34% and 2.61% improvements of the average UAR and WAR over the second-best on each corpus. The source code is available at https://github.com/Jiaxin-Ye/TIM-Net_SER.
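
The abstract reports gains in UAR and WAR without defining them. As a minimal sketch, assuming the usual SER definitions (UAR as the unweighted mean of per-class recalls, WAR as the overall fraction of correct predictions), the two metrics could be computed as follows. The function names and the toy labels are illustrative only and are not taken from the TIM-Net source code.

```python
from collections import Counter


def war(y_true, y_pred):
    """Weighted average recall: the overall fraction of correct predictions."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def uar(y_true, y_pred):
    """Unweighted average recall: the mean of per-class recalls."""
    support = Counter(y_true)  # number of true samples per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = [hits[label] / count for label, count in support.items()]
    return sum(recalls) / len(recalls)


# Hypothetical, imbalanced toy labels (not data from the paper).
y_true = ["angry", "angry", "angry", "angry", "happy", "sad"]
y_pred = ["angry", "angry", "angry", "sad", "happy", "angry"]

print(f"WAR = {war(y_true, y_pred):.4f}")
print(f"UAR = {uar(y_true, y_pred):.4f}")
```

Because UAR gives every emotion class equal weight, it drops below WAR whenever a rare class is recognized poorly, which is why SER results on imbalanced corpora are commonly reported with both metrics.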
