Paper Title
Emotional Video to Audio Transformation Using Deep Recurrent Neural Networks and a Neuro-Fuzzy System
Paper Authors
Paper Abstract
Generating music whose emotion is similar to that of an input video is a highly relevant problem nowadays. Video content creators and automatic movie directors benefit from keeping their viewers engaged, which can be facilitated by producing novel material that elicits stronger emotions in them. Moreover, there is currently a demand for more empathetic computers to aid humans in applications such as augmenting the perception ability of visually and/or hearing impaired people. Current approaches overlook the video's emotional characteristics in the music generation step, consider only static images instead of videos, are unable to generate novel music, and require a high level of human effort and skill. In this study, we propose a novel hybrid deep neural network that uses an Adaptive Neuro-Fuzzy Inference System (ANFIS) to predict a video's emotion from its visual features and a deep Long Short-Term Memory (LSTM) Recurrent Neural Network to generate the corresponding audio signals with a similar emotional tone. The former is able to model emotions appropriately owing to its fuzzy properties, and the latter models data with dynamic temporal properties well owing to the availability of previous hidden-state information. The novelty of our proposed method lies in extracting visual emotional features and transforming them into audio signals with corresponding emotional aspects for users. Quantitative experiments show low mean absolute errors of 0.217 and 0.255 on the Lindsey and DEAP datasets, respectively, and similar global features in the spectrograms. This indicates that our model can appropriately perform domain transformation between visual and audio features. Based on the experimental results, our model effectively generates audio that matches the scene and elicits a similar emotion from viewers in both datasets, and the music generated by our model is also chosen more often by viewers.
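The abstract describes a two-stage architecture (a neuro-fuzzy emotion predictor feeding a deep LSTM audio generator) but gives no implementation details. Below is a minimal, hypothetical PyTorch sketch of such a pipeline, not the authors' code: the ANFIS stage is approximated by a small feed-forward stand-in, and all module names, feature dimensions, and the use of spectrogram-frame targets with an L1 (mean absolute error) loss are assumptions made only for illustration.

```python
# Hypothetical sketch of an emotion-conditioned audio generator (not the
# authors' implementation). Assumes per-video visual features, a scalar
# arousal score in [0, 1], and audio represented as spectrogram frames.
import torch
import torch.nn as nn


class EmotionPredictor(nn.Module):
    """Stand-in for the ANFIS stage: maps visual features to a scalar
    emotional arousal score in [0, 1]."""
    def __init__(self, visual_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(visual_dim, 32), nn.Tanh(),
            nn.Linear(32, 1), nn.Sigmoid(),
        )

    def forward(self, visual_feats):            # (batch, visual_dim)
        return self.net(visual_feats)           # (batch, 1)


class AudioGenerator(nn.Module):
    """Deep LSTM that predicts the next audio feature frame, conditioned
    on the predicted emotion score at every time step."""
    def __init__(self, audio_dim: int, hidden: int = 128, layers: int = 2):
        super().__init__()
        self.lstm = nn.LSTM(audio_dim + 1, hidden,
                            num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden, audio_dim)

    def forward(self, prev_audio, emotion):     # (B, T, audio_dim), (B, 1)
        # Broadcast the emotion score across all time steps and append it
        # to each audio frame before feeding the sequence to the LSTM.
        cond = emotion.unsqueeze(1).expand(-1, prev_audio.size(1), -1)
        x = torch.cat([prev_audio, cond], dim=-1)
        h, _ = self.lstm(x)
        return self.out(h)                      # next-frame predictions


# Toy usage with random tensors standing in for real features.
visual = torch.randn(4, 512)                    # e.g. CNN features per video
audio = torch.randn(4, 100, 128)                # e.g. 100 spectrogram frames
emotion = EmotionPredictor(512)(visual)
pred = AudioGenerator(128)(audio, emotion)
# L1 loss on shifted frames, mirroring the MAE metric quoted in the abstract.
loss = nn.L1Loss()(pred[:, :-1], audio[:, 1:])
```

Conditioning every LSTM time step on the emotion score is one simple way to bias the generated audio toward the video's predicted arousal; the paper's actual conditioning and training scheme may differ.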