Paper Title

Multi-Modal Music Information Retrieval: Augmenting Audio-Analysis with Visual Computing for Improved Music Video Analysis

Authors

Schindler, Alexander

Abstract

This thesis combines audio analysis with computer vision to approach Music Information Retrieval (MIR) tasks from a multi-modal perspective. It focuses on the information provided by the visual layer of music videos and on how this information can be harnessed to augment and improve tasks of the MIR research domain. The main hypothesis of this work is based on the observation that certain expressive categories, such as genre or theme, can be recognized from the visual content alone, without the sound being heard. This leads to the hypothesis that there exists a visual language used to express mood or genre. As a further consequence, it can be concluded that this visual information is music-related and should therefore be beneficial for corresponding MIR tasks such as music genre classification or mood recognition. A series of comprehensive experiments and evaluations is conducted, focusing on the extraction of visual information and its application to different MIR tasks. A custom dataset is created, suitable for developing and testing visual features capable of representing music-related information. Evaluations range from low-level visual features to high-level concepts retrieved by means of Deep Convolutional Neural Networks. Additionally, new visual features are introduced that capture rhythmic visual patterns. In all of these experiments, the audio-based results serve as the benchmark for the visual and audio-visual approaches. The experiments are conducted for three MIR tasks: Artist Identification, Music Genre Classification, and Cross-Genre Classification. The experiments show that an audio-visual approach harnessing high-level semantic information gained from visual concept detection outperforms audio-only genre-classification accuracy by 16.43%.
