Paper Title
Sound Event Detection Using Duration Robust Loss Function
Paper Authors
Paper Abstract
Many methods of sound event detection (SED) based on machine learning regard a segmented time frame as one data sample for model training. However, the durations of sound events vary greatly depending on the sound event class, e.g., the sound event ``fan'' has a long duration, while the sound event ``mouse clicking'' is instantaneous. The difference in duration between sound event classes thus causes a serious data imbalance problem in SED. In this paper, we propose a method for SED using a duration robust loss function, which can focus model training on sound events of short duration. In the proposed method, we focus on the relationship between the duration of a sound event and the ease/difficulty of model training. In particular, many sound events of long duration (e.g., the sound event ``fan'') are stationary sounds, which have less variation in their acoustic features, so their model training is easy. Meanwhile, some sound events of short duration (e.g., the sound event ``object impact'') have more than one audio pattern, such as attack, decay, and release parts. We thus apply class-wise reweighting to the binary cross-entropy loss function depending on the ease/difficulty of model training. Evaluation experiments conducted using the TUT Sound Events 2016/2017 and TUT Acoustic Scenes 2016 datasets show that the proposed method improves the detection performance of sound events by 3.15 and 4.37 percentage points in the macro- and micro-F-scores, respectively, compared with a conventional method using the binary cross-entropy loss function.
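The reweighting idea described in the abstract can be sketched as a focal-loss-style modulation of the per-frame binary cross-entropy: frames the model already classifies well (typically stationary, long-duration events) are down-weighted, so training gradient concentrates on hard, short-duration events. This is a minimal illustrative sketch, not the paper's exact formula; the exponent `gamma` and the choice of a per-frame modulating factor are assumptions, since the abstract does not give the weighting scheme.

```python
import numpy as np

def duration_robust_bce(y_true, y_pred, gamma=2.0, eps=1e-7):
    """Binary cross-entropy with a focal-style modulating factor.

    y_true, y_pred: arrays of shape (frames, classes) holding 0/1 targets
    and predicted probabilities per time frame and sound event class.
    gamma: assumed focusing parameter; larger values down-weight easy
    (well-classified, often long stationary) frames more strongly.
    """
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    # p_t is the probability assigned to the correct label of each frame.
    p_t = np.where(y_true == 1, y_pred, 1.0 - y_pred)
    bce = -np.log(p_t)
    # (1 - p_t)^gamma shrinks the loss of frames that are already easy,
    # focusing training on hard, short-duration sound events.
    return float(np.mean((1.0 - p_t) ** gamma * bce))
```

With `gamma = 0` this reduces to plain binary cross-entropy; with `gamma > 0` an easy frame (e.g., predicted probability 0.95 for an active event) contributes far less loss than a hard frame (e.g., 0.55), which is the intended rebalancing effect.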