Paper Title
Sound Event Detection Using Duration Robust Loss Function
Paper Authors
Paper Abstract
Many methods of sound event detection (SED) based on machine learning regard a segmented time frame as one data sample for model training. However, the durations of sound events vary greatly depending on the sound event class, e.g., the sound event ``fan'' has a long duration, while the sound event ``mouse clicking'' is instantaneous. The difference in duration between sound event classes thus causes a serious data imbalance problem in SED. In this paper, we propose a method for SED using a duration robust loss function, which can focus model training on sound events of short duration. In the proposed method, we focus on the relationship between the duration of a sound event and the ease/difficulty of model training. In particular, many sound events of long duration (e.g., the sound event ``fan'') are stationary sounds, which have less variation in their acoustic features, so their model training is easy. Meanwhile, some sound events of short duration (e.g., the sound event ``object impact'') have more than one audio pattern, such as attack, decay, and release parts. We thus apply class-wise reweighting to the binary cross-entropy loss function depending on the ease/difficulty of model training. Evaluation experiments conducted using the TUT Sound Events 2016/2017 and TUT Acoustic Scenes 2016 datasets show that the proposed method improves the detection performance of sound events by 3.15 and 4.37 percentage points in the macro- and micro-F-scores, respectively, compared with a conventional method using the binary cross-entropy loss function.
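The reweighting idea described in the abstract can be sketched as a focal-loss-style modulation of the per-frame binary cross-entropy: frames the model already classifies well (typically stationary, long-duration events) are down-weighted, so training gradient concentrates on hard, short-duration events. This is a minimal illustrative sketch, not the paper's exact formula; the exponent `gamma` and the choice of a per-frame modulating factor are assumptions, since the abstract does not give the weighting scheme.

```python
import numpy as np

def duration_robust_bce(y_true, y_pred, gamma=2.0, eps=1e-7):
    """Binary cross-entropy with a focal-style modulating factor.

    y_true, y_pred: arrays of shape (frames, classes) holding 0/1 targets
    and predicted probabilities per time frame and sound event class.
    gamma: assumed focusing parameter; larger values down-weight easy
    (well-classified, often long stationary) frames more strongly.
    """
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    # p_t is the probability assigned to the correct label of each frame.
    p_t = np.where(y_true == 1, y_pred, 1.0 - y_pred)
    bce = -np.log(p_t)
    # (1 - p_t)^gamma shrinks the loss of frames that are already easy,
    # focusing training on hard, short-duration sound events.
    return float(np.mean((1.0 - p_t) ** gamma * bce))
```

With `gamma = 0` this reduces to plain binary cross-entropy; with `gamma > 0` an easy frame (e.g., predicted probability 0.95 for an active event) contributes far less loss than a hard frame (e.g., 0.55), which is the intended rebalancing effect.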