Waveform-based Voice Activity Detection Exploiting Fully Convolutional networks with Multi-Branched Encoders

Paper title

Waveform-based Voice Activity Detection Exploiting Fully Convolutional networks with Multi-Branched Encoders

Authors

Cheng Yu, Kuo-Hsuan Hung, I-Fan Lin, Szu-Wei Fu, Yu Tsao, Jeih-weih Hung

Abstract

In this study, we propose an encoder-decoder system built on fully convolutional networks that performs voice activity detection (VAD) directly on the time-domain waveform. The proposed system processes the input waveform and classifies each of its segments as speech or non-speech. This novel waveform-based VAD algorithm, abbreviated "WVAD", has two main distinguishing features. First, compared with most conventional VAD systems, which use spectral features, the raw waveforms used in WVAD contain more comprehensive information and are therefore expected to support more accurate speech/non-speech predictions. Second, thanks to its multi-branched architecture, WVAD can be extended with an ensemble of encoders, referred to as WEVAD, that incorporates multiple kinds of attribute information in utterances and can thus yield better VAD performance under specified acoustic conditions. We evaluated WVAD and WEVAD on the VAD task with two datasets. First, experiments on AURORA2 show that WVAD outperforms many state-of-the-art VAD algorithms. Second, the TMHINT task confirms that, by combining multiple attributes in utterances, WEVAD performs even better than WVAD.
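As a rough illustration of what labelling waveform segments as speech or non-speech can look like downstream of such a model, the sketch below groups per-sample speech scores into contiguous runs. It is not the authors' method: the function name `label_segments`, the 0.5 threshold, and the assumption that the model emits one score per sample are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def label_segments(
    scores: List[float], threshold: float = 0.5
) -> List[Tuple[int, int, str]]:
    """Group per-sample speech scores into (start, end, label) runs.

    `end` is exclusive. The threshold and the per-sample score format are
    illustrative assumptions, not details taken from the paper.
    """
    segments: List[Tuple[int, int, str]] = []
    start = 0
    for i in range(1, len(scores) + 1):
        # Close the current run at the end of the input or when the
        # speech/non-speech decision flips relative to the run's first sample.
        if i == len(scores) or (scores[i] >= threshold) != (scores[start] >= threshold):
            label = "speech" if scores[start] >= threshold else "non-speech"
            segments.append((start, i, label))
            start = i
    return segments


if __name__ == "__main__":
    example_scores = [0.1, 0.2, 0.9, 0.8, 0.7, 0.3]
    for seg in label_segments(example_scores):
        print(seg)
```

In practice a VAD pipeline would usually also smooth the scores or merge very short runs before reporting segments; that step is omitted here to keep the sketch minimal.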
