Detection of Doctored Speech: Towards an End-to-End Parametric Learn-able Filter Approach

Paper title

Detection of Doctored Speech: Towards an End-to-End Parametric Learn-able Filter Approach

Author

Arora, Rohit

Abstract

Automatic Speaker Verification (ASV) systems have potential in biometric applications for logical access control and authentication, so much is at stake if an ASV system is compromised. Our preliminary work presents a comparative analysis of the state-of-the-art wavelet-based and MFCC-based spoof detection techniques developed in (Novoselov et al., 2016) and (Alam et al., 2016a), respectively. The results on ASVspoof 2015 justify our preference for wavelet-based features over MFCC features. The experiments on the ASVspoof 2019 database show that traditional handcrafted features lack credibility, which gives us further reason to move towards end-to-end (E2E) deep neural networks and more recent techniques. We use the SincNet architecture as our baseline. By replacing the Sinc layer with a Wavelet Scattering layer and a Continuous Wavelet Transform (CWT) layer, we obtain two E2E deep learning models, which we call WSTnet and CWTnet, respectively. The fusion model achieved 62% and 17% relative improvement over the traditional handcrafted models and our SincNet baseline, respectively, when evaluated on the modern spoofing attacks in ASVspoof 2019. However, the final scale distribution and the number of scales used in CWTnet are far from optimal for the task at hand. To address this, we replaced the CWT layer in our CWTnet architecture with a Wavelet Deconvolution (WD) layer (Khan and Yener, 2018). This layer computes the Discrete-Continuous Wavelet Transform, as CWTnet does, but also optimizes the scale parameter using back-propagation. The resulting WDnet model achieved 26% and 7% relative improvement over the CWTnet and SincNet models, respectively, when evaluated on the ASVspoof 2019 dataset. This shows that WDnet extracts more generalized features than CWTnet, because it focuses only on the most important and relevant frequency regions.
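
The idea of a wavelet layer whose scale is learned rather than fixed can be illustrated with a minimal sketch. It is not the WDnet implementation, and several things here are assumptions made for illustration: the Mexican hat (Ricker) wavelet as the mother wavelet, a toy objective (mean output energy) standing in for the spoof-detection loss, a finite-difference gradient standing in for back-propagation, and every function name and hyperparameter.

```python
import math


def ricker(t, scale):
    """Mexican hat (Ricker) wavelet at sample offset t; assumed mother wavelet."""
    x = t / scale
    norm = 2.0 / (math.sqrt(3.0 * scale) * math.pi ** 0.25)
    return norm * (1.0 - x * x) * math.exp(-0.5 * x * x)


def wavelet_filter(scale, half_width=32):
    """Sample the wavelet on a fixed-length window so the scale can vary freely."""
    return [ricker(t, scale) for t in range(-half_width, half_width + 1)]


def filter_valid(signal, kernel):
    """Slide the kernel over the signal, keeping only fully overlapping positions.

    The kernel is symmetric, so correlation and convolution coincide here.
    """
    width = len(kernel)
    return [
        sum(signal[start + k] * kernel[k] for k in range(width))
        for start in range(len(signal) - width + 1)
    ]


def band_energy(signal, scale):
    """Toy objective (assumption): mean squared output of the wavelet filter."""
    response = filter_valid(signal, wavelet_filter(scale))
    return sum(r * r for r in response) / len(response)


def scale_gradient(signal, scale, eps=1e-4):
    """Central finite difference, a stand-in for the back-propagated gradient."""
    upper = band_energy(signal, scale + eps)
    lower = band_energy(signal, scale - eps)
    return (upper - lower) / (2.0 * eps)


def learn_scale(signal, scale, lr=0.05, steps=100):
    """Gradient ascent on the toy objective, treating the scale as a parameter."""
    for _ in range(steps):
        scale = max(0.5, scale + lr * scale_gradient(signal, scale))
    return scale


# A pure tone with a period of 40 samples; the scale should drift towards the
# value at which the wavelet responds most strongly to that frequency.
signal = [math.sin(2.0 * math.pi * n / 40.0) for n in range(200)]
initial_scale = 4.0
learned_scale = learn_scale(signal, initial_scale)
print(f"initial scale {initial_scale:.2f}, learned scale {learned_scale:.2f}")
```

A fixed-scale layer would keep `initial_scale` whatever the data looked like. Updating the scale from a gradient is what lets the filter settle on the frequency region that matters for the objective.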
