Deep multi-metric learning for text-independent speaker verification

Paper title

Deep multi-metric learning for text-independent speaker verification

Authors

Jiwei Xu, Xinggang Wang, Bin Feng, Wenyu Liu

Abstract

Text-independent speaker verification is an important artificial intelligence problem with a wide spectrum of applications, such as criminal investigation, payment certification, and interest-based customer services. Its purpose is to determine whether two given uncontrolled utterances originate from the same speaker. Extracting speech features for each speaker using deep neural networks is a promising direction to explore, and a straightforward solution is to train the discriminative feature extraction network with a metric learning loss function. However, a single loss function often has certain limitations. We therefore use deep multi-metric learning to address the problem and introduce three different losses: triplet loss, n-pair loss, and angular loss. The three loss functions work cooperatively to train a feature extraction network equipped with residual connections and squeeze-and-excitation attention. We conduct experiments on the large-scale VoxCeleb2 dataset, which contains over a million utterances from over 6,000 speakers, and the proposed deep neural network obtains an equal error rate of 3.48%, which is a very competitive result. Code for both training and testing, along with pretrained models, is available at https://github.com/GreatJiweix/DmmlTiSV, which is the first publicly available code repository for large-scale text-independent speaker verification with performance on par with state-of-the-art systems.
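
To make the idea of three cooperating losses concrete, the following is a minimal NumPy sketch that uses common textbook forms of the triplet, n-pair, and angular losses and sums them with weights. It is not taken from the authors' repository: the function names, the margin of 0.2, the 45-degree angle bound, the equal weights, and the use of other speakers' positives as n-pair negatives are all assumptions made for illustration.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on squared distances; the margin value is an assumption."""
    d_ap = np.sum((anchor - positive) ** 2, axis=1)
    d_an = np.sum((anchor - negative) ** 2, axis=1)
    return np.mean(np.maximum(d_ap - d_an + margin, 0.0))


def n_pair_loss(anchor, positive):
    """N-pair loss, assuming row i of each array belongs to a distinct speaker.

    positive[j] for j != i acts as a negative for anchor[i]. The loss
    log(1 + sum_{j != i} exp(s_ij - s_ii)) equals softmax cross-entropy
    with target class i, which is how it is computed here.
    """
    logits = anchor @ positive.T
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.sum(np.exp(logits), axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))


def angular_loss(anchor, positive, negative, alpha_deg=45.0):
    """Angular loss with an assumed upper bound alpha on the angle at the negative."""
    tan_sq = np.tan(np.deg2rad(alpha_deg)) ** 2
    center = (anchor + positive) / 2.0
    d_ap = np.sum((anchor - positive) ** 2, axis=1)
    d_nc = np.sum((negative - center) ** 2, axis=1)
    return np.mean(np.maximum(d_ap - 4.0 * tan_sq * d_nc, 0.0))


def multi_metric_loss(anchor, positive, negative, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three losses; equal weights are an assumption."""
    w_triplet, w_npair, w_angular = weights
    return (
        w_triplet * triplet_loss(anchor, positive, negative)
        + w_npair * n_pair_loss(anchor, positive)
        + w_angular * angular_loss(anchor, positive, negative)
    )


# Toy embeddings: three speakers on orthogonal axes.
anchors = np.eye(3)
positives = np.eye(3)
negatives = np.roll(np.eye(3), 1, axis=0)  # each negative is another speaker
total = multi_metric_loss(anchors, positives, negatives)
```

In this toy batch the triplet and angular terms are already satisfied and contribute zero, so the total comes entirely from the n-pair term. In a real training loop the three terms would be computed on network embeddings in an automatic differentiation framework rather than on fixed arrays.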
