Paper title
Improving speaker discrimination of target speech extraction with time-domain SpeakerBeam
Paper authors
Paper abstract
Target speech extraction, which extracts a single target source from a mixture given clues about the target speaker, has attracted increasing attention. We have recently proposed SpeakerBeam, which exploits an adaptation utterance of the target speaker to extract his/her voice characteristics, which are then used to guide a neural network towards extracting the speech of that speaker. SpeakerBeam presents a practical alternative to speech separation, as it enables tracking the speech of a target speaker across utterances and achieves promising speech extraction performance. However, it sometimes fails when speakers have similar voice characteristics, such as in same-gender mixtures, because it is difficult to discriminate the target speaker from the interfering speakers. In this paper, we investigate strategies for improving the speaker discrimination capability of SpeakerBeam. First, we propose a time-domain implementation of SpeakerBeam similar to that proposed for the time-domain audio separation network (TasNet), which has achieved state-of-the-art performance for speech separation. In addition, we investigate (1) the use of spatial features to better discriminate speakers when microphone array recordings are available, and (2) the addition of an auxiliary speaker identification loss to help learn more discriminative voice characteristics. We show experimentally that these strategies greatly improve speech extraction performance, especially for same-gender mixtures, and outperform TasNet in terms of target speech extraction.
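As an illustration of the auxiliary speaker identification loss mentioned in the abstract, the following is a minimal sketch of how an extraction objective and a speaker classification objective could be combined into one training loss. The abstract does not specify the extraction loss, the weighting, or any function names. The sketch assumes a negative scale-invariant SNR (a common choice for time-domain networks such as TasNet) as the extraction term, a cross-entropy over speaker logits as the auxiliary term, and a hypothetical weight `alpha`. All names here are illustrative and are not taken from the paper.

```python
import numpy as np


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB for 1-D signals."""
    # Remove the mean so the measure ignores DC offsets.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to get the scaled reference.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = scale * target
    noise = estimate - projection
    ratio = (np.dot(projection, projection) + eps) / (np.dot(noise, noise) + eps)
    return 10.0 * np.log10(ratio)


def speaker_id_loss(logits, speaker_index):
    """Cross-entropy between speaker logits and the true speaker index."""
    # Subtract the maximum for a numerically stable log-softmax.
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[speaker_index]


def multitask_loss(estimate, target, logits, speaker_index, alpha=0.5):
    """Extraction loss plus a weighted auxiliary speaker identification loss.

    `alpha` is a hypothetical weight, not a value reported in the paper.
    """
    extraction = -si_snr(estimate, target)
    auxiliary = speaker_id_loss(logits, speaker_index)
    return extraction + alpha * auxiliary
```

In this sketch, `logits` would come from a classifier applied to the speaker embedding computed from the adaptation utterance, so that minimizing the auxiliary term encourages embeddings that separate speakers well. Setting `alpha` to 0 reduces the objective to the extraction term alone.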