Paper Title

Speaker-Aware Speech-Transformer

Authors

Zhiyun Fan, Jie Li, Shiyu Zhou, Bo Xu

Abstract

Recently, end-to-end (E2E) models have become a competitive alternative to conventional hybrid automatic speech recognition (ASR) systems. However, they still suffer from speaker mismatch between training and testing conditions. In this paper, we use the Speech-Transformer (ST) as the study platform to investigate speaker-aware training of E2E models. We propose a model called the Speaker-Aware Speech-Transformer (SAST), which is a standard ST equipped with a speaker attention module (SAM). The SAM has a static speaker knowledge block (SKB) made of i-vectors. At each time step, the encoder output attends to the i-vectors in the block and generates a weighted combined speaker embedding vector, which helps the model normalize speaker variations. A SAST model trained in this way becomes independent of specific training speakers and thus generalizes better to unseen testing speakers. We investigate different factors of SAM. Experimental results on the AISHELL-1 task show that SAST achieves a relative 6.5% CER reduction (CERR) over the speaker-independent (SI) baseline. Moreover, we demonstrate that SAST still works well even when the i-vectors in the SKB all come from a data source other than the acoustic training set.
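To make the SAM mechanism concrete, below is a minimal PyTorch sketch of the attention step the abstract describes: each encoder frame attends over a static block of i-vectors and receives a weighted speaker embedding. The attention formulation (dot-product), the projection layers, and all dimension choices here are illustrative assumptions, not details confirmed by the paper.

```python
# Minimal sketch of a speaker attention module (SAM) over a static
# speaker knowledge block (SKB) of i-vectors. Hypothetical layer names
# and dot-product attention; not the paper's exact formulation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerAttentionModule(nn.Module):
    """Attend from encoder states to a fixed block of i-vectors."""

    def __init__(self, d_model: int, ivector_dim: int, skb: torch.Tensor):
        super().__init__()
        # SKB: (num_speakers, ivector_dim); static, so a buffer rather
        # than a trainable parameter.
        self.register_buffer("skb", skb)
        self.query_proj = nn.Linear(d_model, ivector_dim)  # assumed
        self.out_proj = nn.Linear(ivector_dim, d_model)    # assumed

    def forward(self, enc_out: torch.Tensor) -> torch.Tensor:
        # enc_out: (batch, time, d_model)
        q = self.query_proj(enc_out)                 # (B, T, ivector_dim)
        # Score each encoder frame against every i-vector in the block.
        scores = torch.matmul(q, self.skb.t())       # (B, T, num_speakers)
        weights = F.softmax(scores, dim=-1)
        # Weighted combination of i-vectors: a soft, per-frame speaker
        # embedding intended to help normalize speaker variation.
        spk_emb = torch.matmul(weights, self.skb)    # (B, T, ivector_dim)
        # One simple way to inject it back into the encoder stream.
        return enc_out + self.out_proj(spk_emb)

# Usage sketch: 4 speakers with 100-dim i-vectors, 256-dim encoder states.
skb = torch.randn(4, 100)
sam = SpeakerAttentionModule(d_model=256, ivector_dim=100, skb=skb)
enc_out = torch.randn(2, 50, 256)   # (batch, time, d_model)
print(sam(enc_out).shape)           # torch.Size([2, 50, 256])
```

Because the soft embedding is a convex combination over the whole block, an unseen test speaker is represented as a mixture of known i-vectors rather than requiring an enrolled i-vector of their own, which is consistent with the abstract's claim that the SKB can even be built from a different data source.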
