Paper Title
Recent Progress in the CUHK Dysarthric Speech Recognition System
Paper Authors
Paper Abstract
Despite the rapid progress of automatic speech recognition (ASR) technologies in the past few decades, recognition of disordered speech remains a highly challenging task to date. Disordered speech presents a wide spectrum of challenges to current data-intensive deep neural network (DNN) based ASR technologies that predominantly target normal speech. This paper presents recent research efforts at the Chinese University of Hong Kong (CUHK) to improve the performance of disordered speech recognition systems on the largest publicly available UASpeech dysarthric speech corpus. A set of novel modelling techniques, including neural architecture search, data augmentation using spectro-temporal perturbation, model-based speaker adaptation, and cross-domain generation of visual features within an audio-visual speech recognition (AVSR) system framework, were employed to address the above challenges. The combination of these techniques produced the lowest published word error rate (WER) of 25.21% on the UASpeech test set of 16 dysarthric speakers, and an overall WER reduction of 5.4% absolute (17.6% relative) over the CUHK 2018 dysarthric speech recognition system, which featured a 6-way DNN system combination and cross adaptation of systems trained on out-of-domain normal speech data. Bayesian model adaptation further allows rapid adaptation to individual dysarthric speakers using as little as 3.06 seconds of speech. The efficacy of these techniques was further demonstrated on a CUDYS Cantonese dysarthric speech recognition task.