Audio-visual Recognition of Overlapped speech for the LRS2 dataset

Paper title

Audio-visual Recognition of Overlapped speech for the LRS2 dataset

Paper authors

Yu, Jianwei, Zhang, Shi-Xiong, Wu, Jian, Ghorbani, Shahram, Wu, Bo, Kang, Shiyin, Liu, Shansong, Liu, Xunying, Meng, Helen, Yu, Dong

Abstract

Automatic recognition of overlapped speech remains a highly challenging task to date. Motivated by the bimodal nature of human speech perception, this paper investigates the use of audio-visual technologies for overlapped speech recognition. Three issues associated with the construction of audio-visual speech recognition (AVSR) systems are addressed. First, the basic architecture designs of AVSR systems, i.e. end-to-end and hybrid, are investigated. Second, purposefully designed modality fusion gates are used to robustly integrate the audio and visual features. Third, in contrast to a traditional pipelined architecture containing explicit speech separation and recognition components, a streamlined and integrated AVSR system optimized consistently using the lattice-free MMI (LF-MMI) discriminative criterion is also proposed. The proposed LF-MMI time-delay neural network (TDNN) system establishes the state of the art for the LRS2 dataset. Experiments on overlapped speech simulated from the LRS2 dataset suggest the proposed AVSR system outperformed the audio-only baseline LF-MMI DNN system by up to 29.98% absolute in word error rate (WER) reduction, and produced recognition performance comparable to a more complex pipelined system. Consistent performance improvements of 4.89% absolute in WER reduction over the baseline AVSR system using feature fusion are also obtained.
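The abstract reports its gains as absolute WER reductions. As a minimal sketch of what that metric means, the Python below computes WER as the word-level edit distance divided by the number of reference words, and takes an absolute reduction as the plain difference between two WER values. The function names and the example sentences are illustrative assumptions, not taken from the paper or its toolkit.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def absolute_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    """Absolute reduction: difference in WER, not a ratio of the baseline."""
    return baseline_wer - system_wer


if __name__ == "__main__":
    # Hypothetical transcripts, for illustration only.
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER: {wer:.4f}")
```

Under this definition, a system that lowers WER from a hypothetical 40% to 10% achieves a 30% absolute reduction, which is a 75% relative reduction. The figures in the abstract are stated as absolute.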
