Paper Title


Multistream neural architectures for cued-speech recognition using a pre-trained visual feature extractor and constrained CTC decoding

Authors

Sanjana Sankar, Denis Beautemps, Thomas Hueber

Abstract


This paper proposes a simple and effective approach for automatic recognition of Cued Speech (CS), a visual communication tool that helps people with hearing impairment understand spoken language through hand gestures that, in complement to lipreading, uniquely identify the uttered phonemes. The proposed approach is based on a pre-trained hand and lips tracker used for visual feature extraction and a phonetic decoder based on a multistream recurrent neural network trained with connectionist temporal classification (CTC) loss and combined with a pronunciation lexicon. The proposed system is evaluated on an updated version of the French CS dataset CSF18, for which the phonetic transcription has been manually checked and corrected. With a decoding accuracy of 70.88% at the phonetic level, the proposed system outperforms our previous CNN-HMM decoder and competes with more complex baselines.
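To make the CTC decoding step concrete, the sketch below shows standard greedy best-path CTC post-processing: take the most probable symbol per frame, collapse consecutive repeats, and drop the blank token. This is only an illustration of the generic CTC mechanism, not the paper's actual constrained decoder (which additionally restricts hypotheses with a pronunciation lexicon); the symbol set, blank marker, and function name here are all hypothetical.

```python
# Illustrative sketch (not the paper's implementation): greedy best-path
# CTC decoding. Per frame, pick the argmax symbol; then collapse
# consecutive repeats and remove the blank, as in standard CTC.

BLANK = "_"  # hypothetical blank symbol

def ctc_greedy_decode(frame_probs, symbols):
    """frame_probs: list of per-frame probability lists over `symbols`.
    Returns the collapsed phoneme sequence."""
    best_path = [symbols[max(range(len(p)), key=p.__getitem__)]
                 for p in frame_probs]
    decoded, prev = [], None
    for s in best_path:
        if s != prev and s != BLANK:  # collapse repeats, skip blanks
            decoded.append(s)
        prev = s
    return decoded

# Toy example: 6 frames over {blank, "a", "b"}
symbols = [BLANK, "a", "b"]
probs = [
    [0.1, 0.8, 0.1],  # a
    [0.1, 0.8, 0.1],  # a (repeat, collapsed)
    [0.8, 0.1, 0.1],  # blank
    [0.1, 0.1, 0.8],  # b
    [0.1, 0.1, 0.8],  # b (repeat, collapsed)
    [0.8, 0.1, 0.1],  # blank
]
print(ctc_greedy_decode(probs, symbols))  # → ['a', 'b']
```

A lexicon-constrained decoder, as in the paper, would instead run a beam search in which only partial paths consistent with entries of the pronunciation lexicon are extended.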
