Paper Title

Lipreading using Temporal Convolutional Networks

Paper Authors

Brais Martinez, Pingchuan Ma, Stavros Petridis, Maja Pantic

Paper Abstract

Lip-reading has attracted a lot of research attention lately thanks to advances in deep learning. The current state-of-the-art model for recognition of isolated words in-the-wild consists of a residual network and Bidirectional Gated Recurrent Unit (BGRU) layers. In this work, we address the limitations of this model and propose changes which further improve its performance. Firstly, the BGRU layers are replaced with Temporal Convolutional Networks (TCN). Secondly, we greatly simplify the training procedure, which allows us to train the model in a single stage. Thirdly, we show that the current state-of-the-art methodology produces models that do not generalize well to variations in sequence length, and we address this issue by proposing a variable-length augmentation. We present results on the largest publicly available datasets for isolated word recognition in English and Mandarin, LRW and LRW1000, respectively. Our proposed model results in an absolute improvement of 1.2% and 3.2%, respectively, on these datasets, which is the new state-of-the-art performance.
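As a rough illustration of two of the changes described in the abstract, the sketch below (PyTorch, with illustrative layer sizes and function names that are not taken from the paper) shows a dilated temporal convolution block with a residual connection, of the kind TCNs stack in place of recurrent layers, and a simple variable-length crop that exposes the model to varying sequence lengths during training.

```python
# Minimal sketch, assuming per-frame features from a ResNet front-end.
# Layer sizes, names, and the crop policy are illustrative, not the
# authors' exact configuration.
import random
import torch
import torch.nn as nn


class TemporalBlock(nn.Module):
    """One TCN block: two dilated 1D convolutions plus a residual connection."""

    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        padding = (kernel_size - 1) // 2 * dilation  # preserve the temporal length
        self.conv1 = nn.Conv1d(channels, channels, kernel_size,
                               padding=padding, dilation=dilation)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size,
                               padding=padding, dilation=dilation)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) -- frame-level features
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)  # residual connection


def variable_length_crop(frames: torch.Tensor, min_frames: int = 10) -> torch.Tensor:
    """Randomly crop a contiguous sub-sequence so training sees varying lengths."""
    total = frames.shape[0]
    length = random.randint(min_frames, total)
    start = random.randint(0, total - length)
    return frames[start:start + length]
```

Because the padding keeps the temporal dimension unchanged, blocks with increasing dilation can be stacked and fed sequences of any length, which is what makes an augmentation over sequence length straightforward to apply.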
