Paper Title

Continuous Sign Language Recognition via Temporal Super-Resolution Network

Paper Authors

Qidan Zhu, Jing Li, Fei Yuan, Quan Gan

Abstract

Aiming at the problem that deep-learning-based spatial-temporal hierarchical continuous sign language recognition models require a large amount of computation, which limits their real-time application, this paper proposes a temporal super-resolution network (TSRNet). The data is reconstructed into a dense feature sequence to reduce the overall model computation while keeping the final recognition accuracy loss to a minimum. The continuous sign language recognition (CSLR) model via TSRNet mainly consists of three parts: frame-level feature extraction, time-series feature extraction, and the TSRNet, which is located between the frame-level and time-series feature extraction and mainly comprises two branches: a detail descriptor and a rough descriptor. The sparse frame-level features are fused with the features obtained from the two designed branches to form the reconstructed dense frame-level feature sequence, and the connectionist temporal classification (CTC) loss is used for training and optimization after the time-series feature extraction part. To better recover semantic-level information, the overall model is trained with the self-generating adversarial training method proposed in this paper to reduce the model error rate. The training method regards the TSRNet as the generator, and the frame-level processing part and the temporal processing part as the discriminator. In addition, in order to unify the evaluation criteria of model accuracy loss under different benchmarks, this paper proposes the word error rate deviation (WERD), defined as the error rate between the estimated word error rate (WER), obtained from the reconstructed frame-level feature sequence, and the reference WER, obtained from the complete original frame-level feature sequence. Experiments on two large-scale sign language datasets demonstrate the effectiveness of the proposed model.
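As a rough illustration of the WERD metric described above, the sketch below computes a standard edit-distance WER and then the deviation between an estimated and a reference WER. The abstract does not give the exact formula, so the relative-deviation form used in `werd` is an assumption, and the function names are hypothetical.

```python
def wer(ref_words, hyp_words):
    """Standard word error rate: word-level Levenshtein distance
    divided by the reference length."""
    m, n = len(ref_words), len(hyp_words)
    # dp[i][j] = edit distance between ref_words[:i] and hyp_words[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[m][n] / max(m, 1)

def werd(estimated_wer, reference_wer):
    """Word error rate deviation (WERD): assumed here to be the relative
    deviation of the estimated WER (decoded from the reconstructed sparse
    feature sequence) against the reference WER (decoded from the complete
    original feature sequence)."""
    return abs(estimated_wer - reference_wer) / reference_wer
```

Under this reading, a WERD of 0 means the reconstructed sequence loses no recognition accuracy relative to the full original sequence, which makes the metric comparable across benchmarks with different baseline WERs.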
