Paper Title
Sentence-Level Sign Language Recognition Framework
Paper Authors
Paper Abstract
We present two solutions to sentence-level sign language recognition (SLR). Sentence-level SLR requires mapping videos of sign language sentences to sequences of gloss labels. Connectionist Temporal Classification (CTC) is used as the classification layer of both models; CTC avoids the need to pre-segment sentences into individual words. The first model is LRCN-based, and the second is a Multi-Cue Network. LRCN is a model in which a CNN is applied to each frame as a feature extractor before the per-frame features are fed into an LSTM. In the first approach, no prior knowledge is leveraged: raw frames are fed into an 18-layer LRCN with CTC on top. In the second approach, three main cues associated with each sign (hand shape, hand position, and hand movement) are extracted using MediaPipe. The 2D hand landmarks are used to construct a hand skeleton, which is fed to a Conv-LSTM model. Hand position and hand movement information, expressed relative to the head, are fed to separate LSTMs. All three sources of information are then integrated into a Multi-Cue Network with a CTC classification layer. We evaluate the performance of the proposed models on the RWTH-PHOENIX-Weather dataset. After an extensive search over model hyper-parameters such as the number of feature maps, input size, batch size, sequence length, LSTM memory cells, regularization, and dropout, we achieved a word error rate (WER) of 35.
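The first approach combines a per-frame CNN, an LSTM, and CTC training. The following is a minimal sketch of that idea, assuming PyTorch as the framework; the layer sizes, gloss vocabulary size, and clip dimensions are placeholders not taken from the paper, and the small two-layer CNN merely stands in for the deeper 18-layer LRCN to illustrate frame-wise feature extraction followed by CTC alignment.

```python
# Sketch of an LRCN-style network with CTC (illustrative, not the authors' code):
# a small CNN is applied to every frame, the per-frame features go through an
# LSTM, and CTC aligns the frame-level outputs to the gloss sequence without
# pre-segmenting the sentence into words.
import torch
import torch.nn as nn

class LRCNWithCTC(nn.Module):
    def __init__(self, num_glosses: int, hidden_size: int = 256):
        super().__init__()
        # Per-frame feature extractor (small stand-in; the paper's LRCN is deeper).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.lstm = nn.LSTM(64, hidden_size, batch_first=True)
        # One extra output class for the CTC blank symbol.
        self.classifier = nn.Linear(hidden_size, num_glosses + 1)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, channels, height, width)
        b, t, c, h, w = frames.shape
        feats = self.cnn(frames.reshape(b * t, c, h, w)).reshape(b, t, -1)
        out, _ = self.lstm(feats)
        # CTC expects log-probabilities shaped (time, batch, classes).
        return self.classifier(out).log_softmax(-1).transpose(0, 1)

# Training-step sketch with dummy data (vocabulary size of 1200 is an assumption).
model = LRCNWithCTC(num_glosses=1200)
ctc = nn.CTCLoss(blank=1200, zero_infinity=True)
frames = torch.randn(2, 16, 3, 112, 112)       # dummy video clips
targets = torch.randint(0, 1200, (2, 5))       # dummy gloss label sequences
log_probs = model(frames)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((2,), 16, dtype=torch.long),
           target_lengths=torch.full((2,), 5, dtype=torch.long))
loss.backward()
```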
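The second approach starts from MediaPipe hand landmarks. Below is a minimal sketch of what that extraction step could look like; the function name `extract_hand_cues`, the fixed head reference point, and the single-hand assumption are illustrative choices rather than details from the paper.

```python
# Sketch of per-frame cue extraction with MediaPipe Hands: 2D landmarks give the
# hand-shape (skeleton) cue, and the wrist position relative to a head reference
# point gives a hand-position cue.
import cv2
import mediapipe as mp
import numpy as np

mp_hands = mp.solutions.hands

def extract_hand_cues(video_path: str, head_xy=(0.5, 0.2)):
    """Return per-frame 21x2 hand landmarks and hand-to-head relative positions.

    head_xy is a fixed normalized head reference point, used here as a stand-in
    for a detected head/face location.
    """
    shapes, positions = [], []
    cap = cv2.VideoCapture(video_path)
    with mp_hands.Hands(static_image_mode=False, max_num_hands=1) as hands:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            result = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if not result.multi_hand_landmarks:
                continue
            lm = result.multi_hand_landmarks[0].landmark
            pts = np.array([(p.x, p.y) for p in lm])     # 21 normalized 2D landmarks
            shapes.append(pts)                            # hand-shape cue
            positions.append(pts[0] - np.array(head_xy))  # wrist relative to head
    cap.release()
    return np.array(shapes), np.array(positions)
```

Hand-movement features could then be derived as frame-to-frame differences of these positions before the cues are passed to their respective LSTMs.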