Paper Title
Exploiting the Logits: Joint Sign Language Recognition and Spell-Correction
Paper Authors
Abstract
Machine learning techniques have excelled in the automatic semantic analysis of images, reaching human-level performance on challenging benchmarks. Yet, the semantic analysis of videos remains challenging due to the significantly higher dimensionality of the input data and, correspondingly, the significantly greater need for annotated training examples. By studying the automatic recognition of German sign language videos, we demonstrate that on the relatively scarce training data of 2,800 videos, modern deep learning architectures for video analysis (such as ResNeXt), along with transfer learning on large gesture recognition tasks, can achieve about 75% character accuracy. Considering that this leaves us with a probability of under 25% that a 5-letter word is spelled correctly, spell-correction systems are crucial for producing readable output. The contribution of this paper is to propose a convolutional neural network for spell-correction that expects the softmax outputs of the character recognition network (instead of a misspelled word) as input. We demonstrate that purely learning on softmax inputs in combination with scarce training data yields overfitting, as the network learns the inputs by heart. In contrast, training the network on several variants of the logits of the classification output, i.e., scaling by a constant factor, adding random noise, mixing softmax and hardmax inputs, or purely training on hardmax inputs, leads to better generalization while benefiting from the significant information hidden in these outputs (which have 98% top-5 accuracy), yielding readable text despite the comparably low character accuracy.
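The abstract lists several input variants for training the spell-correction network: scaling the logits by a constant factor, adding random noise, mixing softmax and hardmax inputs, and using pure hardmax inputs. A minimal sketch of what such augmentations might look like, assuming NumPy arrays of per-character logits (function names, modes, and default values here are illustrative, not the paper's implementation):

```python
import numpy as np

def softmax(logits, scale=1.0):
    """Numerically stable softmax over the character axis;
    `scale` multiplies the logits before normalization."""
    z = logits * scale
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def hardmax(logits):
    """One-hot vector at the argmax position of each row."""
    out = np.zeros_like(logits, dtype=float)
    out[np.arange(logits.shape[0]), logits.argmax(axis=-1)] = 1.0
    return out

def augment_logits(logits, mode="noise", scale=2.0, noise_std=0.1,
                   mix=0.5, rng=None):
    """Sketch of the augmentation variants named in the abstract:
    - "scale": scale logits by a constant factor before the softmax
    - "noise": add Gaussian noise to the logits
    - "mix":   blend softmax and hardmax representations
    - "hard":  purely one-hot (hardmax) inputs
    """
    rng = rng or np.random.default_rng()
    if mode == "scale":
        return softmax(logits, scale=scale)
    if mode == "noise":
        return softmax(logits + rng.normal(0.0, noise_std, logits.shape))
    if mode == "mix":
        return mix * softmax(logits) + (1.0 - mix) * hardmax(logits)
    if mode == "hard":
        return hardmax(logits)
    return softmax(logits)
```

Each variant still produces a valid probability-like distribution per character position, so the spell-correction network sees inputs of the same shape at training and inference time; only the sharpness and noise level of the distributions change.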