Paper Title


AUTSL: A Large Scale Multi-modal Turkish Sign Language Dataset and Baseline Methods

Paper Authors

Ozge Mercanoglu Sincan, Hacer Yalim Keles

Abstract


Sign language recognition is a challenging problem where signs are identified by simultaneous local and global articulations of multiple sources, i.e., hand shape and orientation, hand movements, body posture, and facial expressions. Solving this problem computationally for a large vocabulary of signs in real-life settings is still a challenge, even with state-of-the-art models. In this study, we present a new large-scale multi-modal Turkish Sign Language dataset (AUTSL) with a benchmark and provide baseline models for performance evaluations. Our dataset consists of 226 signs performed by 43 different signers, with 38,336 isolated sign video samples in total. Samples contain a wide variety of backgrounds recorded in indoor and outdoor environments. Moreover, the spatial positions and postures of the signers also vary across recordings. Each sample is recorded with Microsoft Kinect v2 and contains RGB, depth, and skeleton modalities. We prepared benchmark training and test sets for user-independent assessments of the models. We trained several deep-learning-based models and provide empirical evaluations using the benchmark: we used CNNs to extract features and unidirectional and bidirectional LSTM models to characterize temporal information. We also incorporated feature pooling modules and temporal attention into our models to improve performance. We evaluated our baseline models on the AUTSL and Montalbano datasets. Our models achieved results competitive with state-of-the-art methods on the Montalbano dataset, i.e., 96.11% accuracy. On AUTSL random train-test splits, our models reached up to 95.95% accuracy. On the proposed user-independent benchmark, our best baseline model achieved 62.02% accuracy. The gap in the performance of the same baseline models shows the challenges inherent in our benchmark dataset. The AUTSL benchmark dataset is publicly available at https://cvml.ankara.edu.tr.
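The abstract describes a pipeline in which per-frame CNN features are summarized over time with temporal attention before classification. The following is a minimal NumPy sketch of that attention-pooling step only, not the authors' actual model: the scoring vector `w`, the feature dimensions, and the function name `temporal_attention_pool` are all illustrative assumptions.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def temporal_attention_pool(frame_features, w):
    """Collapse a (T, D) sequence of per-frame features into one (D,) vector.

    frame_features: (T, D) array, e.g. CNN features for T video frames.
    w: (D,) scoring vector (hypothetical; learned in a real model).
    """
    scores = frame_features @ w            # (T,) one relevance score per frame
    alpha = softmax(scores)                # (T,) attention weights, sum to 1
    pooled = alpha @ frame_features        # (D,) attention-weighted summary
    return pooled, alpha

# Toy usage: 16 frames of 8-dimensional features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 8))
w = rng.normal(size=8)
pooled, alpha = temporal_attention_pool(feats, w)
```

In a full recognizer, `pooled` (or the attention-weighted LSTM outputs) would feed a final classification layer over the 226 sign classes; the attention weights let informative frames dominate the summary instead of uniform average pooling.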
