Paper Title
Multi-View Spatial-Temporal Network for Continuous Sign Language Recognition
Paper Authors
Paper Abstract
Sign language is a beautiful visual language and the primary language of people with speech and hearing impairments. However, sign language involves many complex expressions that are difficult for the general public to understand and master, so sign language recognition algorithms can significantly facilitate communication between hearing-impaired and hearing people. Traditional continuous sign language recognition typically uses sequence learning methods based on Convolutional Neural Networks (CNN) and Long Short-Term Memory networks (LSTM). These methods learn spatial and temporal features only separately, so they cannot capture the complex spatial-temporal features of sign language; LSTMs also struggle to learn long-term dependencies. To alleviate these problems, this paper proposes a multi-view spatial-temporal continuous sign language recognition network consisting of three parts. The first is a Multi-View Spatial-Temporal feature extractor Network (MSTN), which directly extracts spatial-temporal features from RGB and skeleton data; the second is a Transformer-based sign language encoder network, which learns long-term dependencies; the third is a Connectionist Temporal Classification (CTC) decoder network, which predicts the full meaning of the continuous sign language. Our algorithm is tested on two public sign language datasets, SLR-100 and RWTH-PHOENIX-Weather 2014T, and achieves excellent performance on both: a word error rate of 1.9% on SLR-100 and 22.8% on RWTH-PHOENIX-Weather 2014T.
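To make the three-part pipeline concrete, below is a minimal PyTorch sketch of the structure the abstract describes: a joint spatial-temporal feature extractor, a Transformer encoder over the resulting frame features, and per-frame gloss logits trained with a CTC loss. Everything specific in it is an assumption, not the paper's implementation: the MSTN is stood in for by a single 3D convolution over the RGB stream only (the actual MSTN is multi-view and also consumes skeleton data), and all layer sizes, depths, and the gloss vocabulary size are illustrative.

```python
import torch
import torch.nn as nn

class SignPipelineSketch(nn.Module):
    """Illustrative sketch of the abstract's three-part network.
    Module choices and dimensions are assumptions, not the paper's design."""

    def __init__(self, feat_dim=512, num_glosses=100):
        super().__init__()
        # Part 1 (stand-in for MSTN): a 3D CNN that couples space and time
        # in one operation; the paper's MSTN additionally fuses a skeleton view.
        self.mstn = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=(3, 7, 7),
                      stride=(1, 2, 2), padding=(1, 3, 3)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # keep time axis, pool space away
        )
        self.proj = nn.Linear(64, feat_dim)
        # Part 2: Transformer encoder to model long-term dependencies
        # across the frame sequence.
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Part 3: per-frame gloss logits for a CTC decoder; index 0 is
        # reserved as the CTC blank symbol.
        self.classifier = nn.Linear(feat_dim, num_glosses + 1)

    def forward(self, rgb):
        # rgb: (batch, 3, time, height, width)
        x = self.mstn(rgb)                # (batch, 64, time, 1, 1)
        x = x.flatten(2).transpose(1, 2)  # (batch, time, 64)
        x = self.encoder(self.proj(x))    # (batch, time, feat_dim)
        return self.classifier(x).log_softmax(-1)  # CTC expects log-probs

# Usage: train against gloss label sequences with CTC, which aligns the
# per-frame predictions to the (shorter) gloss sequence automatically.
model = SignPipelineSketch()
log_probs = model(torch.randn(2, 3, 16, 112, 112))  # (2, 16, 101)
ctc = nn.CTCLoss(blank=0)
targets = torch.randint(1, 101, (2, 5))             # two 5-gloss sentences
loss = ctc(log_probs.transpose(0, 1),               # CTC wants (T, N, C)
           targets,
           input_lengths=torch.full((2,), 16, dtype=torch.long),
           target_lengths=torch.full((2,), 5, dtype=torch.long))
```

At inference time the CTC output is typically decoded with greedy or beam search (collapsing repeats and removing blanks) to recover the gloss sequence; word error rate is then computed against the reference glosses.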