The LAM Dataset: A Novel Benchmark for Line-Level Handwritten Text Recognition

Paper Title

The LAM Dataset: A Novel Benchmark for Line-Level Handwritten Text Recognition

Authors

Silvia Cascianelli, Vittorio Pippi, Martin Maarand, Marcella Cornia, Lorenzo Baraldi, Christopher Kermorvant, Rita Cucchiara

Abstract

Handwritten Text Recognition (HTR) is an open problem at the intersection of Computer Vision and Natural Language Processing. The main challenges, when dealing with historical manuscripts, are due to the preservation of the paper support, the variability of the handwriting (even of the same author over a wide time-span), and the scarcity of data from ancient, poorly represented languages. With the aim of fostering the research on this topic, in this paper we present the Ludovico Antonio Muratori (LAM) dataset, a large line-level HTR dataset of Italian ancient manuscripts edited by a single author over 60 years. The dataset comes in two configurations: a basic splitting and a date-based splitting which takes into account the age of the author. The first setting is intended to study HTR on ancient documents in Italian, while the second focuses on the ability of HTR systems to recognize text written by the same writer in time periods for which training data are not available. For both configurations, we analyze quantitative and qualitative characteristics, also with respect to other line-level HTR benchmarks, and present the recognition performance of state-of-the-art HTR architectures. The dataset is available for download at https://aimagelab.ing.unimore.it/go/lam.
