Open Source Handwritten Text Recognition on Medieval Manuscripts using Mixed Models and Document-Specific Finetuning

Paper title

Open Source Handwritten Text Recognition on Medieval Manuscripts using Mixed Models and Document-Specific Finetuning

Authors

Christian Reul, Stefan Tomasek, Florian Langhanki, Uwe Springmann

Abstract

This paper deals with the task of practical and open source Handwritten Text Recognition (HTR) on German medieval manuscripts. We report on our efforts to construct mixed recognition models which can be applied out-of-the-box without any further document-specific training but also serve as a starting point for finetuning by training a new model on a few pages of transcribed text (ground truth). To train the mixed models we collected a corpus of 35 manuscripts and ca. 12.5k text lines for two widely used handwriting styles, Gothic and Bastarda cursives. Evaluating the mixed models out-of-the-box on four unseen manuscripts resulted in an average Character Error Rate (CER) of 6.22%. After training on 2, 4 and eventually 32 pages the CER dropped to 3.27%, 2.58%, and 1.65%, respectively. While the in-domain recognition and training of models (Bastarda model to Bastarda material, Gothic to Gothic) unsurprisingly yielded the best results, finetuning out-of-domain models to unseen scripts was still shown to be superior to training from scratch. Our new mixed models have been made openly available to the community.
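
The abstract reports all results as a Character Error Rate (CER). As a rough illustration of that metric, the sketch below computes CER as the total edit distance between recognized lines and their ground truth, divided by the total number of ground-truth characters. The function names `levenshtein` and `cer` and the sample lines are hypothetical, and the abstract does not say how the authors normalize text or average over lines and manuscripts, so this is one common definition rather than the paper's evaluation code.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b; insertions, deletions and substitutions cost 1 each."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(pairs) -> float:
    """CER over (ground_truth, prediction) line pairs.

    Assumption: errors are pooled over all lines and divided by the total
    ground-truth length; a per-line average would give different numbers.
    """
    pairs = list(pairs)
    total_chars = sum(len(gt) for gt, _ in pairs)
    if total_chars == 0:
        raise ValueError("ground truth contains no characters")
    total_edits = sum(levenshtein(gt, pred) for gt, pred in pairs)
    return total_edits / total_chars


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from the paper's corpus.
    sample = [("in dem namen", "in dem namem"), ("got vater", "got vater")]
    print(f"CER: {cer(sample):.2%}")
```

Under this definition, a CER of 6.22% corresponds to roughly six character edits per hundred ground-truth characters.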
