Paper Title
DocVQA: A Dataset for VQA on Document Images
Paper Authors
Paper Abstract
We present a new dataset for Visual Question Answering (VQA) on document images called DocVQA. The dataset consists of 50,000 questions defined on 12,000+ document images. A detailed analysis of the dataset in comparison with similar datasets for VQA and reading comprehension is presented. We report several baseline results by adopting existing VQA and reading comprehension models. Although the existing models perform reasonably well on certain types of questions, there is a large performance gap compared to human performance (94.36% accuracy). The models need to improve specifically on questions where understanding the structure of the document is crucial. The dataset, code and leaderboard are available at docvqa.org.
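As a rough illustration of how an accuracy figure like the 94.36% human accuracy above could be computed over predicted answers and accepted ground-truth answers, here is a minimal sketch. The function names and the lowercase/whitespace normalization are illustrative assumptions, not the paper's official DocVQA evaluation protocol.

```python
# Hypothetical sketch of exact-match QA accuracy; the normalization
# scheme is an illustrative assumption, not DocVQA's official metric.

def normalize(ans: str) -> str:
    """Lowercase and strip surrounding whitespace before comparison."""
    return ans.strip().lower()

def exact_match_accuracy(predictions, ground_truths):
    """Fraction of predictions matching any accepted answer for that question."""
    correct = 0
    for pred, answers in zip(predictions, ground_truths):
        if normalize(pred) in {normalize(a) for a in answers}:
            correct += 1
    return correct / len(predictions)

# Toy example: 2 of 3 predictions match an accepted answer.
preds = ["$1,500", "march 2021", "Acme Corp"]
truths = [["$1,500"], ["March 2021", "03/2021"], ["ACME Inc."]]
print(round(exact_match_accuracy(preds, truths), 4))  # → 0.6667
```

Each question may have several accepted answer strings, so a prediction is counted correct if it matches any of them after normalization.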