Paper Title

WikiOmnia: generative QA corpus on the whole Russian Wikipedia

Paper Authors

Dina Pisarevskaya, Tatiana Shavrina

Paper Abstract

The general QA field has been developing its methodology with the Stanford Question Answering Dataset (SQuAD) as the key reference benchmark. However, compiling factual questions requires time- and labour-consuming annotation, which limits the potential size of training data. We present the WikiOmnia dataset, a new publicly available set of QA pairs and corresponding Russian Wikipedia article summary sections, composed with a fully automated generative pipeline. The dataset includes every available article from Wikipedia for the Russian language. The WikiOmnia pipeline is available open-source and has also been tested for creating SQuAD-formatted QA on other domains, such as news texts, fiction, and social media. The resulting dataset includes two parts: raw data on the whole Russian Wikipedia (7,930,873 QA pairs with paragraphs for ruGPT-3 XL and 7,991,040 QA pairs with paragraphs for ruT5-large) and cleaned data with strict automatic verification (over 160,000 QA pairs with paragraphs for ruGPT-3 XL and over 3,400,000 QA pairs with paragraphs for ruT5-large).
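Since the corpus is distributed as SQuAD-style records (a source paragraph plus a generated question-answer pair), a minimal Python sketch of consuming it with the Hugging Face `datasets` library may help illustrate the data structure. The repository name `RussianNLP/wikiomnia`, the configuration name `wikiomnia_ruT5_filtered`, and the field names used below are assumptions for illustration, not details confirmed by the abstract.

```python
# A minimal sketch of reading WikiOmnia-style records, assuming the corpus
# is hosted on the Hugging Face Hub. The repo name, config name, and field
# names ("title", "summary", "question", "answer") are assumptions.
from datasets import load_dataset

# Stream the data to avoid downloading millions of QA pairs at once.
ds = load_dataset(
    "RussianNLP/wikiomnia",     # hypothetical hub repository name
    "wikiomnia_ruT5_filtered",  # hypothetical config: cleaned ruT5-large part
    split="train",
    streaming=True,
)

for sample in ds.take(3):
    # Each record pairs a Wikipedia article summary with one generated QA pair.
    print(sample["title"])
    print(sample["summary"][:200], "...")
    print("Q:", sample["question"])
    print("A:", sample["answer"])
```

Streaming is used here because the raw parts alone contain roughly eight million QA pairs each; for fine-tuning on a subset, the cleaned (strictly verified) configurations described in the abstract would be the natural starting point.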
