Paper Title

WikiOmnia: generative QA corpus on the whole Russian Wikipedia

Paper Authors

Dina Pisarevskaya, Tatiana Shavrina

Paper Abstract

The general QA field has been developing its methodology with the Stanford Question Answering Dataset (SQuAD) as the key reference benchmark. However, compiling factual questions requires time- and labour-consuming annotation, which limits the potential size of training data. We present the WikiOmnia dataset, a new publicly available set of QA pairs and corresponding Russian Wikipedia article summary sections, composed with a fully automated generative pipeline. The dataset includes every available article from Wikipedia for the Russian language. The WikiOmnia pipeline is available open-source and has also been tested for creating SQuAD-formatted QA on other domains, such as news texts, fiction, and social media. The resulting dataset includes two parts: raw data on the whole Russian Wikipedia (7,930,873 QA pairs with paragraphs for ruGPT-3 XL and 7,991,040 QA pairs with paragraphs for ruT5-large) and cleaned data with strict automatic verification (over 160,000 QA pairs with paragraphs for ruGPT-3 XL and over 3,400,000 QA pairs with paragraphs for ruT5-large).
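Since the corpus is distributed as SQuAD-style records (a source paragraph plus a generated question-answer pair), a minimal Python sketch of consuming it with the Hugging Face `datasets` library may help illustrate the data structure. The repository name `RussianNLP/wikiomnia`, the configuration name `wikiomnia_ruT5_filtered`, and the field names used below are assumptions for illustration, not details confirmed by the abstract.

```python
# A minimal sketch of reading WikiOmnia-style records, assuming the corpus
# is hosted on the Hugging Face Hub. The repo name, config name, and field
# names ("title", "summary", "question", "answer") are assumptions.
from datasets import load_dataset

# Stream the data to avoid downloading millions of QA pairs at once.
ds = load_dataset(
    "RussianNLP/wikiomnia",     # hypothetical hub repository name
    "wikiomnia_ruT5_filtered",  # hypothetical config: cleaned ruT5-large part
    split="train",
    streaming=True,
)

for sample in ds.take(3):
    # Each record pairs a Wikipedia article summary with one generated QA pair.
    print(sample["title"])
    print(sample["summary"][:200], "...")
    print("Q:", sample["question"])
    print("A:", sample["answer"])
```

Streaming is used here because the raw parts alone contain roughly eight million QA pairs each; for fine-tuning on a subset, the cleaned (strictly verified) configurations described in the abstract would be the natural starting point.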
