Paper Title
A Pipeline for Generating, Annotating and Employing Synthetic Data for Real World Question Answering
Paper Authors
Paper Abstract
Question Answering (QA) is a growing area of research, often used to facilitate the extraction of information from within documents. State-of-the-art QA models are usually pre-trained on domain-general corpora like Wikipedia and thus tend to struggle on out-of-domain documents without fine-tuning. We demonstrate that synthetic domain-specific datasets can be generated easily using domain-general models, while still providing significant improvements to QA performance. We present two new tools for this task: a flexible pipeline for validating the synthetic QA data and training downstream models on it, and an online interface to facilitate human annotation of this generated data. Using this interface, crowdworkers labelled 1117 synthetic QA pairs, which we then used to fine-tune downstream models and improve domain-specific QA performance by 8.75 F1.
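For readers unfamiliar with the metric: the "8.75 F1" improvement refers to the standard token-overlap F1 score used to evaluate extractive QA (as popularized by SQuAD). Below is a minimal sketch of that metric, assuming simple whitespace tokenization; the official SQuAD evaluation script additionally normalizes punctuation and articles, which this sketch omits. The example strings are illustrative, not from the paper.

```python
import collections

def squad_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer string,
    in the style of SQuAD extractive-QA evaluation."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # Count tokens shared between prediction and gold (multiset intersection).
    common = collections.Counter(pred_tokens) & collections.Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Illustrative usage: two of three tokens overlap, giving F1 = 2/3.
print(squad_f1("the synthetic pipeline", "a synthetic pipeline"))  # ~0.667
```

Reported gains like "8.75 F1" are typically the difference in this score, averaged over a test set, between the fine-tuned and baseline models.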