Paper Title

Leveraging QA Datasets to Improve Generative Data Augmentation

Authors

Dheeraj Mekala, Tu Vu, Timo Schick, Jingbo Shang

Abstract

The ability of generative language models (GLMs) to generate text has improved considerably in the last few years, enabling their use for generative data augmentation. In this work, we propose CONDA, an approach to further improve GLMs' ability to generate synthetic data by reformulating data generation as context generation for a given question-answer (QA) pair and leveraging QA datasets for training context generators. Then, we cast downstream tasks into the same question answering format and adapt the fine-tuned context generators to the target task domain. Finally, we use the fine-tuned GLM to generate relevant contexts, which are in turn used as synthetic training data for their corresponding tasks. We perform extensive experiments on multiple classification datasets and demonstrate substantial improvements in performance for both few- and zero-shot settings. Our analysis reveals that QA datasets that require high-level reasoning abilities (e.g., abstractive and common-sense QA datasets) tend to give the best boost in performance in both few-shot and zero-shot settings.
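The pipeline described above — cast a downstream classification task into question-answering format, then prompt the fine-tuned context generator with a (question, answer) pair so that the generated context becomes a synthetic training example — can be sketched as follows. This is a minimal illustration only; the prompt template, function names, and the sentiment-analysis example are assumptions, not taken from the paper.

```python
# Hedged sketch of the CONDA-style reformulation: the exact prompt
# template and helper names are illustrative assumptions.

def cast_classification_to_qa(task_question: str, label: str) -> tuple[str, str]:
    """Cast a classification task into QA format: the task is phrased as
    a question, and each class label serves as the answer."""
    return task_question, label

def make_context_generation_prompt(question: str, answer: str) -> str:
    """Build a prompt asking the fine-tuned GLM to generate a context
    consistent with the given QA pair; the generated context is then
    used as a synthetic training example labeled with `answer`."""
    return f"Question: {question}\nAnswer: {answer}\nContext:"

# Example: sentiment classification cast into QA format (illustrative only).
q, a = cast_classification_to_qa("What is the sentiment of this review?", "positive")
prompt = make_context_generation_prompt(q, a)
# A GLM completion of `prompt` would yield a review text, which becomes a
# synthetic "positive" training example for the sentiment classifier.
```

In this framing, one such prompt is issued per class label, so the generator produces label-conditioned contexts that directly populate the synthetic training set.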
