Paper Title
On Data Augmentation for Extreme Multi-label Classification
Paper Authors
Paper Abstract
In this paper, we focus on data augmentation for the extreme multi-label classification (XMC) problem. One of the most challenging issues in XMC is the long-tailed label distribution, under which even strong models suffer from insufficient supervision for rare labels. To mitigate this label bias, we propose a simple and effective augmentation framework and a new state-of-the-art classifier. Our augmentation framework uses the pre-trained GPT-2 model to generate label-invariant perturbations of the input texts, which augment the existing training data. As a result, it yields substantial improvements over baseline models. Our contributions are twofold: (1) we introduce a new state-of-the-art classifier that uses label attention with RoBERTa, and we combine it with our augmentation framework for further improvement; (2) we present a broad study of how effective different augmentation methods are in the XMC task.
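The idea of label-invariant augmentation can be illustrated with a minimal sketch. The abstract does not say how perturbations are generated or filtered, so everything below is an assumption for illustration: the names `augment_label_invariant` and `toy_perturb` are hypothetical, and `toy_perturb` is a trivial stand-in for a GPT-2 based rewriter, not the authors' method. The only point shown is that each perturbed text inherits the label set of its source example.

```python
from typing import Callable, List, Sequence, Tuple

# One training example: (input text, tuple of labels).
Example = Tuple[str, Tuple[str, ...]]


def augment_label_invariant(
    data: Sequence[Example],
    perturb: Callable[[str], str],
    copies: int = 1,
) -> List[Example]:
    """Return the original examples plus perturbed copies with unchanged labels."""
    augmented: List[Example] = list(data)
    for text, labels in data:
        for _ in range(copies):
            new_text = perturb(text)
            # Skip empty or unchanged outputs; they add no new training signal.
            if new_text and new_text != text:
                augmented.append((new_text, labels))
    return augmented


def toy_perturb(text: str) -> str:
    """Hypothetical stand-in for a generative rewriter: swaps the first two words."""
    words = text.split()
    if len(words) < 2:
        return text
    return " ".join([words[1], words[0]] + words[2:])


train: List[Example] = [
    ("cheap wireless headphones", ("electronics", "audio")),
    ("novel", ("books",)),
]
augmented = augment_label_invariant(train, toy_perturb)
```

In practice `perturb` would wrap a text generator, and a real pipeline would likely need a check that the generated text still supports the original labels; that check is omitted here.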