Paper Title

Training Subset Selection for Weak Supervision

Authors

Hunter Lang, Aravindan Vijayaraghavan, David Sontag

Abstract

Existing weak supervision approaches use all the data covered by weak signals to train a classifier. We show both theoretically and empirically that this is not always optimal. Intuitively, there is a tradeoff between the amount of weakly-labeled data and the precision of the weak labels. We explore this tradeoff by combining pretrained data representations with the cut statistic (Muhlenbach et al., 2004) to select (hopefully) high-quality subsets of the weakly-labeled training data. Subset selection applies to any label model and classifier and is very simple to plug in to existing weak supervision pipelines, requiring just a few lines of code. We show our subset selection method improves the performance of weak supervision for a wide range of label models, classifiers, and datasets. Using less weakly-labeled data improves the accuracy of weak supervision pipelines by up to 19% (absolute) on benchmark tasks.
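The abstract describes ranking weakly-labeled examples with the cut statistic over pretrained embeddings and keeping only the most reliable subset. Below is a minimal sketch of that idea, not the authors' exact implementation: it scores each example by how often its nearest neighbors in embedding space disagree with its weak label (a simplified proxy for the cut statistic of Muhlenbach et al., 2004) and keeps the lowest-scoring fraction. The function name, the `k`, and `keep_frac` parameters are illustrative choices.

```python
import numpy as np

def cut_statistic_subset(embeddings, weak_labels, k=10, keep_frac=0.5):
    """Keep the weakly-labeled examples whose labels look most reliable.

    Simplified cut-statistic proxy: an example whose k nearest neighbors
    (in embedding space) mostly share its weak label gets a low "cut"
    score; examples with many disagreeing neighbors are likely mislabeled
    and are dropped.
    """
    X = np.asarray(embeddings, dtype=float)
    y = np.asarray(weak_labels)
    n = len(y)
    # Pairwise squared Euclidean distances between embeddings.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)           # exclude self from neighbors
    knn = np.argsort(d2, axis=1)[:, :k]    # indices of k nearest neighbors
    # Cut score = fraction of neighbors carrying a different weak label.
    cut = (y[knn] != y[:, None]).mean(axis=1)
    # Keep the keep_frac fraction with the lowest cut scores.
    keep = np.argsort(cut)[: int(np.ceil(keep_frac * n))]
    return np.sort(keep)
```

The selected indices can then be passed to any downstream label model and classifier unchanged, which is what makes this kind of filter easy to plug into an existing weak supervision pipeline.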
