Paper Title

Training Subset Selection for Weak Supervision

Authors

Hunter Lang, Aravindan Vijayaraghavan, David Sontag

Abstract

Existing weak supervision approaches use all the data covered by weak signals to train a classifier. We show both theoretically and empirically that this is not always optimal. Intuitively, there is a tradeoff between the amount of weakly-labeled data and the precision of the weak labels. We explore this tradeoff by combining pretrained data representations with the cut statistic (Muhlenbach et al., 2004) to select (hopefully) high-quality subsets of the weakly-labeled training data. Subset selection applies to any label model and classifier and is very simple to plug in to existing weak supervision pipelines, requiring just a few lines of code. We show our subset selection method improves the performance of weak supervision for a wide range of label models, classifiers, and datasets. Using less weakly-labeled data improves the accuracy of weak supervision pipelines by up to 19% (absolute) on benchmark tasks.
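The abstract describes ranking weakly-labeled examples with the cut statistic over pretrained embeddings and keeping only the most reliable subset. Below is a minimal sketch of that idea, not the authors' exact implementation: it scores each example by how often its nearest neighbors in embedding space disagree with its weak label (a simplified proxy for the cut statistic of Muhlenbach et al., 2004) and keeps the lowest-scoring fraction. The function name, the `k`, and `keep_frac` parameters are illustrative choices.

```python
import numpy as np

def cut_statistic_subset(embeddings, weak_labels, k=10, keep_frac=0.5):
    """Keep the weakly-labeled examples whose labels look most reliable.

    Simplified cut-statistic proxy: an example whose k nearest neighbors
    (in embedding space) mostly share its weak label gets a low "cut"
    score; examples with many disagreeing neighbors are likely mislabeled
    and are dropped.
    """
    X = np.asarray(embeddings, dtype=float)
    y = np.asarray(weak_labels)
    n = len(y)
    # Pairwise squared Euclidean distances between embeddings.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)           # exclude self from neighbors
    knn = np.argsort(d2, axis=1)[:, :k]    # indices of k nearest neighbors
    # Cut score = fraction of neighbors carrying a different weak label.
    cut = (y[knn] != y[:, None]).mean(axis=1)
    # Keep the keep_frac fraction with the lowest cut scores.
    keep = np.argsort(cut)[: int(np.ceil(keep_frac * n))]
    return np.sort(keep)
```

The selected indices can then be passed to any downstream label model and classifier unchanged, which is what makes this kind of filter easy to plug into an existing weak supervision pipeline.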
