Paper Title

Early Forecasting of Text Classification Accuracy and F-Measure with Active Learning

Paper Authors

Thomas Orth, Michael Bloodgood

Paper Abstract

When creating text classification systems, one of the major bottlenecks is the annotation of training data. Active learning has been proposed to address this bottleneck using stopping methods to minimize the cost of data annotation. An important capability for improving the utility of stopping methods is to effectively forecast the performance of the text classification models. Forecasting can be done through the use of logarithmic models regressed on some portion of the data as learning is progressing. A critical unexplored question is what portion of the data is needed for accurate forecasting. There is a tension, where it is desirable to use less data so that the forecast can be made earlier, which is more useful, versus it being desirable to use more data, so that the forecast can be more accurate. We find that when using active learning it is even more important to generate forecasts earlier so as to make them more useful and not waste annotation effort. We investigate the difference in forecasting difficulty when using accuracy and F-measure as the text classification system performance metrics and we find that F-measure is more difficult to forecast. We conduct experiments on seven text classification datasets in different semantic domains with different characteristics and with three different base machine learning algorithms. We find that forecasting is easiest for decision tree learning, moderate for Support Vector Machines, and most difficult for neural networks.
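
To make the forecasting idea concrete, below is a minimal Python sketch that fits a logarithmic model of the form y = a·ln(x) + b to early learning-curve points and extrapolates it to a larger annotation budget. The functional form, the example data, and the use of scipy's curve fitting are illustrative assumptions; the exact model and fitting procedure used in the paper may differ.

```python
# A minimal sketch of performance forecasting via logarithmic regression.
# Assumes a curve of the form y = a * ln(x) + b fit to early learning-curve
# points; the paper's exact functional form and fitting details may differ.
import numpy as np
from scipy.optimize import curve_fit

def log_model(x, a, b):
    """Logarithmic learning-curve model: performance vs. training-set size."""
    return a * np.log(x) + b

# Hypothetical early learning-curve observations collected during the first
# rounds of active learning: (number of annotated examples, F-measure).
sizes = np.array([100.0, 200.0, 300.0, 400.0, 500.0])
scores = np.array([0.62, 0.68, 0.71, 0.73, 0.74])

# Regress the logarithmic model on the portion of the data seen so far.
params, _ = curve_fit(log_model, sizes, scores)
a, b = params

# Forecast the metric at a larger annotation budget before spending it.
forecast_size = 5000
print(f"Forecast at {forecast_size} examples: {log_model(forecast_size, a, b):.3f}")
```

The tension described in the abstract shows up directly here: fitting on fewer early points lets the forecast be made sooner, at the cost of a less constrained fit.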
