Paper Title
Unsupervised Pool-Based Active Learning for Linear Regression
Paper Authors
Abstract
In many real-world machine learning applications, unlabeled data are easy to obtain, but labeling them is time-consuming and/or expensive. It is therefore desirable to select the best samples to label, so that a good machine learning model can be trained from a minimal amount of labeled data. Active learning (AL) has been widely used for this purpose. However, most existing AL approaches are supervised: they train an initial model from a small number of labeled samples, query new samples based on that model, and then update the model iteratively. Few of them consider the completely unsupervised AL problem, i.e., starting from zero, how to optimally select the very first few samples to label without any label information at all. This problem is very challenging because no label information can be utilized. This paper studies unsupervised pool-based AL for linear regression problems. We propose a novel AL approach that simultaneously considers three essential criteria in AL: informativeness, representativeness, and diversity. Extensive experiments on 14 datasets from various application domains, using three different linear regression models (ridge regression, LASSO, and linear support vector regression), demonstrate the effectiveness of the proposed approach.
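To make the unsupervised pool-based setting concrete, the following is a minimal sketch of label-free sample selection. It is not the approach proposed in the paper, whose details the abstract does not give. It is a simple greedy heuristic chosen for illustration: the first sample is the one closest to the pool centroid (a rough proxy for representativeness), and each later sample is the one farthest from all samples already chosen (a rough proxy for diversity). Informativeness is not modeled here. The function name `select_initial_samples` and its parameters are hypothetical.

```python
import math


def select_initial_samples(pool, k):
    """Pick k indices from an unlabeled pool using features only (no labels).

    Illustrative heuristic only; NOT the method proposed in the paper.
    `pool` is a sequence of equal-length numeric feature vectors.
    """
    n = len(pool)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and len(pool)")

    dim = len(pool[0])
    centroid = [sum(x[d] for x in pool) / n for d in range(dim)]

    # Representativeness proxy: start from the sample closest to the centroid.
    first = min(range(n), key=lambda i: math.dist(pool[i], centroid))
    selected = [first]

    # min_dist[i] = distance from sample i to its nearest selected sample.
    min_dist = [math.dist(x, pool[first]) for x in pool]

    while len(selected) < k:
        # Diversity proxy: among unselected samples, take the one farthest
        # from everything selected so far.
        candidates = [i for i in range(n) if i not in selected]
        nxt = max(candidates, key=lambda i: min_dist[i])
        selected.append(nxt)
        min_dist = [min(m, math.dist(x, pool[nxt]))
                    for m, x in zip(min_dist, pool)]

    return selected
```

The returned indices identify the samples to send for labeling; a linear regression model such as ridge regression could then be fit on those labeled samples. In practice the features would usually be standardized first so that no single dimension dominates the Euclidean distances.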