Paper Title

Why the pseudo label based semi-supervised learning algorithm is effective?

Paper Authors

Zeping Min, Qian Ge, Cheng Tai

Paper Abstract

Recently, pseudo label based semi-supervised learning has achieved great success in many fields. The core idea of the pseudo label based semi-supervised learning algorithm is to use a model trained on the labeled data to generate pseudo labels on the unlabeled data, and then train a model to fit those pseudo labels. In this paper, we give a theoretical analysis of why pseudo label based semi-supervised learning is effective. We mainly compare the generalization error of models trained under two settings: (1) there are N labeled data; (2) there are N unlabeled data and a suitable initial model. Our analysis shows, first, that as the amount of unlabeled data tends to infinity, the pseudo label based semi-supervised learning algorithm obtains a model with the same generalization error upper bound as a model obtained by ordinary supervised training as the amount of labeled data tends to infinity. More importantly, we prove that when the amount of unlabeled data is large enough, the generalization error upper bound of the model obtained by the pseudo label based semi-supervised learning algorithm converges to the optimal upper bound at a linear rate. We also give a lower bound on the sampling complexity required to achieve this linear convergence rate. Our analysis contributes to understanding the empirical success of pseudo label based semi-supervised learning.
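The core loop described in the abstract (train on labeled data, generate pseudo labels on unlabeled data, refit on both) can be sketched as follows. This is a minimal illustration only, not the paper's actual setting: the nearest-centroid learner, the data shapes, and the number of self-training rounds are all assumptions made for the example.

```python
import numpy as np

def nearest_centroid_fit(X, y):
    """Fit a toy classifier: one centroid per class (stand-in for any learner)."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def nearest_centroid_predict(model, X):
    """Predict the class of the nearest centroid."""
    classes, centroids = model
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return classes[dists.argmin(axis=1)]

def pseudo_label_ssl(X_lab, y_lab, X_unlab, rounds=3):
    """Pseudo label based semi-supervised learning, as in the abstract."""
    # Step 1: train an initial model on the labeled data.
    model = nearest_centroid_fit(X_lab, y_lab)
    for _ in range(rounds):
        # Step 2: use the current model to generate pseudo labels
        # on the unlabeled data.
        y_pseudo = nearest_centroid_predict(model, X_unlab)
        # Step 3: retrain the model to fit labeled + pseudo-labeled data.
        X_all = np.vstack([X_lab, X_unlab])
        y_all = np.concatenate([y_lab, y_pseudo])
        model = nearest_centroid_fit(X_all, y_all)
    return model
```

In the paper's comparison, setting (2) corresponds to calling a loop like this with a small labeled set (or just a suitable initial model) and N unlabeled points; the analysis concerns how the resulting generalization error behaves as N grows.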
