Paper Title
How does unlabeled data improve generalization in self-training? A one-hidden-layer theoretical analysis
Paper Authors
Paper Abstract
Self-training, a semi-supervised learning algorithm, leverages a large amount of unlabeled data to improve learning when the labeled data are limited. Despite empirical successes, its theoretical characterization remains elusive. To the best of our knowledge, this work establishes the first theoretical analysis for the known iterative self-training paradigm and proves the benefits of unlabeled data in both training convergence and generalization ability. To make our theoretical analysis feasible, we focus on the case of one-hidden-layer neural networks. However, theoretical understanding of iterative self-training is non-trivial even for a shallow neural network. One of the key challenges is that existing neural network landscape analysis built upon supervised learning no longer holds in the (semi-supervised) self-training paradigm. We address this challenge and prove that iterative self-training converges linearly with both convergence rate and generalization accuracy improved in the order of $1/\sqrt{M}$, where $M$ is the number of unlabeled samples. Experiments from shallow neural networks to deep neural networks are also provided to justify the correctness of our established theoretical insights on self-training.
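The abstract above describes the iterative self-training paradigm: a model trained on a small labeled set repeatedly pseudo-labels a large unlabeled pool with its own predictions and is then retrained on the combined data. The following is a minimal illustrative sketch of that loop with a one-hidden-layer ReLU network, not the paper's exact construction; the squared loss, full-batch gradient descent, per-round re-initialization, and all hyperparameters and helper names (e.g. `OneHiddenLayerNet`, `iterative_self_training`) are assumptions made here for illustration.

```python
# A minimal sketch (assumed setup, not the paper's exact algorithm) of iterative
# self-training with a one-hidden-layer ReLU network: fit on the labeled set,
# then repeatedly pseudo-label the unlabeled pool and retrain on the union.
import numpy as np

rng = np.random.default_rng(0)


def relu(z):
    return np.maximum(z, 0.0)


class OneHiddenLayerNet:
    """y = a^T relu(W x), trained by full-batch gradient descent on squared loss."""

    def __init__(self, d_in, width=64, lr=1e-2):
        self.W = rng.normal(scale=1.0 / np.sqrt(d_in), size=(width, d_in))
        self.a = rng.normal(scale=1.0 / np.sqrt(width), size=width)
        self.lr = lr

    def predict(self, X):
        return relu(X @ self.W.T) @ self.a

    def fit(self, X, y, steps=500):
        n = len(X)
        for _ in range(steps):
            H = relu(X @ self.W.T)               # hidden activations, shape (n, width)
            err = H @ self.a - y                 # residuals, shape (n,)
            grad_a = H.T @ err / n
            grad_W = ((err[:, None] * (H > 0)) * self.a).T @ X / n
            self.a -= self.lr * grad_a
            self.W -= self.lr * grad_W


def iterative_self_training(X_lab, y_lab, X_unlab, rounds=5):
    model = OneHiddenLayerNet(X_lab.shape[1])
    model.fit(X_lab, y_lab)                      # warm start on the labeled set
    for _ in range(rounds):
        pseudo_y = model.predict(X_unlab)        # pseudo-labels from the current model
        X_all = np.vstack([X_lab, X_unlab])
        y_all = np.concatenate([y_lab, pseudo_y])
        model = OneHiddenLayerNet(X_lab.shape[1])
        model.fit(X_all, y_all)                  # retrain on labeled + pseudo-labeled data
    return model


if __name__ == "__main__":
    d = 10
    teacher = OneHiddenLayerNet(d)               # ground-truth network generating targets
    X_lab = rng.normal(size=(50, d))
    X_unlab = rng.normal(size=(2000, d))         # M unlabeled samples
    X_test = rng.normal(size=(1000, d))
    y_lab = teacher.predict(X_lab)
    model = iterative_self_training(X_lab, y_lab, X_unlab)
    print("test MSE:", np.mean((model.predict(X_test) - teacher.predict(X_test)) ** 2))
```

In this sketch, increasing the size of `X_unlab` (the number of unlabeled samples $M$) is the knob that the paper's theory ties to improved convergence rate and generalization accuracy at order $1/\sqrt{M}$.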