Paper Title
Channel Distillation: Channel-Wise Attention for Knowledge Distillation
Paper Authors
Paper Abstract
Knowledge distillation transfers the knowledge learned by a teacher network to a student network, so that the student, with fewer parameters and less computation, achieves accuracy close to the teacher's. In this paper, we propose a new distillation method that contains two transfer distillation strategies and a loss decay strategy. The first transfer strategy, called Channel Distillation (CD), is based on channel-wise attention: CD transfers channel-wise information from the teacher to the student. The second is Guided Knowledge Distillation (GKD). Unlike Knowledge Distillation (KD), which lets the student mimic the teacher's prediction distribution on every sample, GKD only makes the student mimic the teacher's correct outputs. The last part is Early Decay Teacher (EDT): during training, we gradually decay the weight of the distillation loss, so that the student, rather than the teacher, gradually takes control of the optimization. Our proposed method is evaluated on ImageNet and CIFAR100. On ImageNet, we achieve a top-1 error of 27.68% with ResNet18, which outperforms state-of-the-art methods. On CIFAR100, we obtain the surprising result that the student outperforms the teacher. Code is available at https://github.com/zhouzaida/channel-distillation.
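Below is a minimal PyTorch-style sketch of the three loss components named in the abstract (CD, GKD, and EDT). The specific choices here, such as global average pooling for the channel attention, the temperature value, the step decay schedule, and the helper names, are illustrative assumptions rather than the authors' reference implementation; see the linked repository for the actual code.

```python
# Sketch of the three losses described in the abstract (assumptions, not the reference code).
import torch
import torch.nn.functional as F

def channel_attention(feat):
    # Assumed channel statistic: global average pooling, (N, C, H, W) -> (N, C).
    return feat.mean(dim=(2, 3))

def cd_loss(student_feats, teacher_feats):
    # Channel Distillation: match channel-wise attention of paired feature maps.
    return sum(F.mse_loss(channel_attention(s), channel_attention(t))
               for s, t in zip(student_feats, teacher_feats))

def gkd_loss(student_logits, teacher_logits, targets, T=4.0):
    # Guided KD: distill only the samples the teacher classifies correctly.
    correct = teacher_logits.argmax(dim=1) == targets
    if correct.sum() == 0:
        return student_logits.new_zeros(())
    p_s = F.log_softmax(student_logits[correct] / T, dim=1)
    p_t = F.softmax(teacher_logits[correct] / T, dim=1)
    return F.kl_div(p_s, p_t, reduction="batchmean") * T * T

def edt_weight(epoch, base=1.0, decay=0.9, every=30):
    # Early Decay Teacher: shrink the distillation weight as training proceeds
    # (the step schedule here is an assumed example).
    return base * (decay ** (epoch // every))

# Example total loss for epoch e under these assumptions:
# loss = F.cross_entropy(student_logits, targets) + edt_weight(e) * (cd + gkd)
```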