Paper Title

Improved Knowledge Distillation via Full Kernel Matrix Transfer

Paper Authors

Qi Qian, Hao Li, Juhua Hu

Paper Abstract

Knowledge distillation is an effective way to compress models in deep learning. Given a large model (i.e., the teacher model), it aims to improve the performance of a compact model (i.e., the student model) by transferring information from the teacher. Various types of information for distillation have been studied. Recently, a number of works have proposed transferring the pairwise similarity between examples to distill relative information. However, most efforts have been devoted to developing different similarity measurements, while only a small matrix, consisting of the examples within a mini-batch, is transferred at each iteration, which can be inefficient for optimizing the pairwise similarity over the whole data set. In this work, we aim to transfer the full similarity matrix effectively. The main challenge comes from the size of the full matrix, which is quadratic in the number of examples. To address this challenge, we decompose the original full matrix with the Nyström method. Our theoretical analysis indicates that, by selecting appropriate landmark points, the loss for transfer can be further simplified. Concretely, we find that the difference between the full kernel matrices of the teacher and the student can be well bounded by the difference between the corresponding partial matrices, which consist only of the similarities between the original examples and the landmark points. Compared with the full matrix, the size of a partial matrix is linear in the number of examples, which improves the efficiency of optimization significantly. An empirical study on benchmark data sets demonstrates the effectiveness of the proposed algorithm. Code is available at \url{https://github.com/idstcv/KDA}.
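To make the efficiency gain concrete: in the Nyström method, the full kernel matrix K (n x n) is approximated from the n x m partial matrix C of similarities between all n examples and m landmark points, as K ≈ C W⁺ Cᵀ with W the m x m kernel among the landmarks, so matching the teacher's and student's partial matrices costs O(nm) rather than O(n²). Below is a minimal sketch of such a partial kernel matrix transfer loss, assuming a linear (cosine) kernel over normalized features; the function and variable names are hypothetical, and this is not the authors' implementation (see the repository above for that).

```python
import torch
import torch.nn.functional as F

def partial_kernel_loss(student_feat, teacher_feat, landmark_idx):
    """Hypothetical sketch: match n x m partial kernel matrices
    (similarities to m landmark points) instead of the full n x n
    kernel matrices, reducing the cost from quadratic to linear in n."""
    # L2-normalize features so the inner product is cosine similarity.
    s = F.normalize(student_feat, dim=1)  # (n, d_s)
    t = F.normalize(teacher_feat, dim=1)  # (n, d_t)

    # Landmark points taken as a subset of the examples; the paper
    # analyzes how to choose them appropriately.
    s_land, t_land = s[landmark_idx], t[landmark_idx]  # (m, d_s), (m, d_t)

    # Partial kernel matrices: similarities between examples and landmarks.
    K_s = s @ s_land.t()  # (n, m)
    K_t = t @ t_land.t()  # (n, m)

    # Frobenius-style transfer loss; per the paper's analysis, the gap
    # between partial matrices bounds the gap between full matrices.
    return (K_s - K_t).pow(2).mean()

# Toy usage with random features: n = 256 examples, m = 16 landmarks.
student = torch.randn(256, 64)
teacher = torch.randn(256, 512)
idx = torch.randperm(256)[:16]
loss = partial_kernel_loss(student, teacher, idx)
```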
