Paper Title

Deep regularization and direct training of the inner layers of Neural Networks with Kernel Flows

Paper Authors

Gene Ryan Yoo, Houman Owhadi

Paper Abstract

We introduce a new regularization method for Artificial Neural Networks (ANNs) based on Kernel Flows (KFs). KFs were introduced as a method for kernel selection in regression/kriging based on the minimization of the loss of accuracy incurred by halving the number of interpolation points in random batches of the dataset. Writing $f_θ(x) = \big(f^{(n)}_{θ_n}\circ f^{(n-1)}_{θ_{n-1}} \circ \dots \circ f^{(1)}_{θ_1}\big)(x)$ for the functional representation of compositional structure of the ANN, the inner layers outputs $h^{(i)}(x) = \big(f^{(i)}_{θ_i}\circ f^{(i-1)}_{θ_{i-1}} \circ \dots \circ f^{(1)}_{θ_1}\big)(x)$ define a hierarchy of feature maps and kernels $k^{(i)}(x,x')=\exp(- γ_i \|h^{(i)}(x)-h^{(i)}(x')\|_2^2)$. When combined with a batch of the dataset these kernels produce KF losses $e_2^{(i)}$ (the $L^2$ regression error incurred by using a random half of the batch to predict the other half) depending on parameters of inner layers $θ_1,\ldots,θ_i$ (and $γ_i$). The proposed method simply consists in aggregating a subset of these KF losses with a classical output loss. We test the proposed method on CNNs and WRNs without alteration of structure nor output classifier and report reduced test errors, decreased generalization gaps, and increased robustness to distribution shift without significant increase in computational complexity. We suspect that these results might be explained by the fact that while conventional training only employs a linear functional (a generalized moment) of the empirical distribution defined by the dataset and can be prone to trapping in the Neural Tangent Kernel regime (under over-parameterizations), the proposed loss function (defined as a nonlinear functional of the empirical distribution) effectively trains the underlying kernel defined by the CNN beyond regressing the data with that kernel.
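As a concrete illustration of the KF loss described above, the following is a minimal NumPy sketch (not the authors' implementation). It assumes the inner-layer feature map has already been applied to a batch, giving H = $h^{(i)}(X)$, and that the labels Y are one-hot encoded; the function names, the small ridge term `reg`, and the use of plain kernel ridge regression to predict the held-out half are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def gaussian_kernel(H1, H2, gamma):
    # k(x, x') = exp(-gamma * ||h(x) - h(x')||_2^2), evaluated on feature-map outputs
    sq_dists = (np.sum(H1**2, axis=1)[:, None]
                + np.sum(H2**2, axis=1)[None, :]
                - 2.0 * H1 @ H2.T)
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def kf_loss(H, Y, gamma, reg=1e-6, seed=None):
    """KF loss e_2: L^2 error of predicting a random half of a batch from the
    other half by kernel regression on the inner-layer features H = h(X).
    H: (n, d) feature-map outputs, Y: (n, c) one-hot labels."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(H.shape[0])
    fit, held_out = perm[: len(perm) // 2], perm[len(perm) // 2 :]
    K_ff = gaussian_kernel(H[fit], H[fit], gamma)        # kernel on the predicting half
    K_hf = gaussian_kernel(H[held_out], H[fit], gamma)   # cross kernel to the held-out half
    alpha = np.linalg.solve(K_ff + reg * np.eye(len(fit)), Y[fit])  # regularized kernel regression
    Y_pred = K_hf @ alpha
    return np.mean(np.sum((Y[held_out] - Y_pred) ** 2, axis=1))
```

In training, the proposed objective would then aggregate a subset of such losses with the classical output loss, for example as a weighted sum output_loss $+ \sum_i \lambda_i e_2^{(i)}$, with gradients flowing back through the inner-layer parameters $θ_1,\ldots,θ_i$ (and $γ_i$); the weighting scheme is not specified in the abstract and is left here as an open choice.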
