Title

A Comprehensive and Modularized Statistical Framework for Gradient Norm Equality in Deep Neural Networks

Authors

Zhaodong Chen, Lei Deng, Bangyan Wang, Guoqi Li, Yuan Xie

Abstract

In recent years, many metrics have been proposed to identify networks that are free of gradient explosion and vanishing. However, due to the diversity of network components and the complex serial-parallel hybrid connections in modern DNNs, the evaluation of existing metrics usually requires strong assumptions or complex statistical analysis, or has a limited field of application, which constrains their spread in the community. In this paper, inspired by Gradient Norm Equality and dynamical isometry, we first propose a novel metric called Block Dynamical Isometry, which measures the change of gradient norm in an individual block. Because our Block Dynamical Isometry is norm-based, its evaluation requires weaker assumptions than the original dynamical isometry. To mitigate the challenging derivations, we propose a highly modularized statistical framework based on free probability. Our framework includes several key theorems for handling complex serial-parallel hybrid connections and a library that covers the diversity of network components. In addition, several sufficient prerequisites are provided. Powered by our metric and framework, we analyze a wide range of initialization, normalization, and network-structure techniques. We find that Gradient Norm Equality is a universal philosophy behind them. We then improve several existing methods based on our analysis, including an activation-function selection strategy for initialization techniques, a new configuration for weight normalization, and a depth-aware way to derive the coefficients in SeLU. Moreover, we propose a novel normalization technique named second moment normalization, which is theoretically 30% faster than batch normalization without accuracy loss. Last but not least, our conclusions and methods are evidenced by extensive experiments on multiple models over CIFAR10 and ImageNet.
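The Block Dynamical Isometry metric described above is a norm-based condition on a block's input-output Jacobian J: roughly, the normalized trace φ(JJᵀ) = tr(JJᵀ)/dim should concentrate around 1, so the block neither amplifies nor attenuates gradient norms. Below is a minimal NumPy sketch of a Monte-Carlo check of that condition for a single ReLU block under He initialization; the block choice, dimensions, and trial count are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def phi(mat):
    """Normalized trace phi(A) = tr(A) / dim, the quantity that
    Block Dynamical Isometry asks to concentrate around 1."""
    return np.trace(mat) / mat.shape[0]

rng = np.random.default_rng(0)
dim, trials = 256, 200
samples = []

for _ in range(trials):
    # He/Kaiming-style init: Var(W_ij) = 2 / fan_in, the classic
    # choice that keeps gradient norms stable through ReLU.
    W = rng.normal(0.0, np.sqrt(2.0 / dim), size=(dim, dim))
    x = rng.normal(size=dim)

    # Jacobian of y = relu(W x) w.r.t. x is D W, where D zeroes
    # the rows whose pre-activation is negative.
    active = (W @ x) > 0
    J = active[:, None] * W

    samples.append(phi(J @ J.T))

print(f"E[phi(JJ^T)]   ~ {np.mean(samples):.3f}")  # close to 1
print(f"Var[phi(JJ^T)] ~ {np.var(samples):.5f}")   # close to 0
```

A mean near 1 with small variance is the behavior the metric rewards; a mean drifting away from 1 as blocks are stacked would signal gradient explosion or vanishing.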
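The claimed speedup of second moment normalization over batch normalization comes from dropping the mean-subtraction step. The sketch below assumes the straightforward formulation implied by the name, dividing activations by √E[x²] computed over the batch so that only one reduction pass is needed instead of BN's two; the function names and toy data are hypothetical, and the paper additionally pairs the technique with weight-side adjustments that are omitted here.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Standard batch normalization: two batch reductions
    (mean and variance), then scale and shift."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def second_moment_norm(x, gamma, beta, eps=1e-5):
    """Second-moment variant: a single batch reduction E[x^2]
    and no mean subtraction -- the assumed source of the
    ~30% speedup cited in the abstract."""
    m2 = (x ** 2).mean(axis=0)
    return gamma * x / np.sqrt(m2 + eps) + beta

rng = np.random.default_rng(1)
x = rng.normal(0.5, 2.0, size=(64, 8))  # (batch, features)
gamma, beta = np.ones(8), np.zeros(8)

y = second_moment_norm(x, gamma, beta)
print((y ** 2).mean(axis=0))  # per-feature second moment ~ 1
```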
