Paper Title
Second-order Neural Network Training Using Complex-step Directional Derivative
Paper Authors
Paper Abstract
While the superior performance of second-order optimization methods such as Newton's method is well known, they are hardly used in practice for deep learning because neither assembling the Hessian matrix nor calculating its inverse is feasible for large-scale problems. Existing second-order methods resort to various diagonal or low-rank approximations of the Hessian, which often fail to capture the curvature information necessary to yield a substantial improvement. On the other hand, when training becomes batch-based (i.e., stochastic), noisy second-order information easily contaminates the training procedure unless expensive safeguards are employed. In this paper, we adopt a numerical algorithm for second-order neural network training. We tackle the practical obstacle of Hessian calculation by using the complex-step finite difference (CSFD), a numerical procedure that adds an imaginary perturbation to the function for derivative computation. CSFD is highly robust, efficient, and accurate (as accurate as the analytic result). This method allows us to apply literally any known second-order optimization method to deep learning training. Building on it, we design an effective Newton-Krylov procedure. The key mechanism is to terminate the stochastic Krylov iteration as soon as a disturbing direction is found, so that unnecessary computation is avoided. During the optimization, we monitor the approximation error of the Taylor expansion to adjust the step size. This strategy combines the advantages of line-search and trust-region methods, allowing our method to preserve good local and global convergence at the same time. We have tested our method on various deep learning tasks. The experiments show that our method outperforms existing methods, and it often converges an order of magnitude faster. We believe our method will inspire a wide range of new algorithms for deep learning and numerical optimization.
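To make the core idea concrete, below is a minimal sketch (not code from the paper) of how a complex-step directional derivative yields both directional derivatives and Hessian-vector products. The test function f, its analytic gradient, and the step size h = 1e-20 are illustrative assumptions; the paper applies the same mechanism to network losses inside its Newton-Krylov solver.

import numpy as np

# Illustrative smooth test function (an assumption; any loss that admits a
# complex extension works the same way).
def f(x):
    return np.sum(np.sin(x) ** 2) + 0.5 * (x @ x)

def grad_f(x):
    # Analytic gradient of f; evaluated with a complex argument below to
    # obtain Hessian-vector products.
    return np.sin(2.0 * x) + x

def csfd_dir_derivative(fun, x, v, h=1e-20):
    # Complex-step directional derivative: Im(fun(x + i*h*v)) / h.
    # There is no subtractive cancellation, so h can be tiny and the
    # result is accurate to machine precision.
    return np.imag(fun(x + 1j * h * v)) / h

def csfd_hvp(grad_fun, x, v, h=1e-20):
    # Hessian-vector product H(x) v, obtained as the complex-step
    # directional derivative of the gradient along v.
    return np.imag(grad_fun(x + 1j * h * v)) / h

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(5)
    v = rng.standard_normal(5)

    # Directional derivative of f along v vs. the analytic value grad_f(x) . v
    print(csfd_dir_derivative(f, x, v), grad_f(x) @ v)

    # Hessian-vector product vs. the analytic Hessian (diagonal for this f)
    H = np.diag(2.0 * np.cos(2.0 * x) + 1.0)
    print(csfd_hvp(grad_f, x, v))
    print(H @ v)

Such matrix-free Hessian-vector products are exactly what a Krylov solver (e.g., conjugate gradients) consumes, which is how CSFD can slot into the Newton-Krylov procedure described in the abstract.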