Theory

Limiting Dynamics of SGD: Modified Loss, Phase Space Oscillations, and Anomalous Diffusion

In this work we explore the limiting dynamics of deep neural networks trained with stochastic gradient descent (SGD). As observed previously, long after performance has converged, networks continue to move through parameter space by a process of …

Neural Mechanics: Symmetry and Broken Conservation Laws in Deep Learning Dynamics

Predicting the dynamics of neural network parameters during training is one of the key challenges in building a theoretical foundation for deep learning. A central obstacle is that the motion of a network in high-dimensional parameter space undergoes …
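
A minimal sketch of the symmetry-to-conservation-law connection the title refers to (notation mine, not taken from the summary): for parameters $\theta$ whose loss is scale invariant, e.g. weights feeding into a normalization layer, the gradient is orthogonal to $\theta$, so gradient flow conserves the squared norm,

$$
\frac{d}{dt}\lVert\theta\rVert^{2} = -2\,\theta^{\top}\nabla_{\theta}L(\theta) = 0,
$$

while weight decay, momentum, and finite step sizes break this conservation law.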

Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks

Review Summary Variational inference: a proof that SGD minimizes a potential together with an entropic regularization term. However, this potential differs from the loss used to compute the backpropagation gradients; the two are equal only if the gradient noise is isotropic (i.e., its covariance is proportional to the identity).
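
A minimal sketch of the continuous-time picture being summarized (the symbols $\Phi$, $D$, $\beta$ are my notation, not the summary's): model SGD as the SDE

$$
d\theta = -\nabla_{\theta} L(\theta)\,dt + \sqrt{2\beta^{-1} D(\theta)}\, dW_t .
$$

Its stationary distribution $\rho^{*}$ solves a variational (free-energy) problem of the form

$$
\rho^{*} = \arg\min_{\rho}\; \mathbb{E}_{\rho}\!\left[\Phi(\theta)\right] - \beta^{-1} H(\rho),
$$

where $H$ is the entropy and $\Phi$ is the potential referred to above; $\Phi = L$ only when the diffusion matrix $D$ is isotropic.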

A Variational Analysis of Stochastic Gradient Algorithms

Review Summary The authors expand on their previous work on the continuous-time limit of SGD. They show how SGD with a constant learning rate (LR) can be modelled as a stochastic differential equation (SDE) that reaches a stationary distribution.
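
A toy numerical sketch of this claim (my own setup, not the paper's experiments): constant-step SGD on a one-dimensional quadratic with additive Gaussian gradient noise is an AR(1) process whose iterates fluctuate around the minimum with a predictable stationary variance instead of converging to it.

```python
import numpy as np

# Constant-LR SGD on L(theta) = a * theta**2 / 2 with Gaussian gradient noise.
# The iterates form an AR(1) process (a discretized Ornstein-Uhlenbeck process)
# and settle into a stationary distribution around the minimum.
rng = np.random.default_rng(0)
a, sigma = 1.0, 1.0          # curvature and gradient-noise std (arbitrary toy values)
lr = 0.1                     # constant learning rate
steps, burn_in = 200_000, 10_000

theta = 5.0                  # start away from the minimum at 0
trace = np.empty(steps)
for t in range(steps):
    grad = a * theta + sigma * rng.standard_normal()  # noisy gradient
    theta -= lr * grad
    trace[t] = theta

# Exact stationary variance of the recursion theta' = (1 - lr*a) * theta - lr * noise:
predicted = lr * sigma**2 / (a * (2.0 - lr * a))
print(f"empirical variance: {trace[burn_in:].var():.4f}")
print(f"predicted variance: {predicted:.4f}")
```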

Three Factors Influencing Minima in SGD

Review Summary SGD performs similarly across different batch sizes (BS) as long as the LR/BS ratio is held constant. The authors note that SGD runs with the same LR/BS ratio are different discretizations of the same SDE.
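
In symbols (notation mine): with learning rate $\eta$, batch size $B$, and per-example gradient covariance $\Sigma(\theta)$, the SGD iterates can be read as an Euler–Maruyama discretization of

$$
d\theta = -\nabla_{\theta} L(\theta)\,dt + \sqrt{\tfrac{\eta}{B}}\,\Sigma(\theta)^{1/2}\, dW_t ,
$$

whose drift and diffusion depend on $\eta$ and $B$ only through the ratio $\eta/B$, so runs sharing that ratio discretize the same limiting process.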

Spherical Motion Dynamics of Deep Neural Networks with Batch Normalization and Weight Decay

Review Summary Batch normalization (BN) induces scale invariance of the loss with respect to the weights, i.e. $L(x; \theta) = L(x; k\theta)$ for any $k > 0$, where $\theta$ are parameters feeding into a BN layer (the expression is schematic rather than mathematically precise).
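
One standard way to make the statement precise (a sketch, not a quote from the paper): if $L(x; k\theta) = L(x; \theta)$ for all $k > 0$, then differentiating with respect to $k$ at $k = 1$ gives

$$
\theta^{\top}\nabla_{\theta} L(x; \theta) = 0 ,
$$

so gradient updates are orthogonal to $\theta$ and move it along a sphere of constant norm, while weight decay shrinks $\lVert\theta\rVert$; the balance between the two produces the spherical motion dynamics of the title.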