Learning Dynamics

Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks

Review Summary Proof that SGD performs variational inference: it minimizes a potential along with an entropic regularization term. However, this potential differs from the loss used to compute back-propagation gradients; the two are equal only if the gradient noise is isotropic (i.e., proportional to the identity).
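A minimal statement of the result (a sketch following the paper's notation, where $\rho^{ss}$ is SGD's steady-state distribution over parameters $x$, $H(\rho)$ is the entropy, and $\beta^{-1}$ plays the role of a temperature set by the learning rate and batch size):

$$\rho^{ss} = \arg\min_{\rho} \; \mathbb{E}_{x \sim \rho}\big[\Phi(x)\big] \;-\; \beta^{-1} H(\rho),$$

with $\Phi = f$ (the training loss) only in the special case of isotropic gradient noise.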

A Variational Analysis of Stochastic Gradient Algorithms

Review Summary The authors expand on their previous work on the continuous-time limit of SGD. They show how SGD with a constant learning rate can be modelled as a stochastic differential equation (SDE) that reaches a stationary distribution rather than converging to a point.
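A toy illustration of the stationary-distribution claim (a minimal sketch under simplifying assumptions, not the paper's setup): constant-step-size SGD on a 1-D quadratic with additive Gaussian gradient noise is a discrete Ornstein-Uhlenbeck (AR(1)) process whose iterates settle into a stationary Gaussian.

```python
import numpy as np

# Constant-LR SGD on f(theta) = 0.5 * a * theta^2 with Gaussian gradient noise.
# The iterates form an AR(1) / discrete Ornstein-Uhlenbeck process with a
# stationary Gaussian distribution instead of converging to theta = 0.
a, lr, noise_std = 1.0, 0.1, 1.0
rng = np.random.default_rng(0)

theta, samples = 5.0, []
for t in range(200_000):
    grad = a * theta + noise_std * rng.normal()  # noisy gradient estimate
    theta -= lr * grad
    if t > 10_000:  # discard burn-in, then record the chain
        samples.append(theta)

# Stationary variance of the AR(1) chain: (lr*sigma)^2 / (1 - (1 - lr*a)^2),
# approximately lr * sigma^2 / (2a) for small lr.
empirical = np.var(samples)
predicted = (lr * noise_std) ** 2 / (1 - (1 - lr * a) ** 2)
print(f"empirical var {empirical:.4f} vs predicted {predicted:.4f}")
```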

Three Factors Influencing Minima in SGD

Review Summary SGD behaves similarly across different batch sizes as long as the learning-rate-to-batch-size (LR/BS) ratio is held constant. The authors note that SGD runs with the same LR/BS ratio are different discretizations of the same stochastic differential equation.
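One way to see why only the ratio matters (a sketch of the continuous-time picture; $\eta$ is the learning rate, $S$ the batch size, $\Sigma(\theta)$ the gradient-noise covariance, and constants are glossed over):

$$d\theta = -\nabla L(\theta)\,dt + \sqrt{\tfrac{\eta}{S}}\,\Sigma(\theta)^{1/2}\,dW_t,$$

so, to first order, any $(\eta, S)$ pair with the same ratio $\eta / S$ discretizes the same SDE.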

Spherical Motion Dynamics of Deep Neural Networks with Batch Normalization and Weight Decay

Review Summary Batch normalization induces scale invariance of the loss with respect to the weights, i.e. $L(x; \theta) = L(x; k\theta)$ for any $k > 0$, for parameters $\theta$ followed by BN (the expression is not mathematically precise).
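A quick numerical check of the claimed invariance (a minimal NumPy sketch, assuming a plain linear layer followed by batch normalization with no learned affine parameters; `k` is an arbitrary positive scale):

```python
import numpy as np

def batch_norm(z, eps=1e-5):
    # Normalize each feature over the batch dimension (no learned affine).
    mu = z.mean(axis=0, keepdims=True)
    var = z.var(axis=0, keepdims=True)
    return (z - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 10))  # a batch of inputs
W = rng.normal(size=(10, 5))   # weights of a linear layer followed by BN
k = 3.7                        # arbitrary positive scale

out = batch_norm(x @ W)
out_scaled = batch_norm(x @ (k * W))
print(np.allclose(out, out_scaled, atol=1e-5))  # True: output unchanged under W -> k*W
```

Intuitively, BN divides by the batch standard deviation, which absorbs any positive rescaling of the incoming weights.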