Paper reviews

Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks

TL;DR The authors present two main results: a mathematical analysis showing that SGD performs variational inference, and a characterization of its steady-state behavior as limit cycles rather than convergence to a point. They measure empirical quantities similar to the ones we have measured and compare them against a Brownian-motion null.
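The Brownian-motion null can be probed with a toy simulation (my own sketch, not the paper's experiments): pure diffusion has a mean-squared displacement (MSD) that keeps growing linearly in time, whereas SGD-like dynamics near a minimum, modeled here as a simple Ornstein-Uhlenbeck process, has an MSD that saturates. The parameters below are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_steps, dim, dt, sigma = 5000, 50, 0.01, 1.0

def msd(traj):
    # mean-squared displacement from the starting point, averaged over dims
    return np.mean((traj - traj[0]) ** 2, axis=1)

# Brownian-motion null: pure diffusion, MSD grows linearly in time.
bm = np.cumsum(rng.normal(0.0, sigma * np.sqrt(dt), size=(n_steps, dim)), axis=0)

# OU-like proxy for SGD near a minimum: the drift pulls the iterate back
# toward the minimum, so the MSD saturates instead of growing without bound.
x = np.zeros(dim)
ou = np.empty((n_steps, dim))
for t in range(n_steps):
    x = x - dt * x + rng.normal(0.0, sigma * np.sqrt(dt), size=dim)
    ou[t] = x

growth_bm = msd(bm)[-500:].mean() / msd(bm)[2000:2500].mean()  # ~2 if linear growth
growth_ou = msd(ou)[-500:].mean() / msd(ou)[2000:2500].mean()  # ~1 once saturated
```

This only distinguishes diffusion from confined dynamics; the paper's stronger claim, that the confined motion is an out-of-equilibrium limit cycle rather than a reversible fluctuation, needs additional diagnostics.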

A Variational Analysis of Stochastic Gradient Algorithms

TL;DR Rethinking SGD in the continuous-time limit yields valuable insight, particularly for hyperparameter tuning. This paper introduces the SDE derivation used in the previously reviewed 'Three Factors' paper, and elaborates on minimizing the KL divergence between the stationary distribution of the underlying Ornstein-Uhlenbeck (OU) process and the target posterior (it thus takes the Bayesian view of ML algorithms rather than the optimization view).
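The OU picture is easy to verify numerically in one dimension. A minimal sketch, with toy values I chose for the curvature, noise variance, and learning rate: constant-step SGD on a quadratic loss with Gaussian gradient noise should have a Gaussian stationary distribution whose variance matches the small-step OU prediction eps * sigma2 / (2 * a).

```python
import numpy as np

rng = np.random.default_rng(1)

a, sigma2 = 2.0, 4.0   # toy curvature and gradient-noise variance (assumed values)
eps = 0.01             # constant learning rate
steps, burn = 200_000, 10_000

noise = rng.normal(0.0, np.sqrt(sigma2), size=steps)
x, xs = 0.0, np.empty(steps)
for t in range(steps):
    x -= eps * (a * x + noise[t])   # constant-step SGD on the quadratic a*x**2/2
    xs[t] = x

empirical_var = xs[burn:].var()           # variance of the stationary samples
theory_var = eps * sigma2 / (2 * a)       # OU stationary variance, small-eps limit
```

Matching the stationary variance to a target posterior variance is then a one-line solve for eps, which is the KL-minimization idea in miniature.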

Three Factors Influencing Minima in SGD

TL;DR Batch size, learning rate, and gradient covariance influence the minima SGD finds. The learning-rate-to-batch-size ratio governs the width of those minima, which in turn affects generalization. SGD is analyzed as a discretization of an SDE, and the theory is validated experimentally.
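The LR/BS claim can be sanity-checked on the same 1-D quadratic toy model (my construction, not the paper's setup): averaging a mini-batch of B noisy gradients scales the noise variance by 1/B, so the stationary variance depends on lr and batch only through their ratio. Two runs with different (lr, batch) but equal lr/batch should land on the same variance.

```python
import numpy as np

def stationary_var(lr, batch, a=2.0, sigma2=4.0, steps=200_000, burn=10_000, seed=0):
    # Constant-SGD on the 1-D quadratic a*x**2/2. Each step uses the mean of
    # `batch` noisy per-sample gradients, so noise variance scales as sigma2/batch.
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, np.sqrt(sigma2 / batch), size=steps)
    x, xs = 0.0, np.empty(steps)
    for t in range(steps):
        x -= lr * (a * x + noise[t])
        xs[t] = x
    return xs[burn:].var()

v_small = stationary_var(lr=0.02, batch=4)   # lr/batch = 0.005
v_large = stationary_var(lr=0.04, batch=8)   # different lr and batch, same ratio
```

Both runs should sit near lr * sigma2 / (2 * a * batch), i.e. the variance (a proxy for how wide a basin SGD can equilibrate in) is set by the ratio, not by lr or batch individually.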

Spherical Motion Dynamics of Deep Neural Networks with Batch Normalization and Weight Decay

TL;DR DNNs trained with weight decay and batch normalization reach an equilibrium in which the weights move on the surface of a sphere in parameter space, and the limiting angular update per step can be computed a priori.
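A toy reproduction of the mechanism (my sketch; the loss model and constants are assumptions, not the paper's experiments): batch normalization makes the loss scale-invariant, so the gradient is orthogonal to the weights and shrinks as 1/||w||. Weight decay pulls the norm down, the orthogonal gradient pushes it up, and at the resulting norm equilibrium the per-step angular update settles at roughly sqrt(2 * wd * lr), which is the a-priori prediction referenced above.

```python
import numpy as np

rng = np.random.default_rng(2)

dim, lr, wd, g = 64, 0.1, 5e-3, 1.0   # toy dimension, learning rate, weight decay
w = rng.normal(size=dim)
angles = []
for t in range(20_000):
    # Noisy gradient of a scale-invariant loss: orthogonal to w, magnitude g/||w||.
    xi = rng.normal(size=dim)
    xi -= (xi @ w) / (w @ w) * w                       # project out radial part
    grad = g * xi / (np.linalg.norm(xi) * np.linalg.norm(w))
    w_new = w - lr * (grad + wd * w)                   # SGD step with weight decay
    cos = w @ w_new / (np.linalg.norm(w) * np.linalg.norm(w_new))
    angles.append(np.arccos(np.clip(cos, -1.0, 1.0)))  # angular update this step
    w = w_new

limit_angle = np.mean(angles[-5000:])   # equilibrium angular update
theory = np.sqrt(2 * wd * lr)           # predicted limit angular update
```

The early angles are transient while the norm relaxes to its equilibrium sphere; only the tail average is compared against the prediction.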