Three Factors Influencing Minima in SGD

TL;DR

Batch size, learning rate, and gradient covariance all influence which minima SGD reaches. The LR/BS ratio is the key quantity controlling the width of those minima, and hence generalization. The analysis treats SGD as a discretization of a stochastic differential equation, and the theory is validated experimentally.

Review

Summary

  • SGD performs similarly across different batch sizes, as long as the LR/BS ratio is kept constant.
  • The authors note that SGD runs with the same LR/BS ratio are different discretizations of the same Stochastic Differential Equation (SDE); a toy simulation of this appears after the list.
  • LR schedules and BS schedules are interchangeable; what matters, again, is how the LR/BS ratio evolves over training.
  • Width of a minimum is defined in terms of the trace of the Hessian $Tr(H)$ at that minimum: lower trace = wider minimum.
    • Assumption 1: At a local minimum, the loss surface is approximated by a quadratic bowl. This lets the training process be approximated by an Ornstein-Uhlenbeck process (a sketch of the resulting derivation follows the list).
    • Assumption 2: $H$ is approximated by the covariance matrix of the stochastic gradients ($H = C$ relies on $C$ being anisotropic).
  • Larger LR/BS correlates with wider minima, giving better generalization.
  • However, increasing $\beta$ while keeping $\frac{LR}{BS}=\frac{\beta \eta}{\beta S}$ constant eventually causes the SDE approximation to break down, leading to lower performance.
  • Discretization errors become apparent at large learning rates.
  • Central Limit Theorem assumptions break down for small datasets and large batches.
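A brief sketch of the Ornstein-Uhlenbeck analysis under these two assumptions (notation mine: $\eta$ is the learning rate, $S$ the batch size, $C$ the gradient covariance): SGD is read as a discretization of the SDE

$$ d\theta = -\nabla L(\theta)\,dt + \sqrt{\tfrac{\eta}{S}}\, R\, dW, \qquad RR^\top = C. $$

Near a minimum, $L(\theta) \approx \tfrac{1}{2}\theta^\top H \theta$ (Assumption 1), so this becomes an Ornstein-Uhlenbeck process whose stationary covariance $\Sigma$ solves $H\Sigma + \Sigma H = \tfrac{\eta}{S} C$. Taking $H = C$ (Assumption 2) gives $\Sigma = \tfrac{\eta}{2S} I$, and hence an expected excess loss of

$$ \mathbb{E}[L(\theta) - L_{\min}] = \tfrac{1}{2}\,\mathrm{Tr}(H\Sigma) = \frac{\eta}{4S}\,\mathrm{Tr}(H), $$

so at a fixed noise level $\eta/S$, SGD only stays near minima whose $\mathrm{Tr}(H)$ is small enough; a larger $\eta/S$ biases it toward wider minima.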
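To see the LR/BS ratio and its breakdown in a toy setting, here is a minimal simulation sketch (not the authors' code; the curvature `h`, noise scale `sigma`, and the base $\eta$, $S$ are made-up illustration values): SGD on a 1-D quadratic loss with Gaussian gradient noise has a stationary variance set by $\eta/S$, and runs scaled by $\beta$ in both LR and BS should match the SDE prediction until $\beta\eta$ becomes large.

```python
# Minimal sketch (not the paper's code): SGD on a 1-D quadratic loss
# L(theta) = 0.5 * h * theta**2, with Gaussian minibatch gradient noise of
# variance sigma**2 / S. In the SDE limit the stationary variance of theta is
# eta * sigma**2 / (2 * h * S), i.e. it depends only on the ratio eta / S.
import numpy as np

def sgd_stationary_variance(eta, S, h=1.0, sigma=1.0, chains=5000, steps=3000, seed=0):
    """Empirical stationary variance of theta under noisy SGD, over parallel runs."""
    rng = np.random.default_rng(seed)
    theta = np.ones(chains)                                        # every run starts at theta = 1
    for _ in range(steps):
        noise = rng.normal(0.0, sigma / np.sqrt(S), size=chains)   # minibatch gradient noise
        theta -= eta * (h * theta + noise)                         # plain SGD step
    return theta.var()

base_eta, base_S = 0.01, 8
sde_prediction = base_eta / (2.0 * base_S)                         # eta * sigma^2 / (2 h S)
for beta in [1, 4, 16, 64]:
    var = sgd_stationary_variance(beta * base_eta, beta * base_S)
    print(f"beta={beta:3d}  empirical Var={var:.6f}  SDE prediction={sde_prediction:.6f}")
```

For small $\beta$ the empirical variance tracks the SDE prediction; as $\beta$ grows (here $\beta=64$ means $\eta=0.64$), the discretization error becomes visible and the variance drifts away from the prediction, mirroring the breakdown described above.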