Javier Sagastuy-Brena

PhD Candidate at Stanford University




I am a PhD Candidate at the Institute for Computational and Mathematical Engineering at Stanford University. I joined the Stanford Neuroscience and Artificial Intelligence Laboratory, led by P.I. Dan Yamins, in September 2018, drawn by an interest in biologically inspired computational intelligence. My interests have since grown to include using computational models to understand how the brain works, as well as recurrent models of the visual system, learning rules, and deep learning theory.

Before starting grad school, I spent two years working at a Mexican FinTech startup, teaching Computer Science, and doing research in machine learning for text mining. My non-academic interests include alpine skiing, cycling, hiking, cooking, and an ever-increasing obsession with coffee.

sagas [at] hey [dot] com


  • Artificial Intelligence
  • Computational Neuroscience


  • MSc in Computational and Mathematical Engineering, 2019

    Stanford University

  • BSc in Computer Engineering, 2015

    Instituto Tecnológico Autónomo de México

  • BSc in Applied Mathematics, 2015

    Instituto Tecnológico Autónomo de México

Recent Publications

Neural Mechanics: Symmetry and Broken Conservation Laws in Deep Learning Dynamics

Predicting the dynamics of neural network parameters during training is one of the key challenges in building a theoretical foundation …

Two Routes to Scalable Credit Assignment without Weight Symmetry

The neural plausibility of backpropagation has long been disputed, primarily for its use of non-local weight transport - the …

I’ve just read

Mini-reviews to help me track and remember papers I read

I got this idea from a lab mate’s website and thought I’d do something similar, though perhaps not as nice.

Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks

TL;DR The authors present two main results: a thorough mathematical analysis of how SGD performs variational inference, and a characterization of its steady-state behavior as limit cycles. They also present empirical quantities similar to the ones we have measured and analyze them against a null of Brownian motion.

A Variational Analysis of Stochastic Gradient Algorithms

TL;DR Rethinking SGD in the limit of continuous time yields valuable insight, particularly for hyperparameter tuning. This paper introduces the SDE derivation used in the previously reviewed 'Three Factors' paper, and elaborates on minimizing the KL divergence between the stationary distribution of the underlying OU process and the target posterior (as such, it relies on the Bayesian view of ML algorithms rather than the optimization view).
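As a reminder to myself (this is the standard form of the setup, paraphrased rather than copied from the paper): the continuous-time view models SGD with learning rate $\eta$ and batch size $S$ as the SDE

```latex
dw = -\nabla L(w)\, dt + \sqrt{\tfrac{\eta}{S}\, D(w)}\, dB_t
```

where $D(w)$ is the per-sample gradient covariance. For a locally quadratic loss with approximately constant $D$, this is an Ornstein–Uhlenbeck process with a Gaussian stationary distribution, and the hyperparameters can then be chosen to minimize the KL divergence between that stationary distribution and the target posterior.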

Three Factors Influencing Minima in SGD

TL;DR Batch size, learning rate, and gradient covariance influence which minima SGD finds. The LR/BS ratio is key to the width of the minima, which impacts generalization. SGD is analyzed as a discretization of an SDE, with experimental validation of the theory.
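A toy illustration of the LR/BS claim (my own sketch, not code from the paper): on a 1-D quadratic loss with artificial per-sample gradient noise, SGD behaves like an Euler–Maruyama discretization of an SDE whose stationary variance scales with the ratio learning rate / batch size, so two runs with different hyperparameters but the same ratio should settle into fluctuations of similar width.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_sgd(lr, batch_size, steps=20000, sigma=1.0):
    """SGD on L(w) = 0.5 * w**2, where each per-sample gradient is
    w + Gaussian noise (std sigma); the minibatch gradient is their mean."""
    w = 1.0
    history = []
    for _ in range(steps):
        noise = rng.normal(0.0, sigma, size=batch_size).mean()
        w -= lr * (w + noise)
        history.append(w)
    # Estimate the stationary variance from the second half of the run.
    return np.var(history[steps // 2:])

# Different lr and batch size, but the same lr/bs ratio:
# theory predicts a stationary variance of roughly lr * sigma**2 / (2 * bs)
# in both cases, so v1 and v2 should be close.
v1 = run_sgd(lr=0.05, batch_size=5)
v2 = run_sgd(lr=0.10, batch_size=10)
```

Doubling both the learning rate and the batch size leaves the noise scale, and hence the width of the fluctuations around the minimum, roughly unchanged.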



SWE Intern


Jun 2018 – Sep 2018 California

Computer Science Teacher

Modern American School

Jul 2016 – Aug 2017 CDMX, Mexico
Taught Object-Oriented Programming in C#.

Tech Lead


Aug 2015 – Aug 2017 CDMX, Mexico

Research Intern


Mar 2015 – Aug 2015 Böblingen, Germany