Shampoo vs. Adam vs. Muon

Three approaches to preconditioning gradients: Adam (diagonal, per-parameter), Shampoo (full-matrix Kronecker), and Muon (orthogonalized updates via Newton-Schulz). Each uses increasingly rich curvature information at increasing computational cost.

What Each Does

Adam tracks per-parameter running averages of the gradient ($m_t$) and squared gradient ($v_t$). The update divides by $\sqrt{v_t}$, giving each parameter its own adaptive learning rate. This is diagonal preconditioning: it ignores correlations between parameters.
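The diagonal update can be sketched in a few lines (a hypothetical `adam_step` helper, following the Kingma & Ba formulation with bias correction):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: diagonal preconditioning via per-parameter moments."""
    m = beta1 * m + (1 - beta1) * grad        # EMA of the gradient
    v = beta2 * v + (1 - beta2) * grad**2     # EMA of the squared gradient
    m_hat = m / (1 - beta1**t)                # bias correction (t starts at 1)
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter step
    return theta, m, v
```

Note that after bias correction on the first step, the update is approximately `lr * sign(grad)`: every parameter moves by roughly the same magnitude regardless of its gradient scale, which is the diagonal preconditioning in action.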

Shampoo tracks full left and right covariance matrices $L_t = \sum_{s \le t} G_s G_s^\top$ and $R_t = \sum_{s \le t} G_s^\top G_s$ for each weight matrix. The update applies $L_t^{-1/4} G_t R_t^{-1/4}$, accounting for cross-parameter correlations. This is full-matrix Kronecker preconditioning: an approximation to the natural gradient.
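The accumulate-and-precondition step can be sketched as follows (hypothetical helpers; real implementations add damping, exponential decay of the statistics, and amortize the root computation over many steps):

```python
import numpy as np

def matrix_power(sym, p, eps=1e-6):
    """Fractional power of a symmetric PSD matrix via eigendecomposition."""
    w, Q = np.linalg.eigh(sym)
    return (Q * np.maximum(w, eps) ** p) @ Q.T

def shampoo_precondition(G, L, R):
    """Accumulate L += G G^T and R += G^T G, then apply L^{-1/4} G R^{-1/4}."""
    L = L + G @ G.T
    R = R + G.T @ G
    precond = matrix_power(L, -0.25) @ G @ matrix_power(R, -0.25)
    return precond, L, R
```

A useful sanity check: with a single accumulated gradient $G = U \Sigma V^\top$, the preconditioned update collapses to the polar factor $U V^\top$, which also explains the family resemblance to Muon's orthogonalized updates.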

Muon orthogonalizes the gradient update using Newton-Schulz iterations, which approximate the polar factor of the gradient matrix: for $G = U \Sigma V^\top$, the polar factor is $U V^\top$. This projects the update onto the Stiefel manifold of semi-orthogonal matrices. This is manifold preconditioning: it constrains the update geometry rather than scaling it.
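Newton-Schulz can be illustrated with the simple cubic iteration below (a sketch: production Muon uses a tuned quintic polynomial and about 5 iterations, but the cubic version shows the mechanism):

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=10):
    """Approximate the polar factor U V^T of G = U S V^T with the cubic
    Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X.
    Each singular value s is driven toward 1 via s <- 1.5 s - 0.5 s^3,
    which converges for s in (0, sqrt(3)); the Frobenius normalization
    below guarantees all singular values start in (0, 1]."""
    X = G / np.linalg.norm(G)     # scale so every singular value is <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X
```

The key practical point: the loop is matrix multiplies only, with no eigendecomposition or inverse roots, which is why it maps well onto GPU hardware.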

Side-by-Side Comparison

| Property | Adam | Shampoo | Muon |
| --- | --- | --- | --- |
| Preconditioner | Diagonal ($\mathrm{diag}(v_t)^{-1/2}$) | Kronecker ($L^{-1/4} \cdot R^{-1/4}$) | Orthogonal (polar factor) |
| Curvature captured | Per-parameter scale | Cross-parameter correlations | Update geometry |
| Cost per step | $O(d)$ | $O(m^3 + n^3)$ per $m \times n$ matrix | $O(m^2 n)$ for $m \le n$ (matmul-only NS iterations) |
| Memory overhead | $2d$ (two moments) | $m^2 + n^2$ per weight matrix | $O(mn)$ per weight matrix |
| Steps to convergence | Baseline | Fewer (better conditioning) | Fewer on transformers |
| Wall-clock efficiency | Best for small models | Best for large, ill-conditioned models | Competitive on transformers |
| Theory connection | Diagonal natural gradient | Kronecker-factored natural gradient | Riemannian optimization on the Stiefel manifold |

When Each Wins

Adam wins when:

  • The model is small or the per-step compute budget is tight
  • The loss landscape is well-conditioned (small models, simple architectures)
  • You need a well-understood, debugged optimizer with decades of practical usage
  • You want the safe default: AdamW is the correct choice for most practitioners most of the time

Shampoo wins when:

  • The model has large weight matrices with strong cross-parameter correlations
  • The compute budget per step is not the bottleneck (e.g., data loading dominates)
  • You can amortize the matrix power computation over many steps (e.g., recompute every 100 steps)
  • You are training at Google scale, where even small step-count improvements compound

Muon wins when:

  • Training transformers specifically (where orthogonalized updates match the weight matrix structure)
  • You want better conditioning than Adam without the $O(m^3 + n^3)$ matrix-root cost of Shampoo
  • The Newton-Schulz iterations (typically ~5 rounds of matrix multiplies, $O(m^2 n)$ each for $m \le n$) are affordable

The 1/4-1/4 Exponent in Shampoo

Adam uses $v_t^{-1/2}$: the inverse square root of the second moment. Shampoo uses $L^{-1/4}$ and $R^{-1/4}$: inverse fourth roots. This is not arbitrary. Full-matrix AdaGrad would precondition the vectorized gradient with $H_t^{-1/2}$, where $H_t = \sum_s g_s g_s^\top$. Shampoo's analysis bounds this statistic by a Kronecker product, $H_t \preceq L_t^{1/2} \otimes R_t^{1/2}$, so the factored substitute for $H_t^{-1/2}$ is $(L_t^{1/2} \otimes R_t^{1/2})^{-1/2} = L_t^{-1/4} \otimes R_t^{-1/4}$. Applying that Kronecker product to the vectorized gradient is exactly the matrix update $L_t^{-1/4} G_t R_t^{-1/4}$. The exponent $-1/4$ is derived, not tuned.
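The Kronecker algebra behind the exponent can be checked numerically. The sketch below (hypothetical `psd_power` helper) uses row-major vectorization, for which $(A \otimes B)\,\mathrm{vec}(X) = \mathrm{vec}(A X B^\top)$:

```python
import numpy as np

def psd_power(M, p):
    """Fractional power of a symmetric PSD matrix via eigendecomposition."""
    w, Q = np.linalg.eigh(M)
    return (Q * np.maximum(w, 1e-12) ** p) @ Q.T

rng = np.random.default_rng(0)
G = rng.normal(size=(4, 3))
L = G @ G.T + 0.1 * np.eye(4)   # damped left statistic
R = G.T @ G + 0.1 * np.eye(3)   # damped right statistic

# (L^{1/2} (x) R^{1/2})^{-1/2} equals L^{-1/4} (x) R^{-1/4} for PSD factors
big = psd_power(np.kron(psd_power(L, 0.5), psd_power(R, 0.5)), -0.5)
small = np.kron(psd_power(L, -0.25), psd_power(R, -0.25))
assert np.allclose(big, small, atol=1e-6)

# Applying the Kronecker preconditioner to vec(G) (row-major) matches
# the matrix-form update L^{-1/4} G R^{-1/4}
vec_form = (small @ G.ravel()).reshape(4, 3)
mat_form = psd_power(L, -0.25) @ G @ psd_power(R, -0.25)
assert np.allclose(vec_form, mat_form, atol=1e-6)
```

The second identity is why Shampoo never materializes the $mn \times mn$ Kronecker matrix: two small matrix multiplies give the same result.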

Common Confusions

Watch Out

Muon is not just Adam with an extra step

Adam and Muon have structurally different update rules. Adam rescales each parameter independently. Muon orthogonalizes the entire gradient matrix, which changes the direction of the update, not just its scale. The orthogonalization step (Newton-Schulz) is the core innovation, not a post-processing trick.

Watch Out

Shampoo is not always better than Adam on wall-clock time

Shampoo typically converges in fewer steps, but each step is more expensive ($O(m^3 + n^3)$ for the matrix roots). On small models where Adam's per-step cost dominates, Adam finishes faster. On large models where the matrix power is amortized, Shampoo can win. The crossover point depends on model size, hardware, and amortization frequency.

References

  • Kingma & Ba, "Adam: A Method for Stochastic Optimization" (ICLR 2015)
  • Gupta, Koren, Singer, "Shampoo: Preconditioned Stochastic Tensor Optimization" (ICML 2018)
  • Bernstein & Newhouse, "Old Optimizer, New Norm: An Anthology" (2024). Theoretical framing behind Muon.
  • Anil et al., "Scalable Second Order Optimization for Deep Learning" (2021). Distributed Shampoo.