Shampoo vs. Adam vs. Muon

Three approaches to preconditioning gradients: Adam (diagonal, per-parameter), Shampoo (full-matrix Kronecker), and Muon (orthogonalized updates via Newton-Schulz). Each uses increasingly rich curvature information at increasing computational cost.

What Each Does

Adam tracks per-parameter running averages of the gradient ($m_t$) and squared gradient ($v_t$). The update divides by $\sqrt{v_t}$, giving each parameter its own adaptive learning rate. This is diagonal preconditioning: it ignores correlations between parameters.
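The diagonal update can be sketched in a few lines (a hypothetical `adam_step` helper, following the Kingma & Ba formulation with bias correction):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: diagonal preconditioning via per-parameter moments."""
    m = beta1 * m + (1 - beta1) * grad        # EMA of the gradient
    v = beta2 * v + (1 - beta2) * grad**2     # EMA of the squared gradient
    m_hat = m / (1 - beta1**t)                # bias correction (t starts at 1)
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter step
    return theta, m, v
```

Note that after bias correction on the first step, the update is approximately `lr * sign(grad)`: every parameter moves by roughly the same magnitude regardless of its gradient scale, which is the diagonal preconditioning in action.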

Shampoo tracks full left and right covariance matrices $L_t = \sum_{s \le t} G_s G_s^\top$ and $R_t = \sum_{s \le t} G_s^\top G_s$ for each weight matrix. The update applies $L_t^{-1/4} G_t R_t^{-1/4}$, accounting for cross-parameter correlations. This is full-matrix Kronecker preconditioning: an approximation to the natural gradient.
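The accumulate-and-precondition step can be sketched as follows (hypothetical helpers; real implementations add damping, exponential decay of the statistics, and amortize the root computation over many steps):

```python
import numpy as np

def matrix_power(sym, p, eps=1e-6):
    """Fractional power of a symmetric PSD matrix via eigendecomposition."""
    w, Q = np.linalg.eigh(sym)
    return (Q * np.maximum(w, eps) ** p) @ Q.T

def shampoo_precondition(G, L, R):
    """Accumulate L += G G^T and R += G^T G, then apply L^{-1/4} G R^{-1/4}."""
    L = L + G @ G.T
    R = R + G.T @ G
    precond = matrix_power(L, -0.25) @ G @ matrix_power(R, -0.25)
    return precond, L, R
```

A useful sanity check: with a single accumulated gradient $G = U \Sigma V^\top$, the preconditioned update collapses to the polar factor $U V^\top$, which also explains the family resemblance to Muon's orthogonalized updates.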

Muon orthogonalizes the gradient update using Newton-Schulz iterations, which approximate the polar factor of the gradient matrix: for $G = U \Sigma V^\top$, the polar factor is $U V^\top$. This projects the update onto the Stiefel manifold of semi-orthogonal matrices. This is manifold preconditioning: it constrains the update geometry rather than scaling it.
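Newton-Schulz can be illustrated with the simple cubic iteration below (a sketch: production Muon uses a tuned quintic polynomial and about 5 iterations, but the cubic version shows the mechanism):

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=10):
    """Approximate the polar factor U V^T of G = U S V^T with the cubic
    Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X.
    Each singular value s is driven toward 1 via s <- 1.5 s - 0.5 s^3,
    which converges for s in (0, sqrt(3)); the Frobenius normalization
    below guarantees all singular values start in (0, 1]."""
    X = G / np.linalg.norm(G)     # scale so every singular value is <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X
```

The key practical point: the loop is matrix multiplies only, with no eigendecomposition or inverse roots, which is why it maps well onto GPU hardware.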

Side-by-Side Comparison

| Property | Adam | Shampoo | Muon |
| --- | --- | --- | --- |
| Preconditioner | Diagonal ($\mathrm{diag}(v_t)^{-1/2}$) | Kronecker ($L^{-1/4} \cdot R^{-1/4}$) | Orthogonal (polar factor) |
| Curvature captured | Per-parameter scale | Cross-parameter correlations | Update geometry |
| Cost per step | $O(d)$ | $O(m^3 + n^3)$ per $m \times n$ matrix | $O(m^2 n)$ for $m \le n$ (matmul-only NS iterations) |
| Memory overhead | $2d$ (two moments) | $m^2 + n^2$ per weight matrix | $O(mn)$ per weight matrix |
| Steps to convergence | Baseline | Fewer (better conditioning) | Fewer on transformers |
| Wall-clock efficiency | Best for small models | Best for large, ill-conditioned models | Competitive on transformers |
| Theory connection | Diagonal natural gradient | Kronecker-factored natural gradient | Riemannian optimization on the Stiefel manifold |

When Each Wins

Adam wins when:

  • The model is small or the per-step compute budget is tight
  • The loss landscape is well-conditioned (small models, simple architectures)
  • You need a well-understood, debugged optimizer with decades of practical usage
  • You want the safe default: AdamW is the correct choice for most practitioners most of the time

Shampoo wins when:

  • The model has large weight matrices with strong cross-parameter correlations
  • The compute budget per step is not the bottleneck (e.g., data loading dominates)
  • You can amortize the matrix power computation over many steps (e.g., recompute every 100 steps)
  • You are training at Google scale, where even small step-count improvements compound

Muon wins when:

  • Training transformers specifically (where orthogonalized updates match the weight matrix structure)
  • You want better conditioning than Adam without the $O(m^3 + n^3)$ matrix-root cost of Shampoo
  • The Newton-Schulz iterations (typically ~5 rounds of matrix multiplies, $O(m^2 n)$ each for $m \le n$) are affordable

The 1/4-1/4 Exponent in Shampoo

Adam uses $v_t^{-1/2}$: the inverse square root of the second moment. Shampoo uses $L^{-1/4}$ and $R^{-1/4}$: inverse fourth roots. This is not arbitrary. Full-matrix AdaGrad would precondition the vectorized gradient with $H_t^{-1/2}$, where $H_t = \sum_s g_s g_s^\top$. Shampoo's analysis bounds this statistic by a Kronecker product, $H_t \preceq L_t^{1/2} \otimes R_t^{1/2}$, so the factored substitute for $H_t^{-1/2}$ is $(L_t^{1/2} \otimes R_t^{1/2})^{-1/2} = L_t^{-1/4} \otimes R_t^{-1/4}$. Applying that Kronecker product to the vectorized gradient is exactly the matrix update $L_t^{-1/4} G_t R_t^{-1/4}$. The exponent $-1/4$ is derived, not tuned.
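The Kronecker algebra behind the exponent can be checked numerically. The sketch below (hypothetical `psd_power` helper) uses row-major vectorization, for which $(A \otimes B)\,\mathrm{vec}(X) = \mathrm{vec}(A X B^\top)$:

```python
import numpy as np

def psd_power(M, p):
    """Fractional power of a symmetric PSD matrix via eigendecomposition."""
    w, Q = np.linalg.eigh(M)
    return (Q * np.maximum(w, 1e-12) ** p) @ Q.T

rng = np.random.default_rng(0)
G = rng.normal(size=(4, 3))
L = G @ G.T + 0.1 * np.eye(4)   # damped left statistic
R = G.T @ G + 0.1 * np.eye(3)   # damped right statistic

# (L^{1/2} (x) R^{1/2})^{-1/2} equals L^{-1/4} (x) R^{-1/4} for PSD factors
big = psd_power(np.kron(psd_power(L, 0.5), psd_power(R, 0.5)), -0.5)
small = np.kron(psd_power(L, -0.25), psd_power(R, -0.25))
assert np.allclose(big, small, atol=1e-6)

# Applying the Kronecker preconditioner to vec(G) (row-major) matches
# the matrix-form update L^{-1/4} G R^{-1/4}
vec_form = (small @ G.ravel()).reshape(4, 3)
mat_form = psd_power(L, -0.25) @ G @ psd_power(R, -0.25)
assert np.allclose(vec_form, mat_form, atol=1e-6)
```

The second identity is why Shampoo never materializes the $mn \times mn$ Kronecker matrix: two small matrix multiplies give the same result.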

Common Confusions

Watch Out

Muon is not just Adam with an extra step

Adam and Muon have structurally different update rules. Adam rescales each parameter independently. Muon orthogonalizes the entire gradient matrix, which changes the direction of the update, not just its scale. The orthogonalization step (Newton-Schulz) is the core innovation, not a post-processing trick.

Watch Out

Shampoo is not always better than Adam on wall-clock time

Shampoo typically converges in fewer steps, but each step is more expensive ($O(m^3 + n^3)$ for the matrix roots). On small models where Adam's per-step cost dominates, Adam finishes faster. On large models where the matrix power is amortized, Shampoo can win. The crossover point depends on model size, hardware, and amortization frequency.

References

  • Kingma & Ba, "Adam: A Method for Stochastic Optimization" (ICLR 2015)
  • Gupta, Koren, Singer, "Shampoo: Preconditioned Stochastic Tensor Optimization" (ICML 2018)
  • Bernstein & Newhouse, "Old Optimizer, New Norm: An Anthology" (2024). Theoretical framing behind Muon.
  • Anil et al., "Scalable Second Order Optimization for Deep Learning" (2021). Distributed Shampoo.