ML Methods
Activation Functions
Nonlinear activation functions in neural networks: sigmoid, tanh, ReLU, Leaky ReLU, GELU, and SiLU. Their gradients, saturation behavior, and impact on trainability.
Why This Matters
Without nonlinear activations, a feedforward network of any depth computes only an affine function. A composition of affine maps is affine. The activation function is what gives neural networks their representational power.
The choice of activation directly controls gradient flow during backpropagation. Sigmoid and tanh saturate for large inputs, driving gradients toward zero and stalling learning in deep networks. ReLU solved this for positive inputs but introduced dead neurons. Modern activations like GELU and SiLU provide smooth alternatives with better empirical performance in transformers. Proper weight initialization and batch normalization also help control gradient magnitude.
Formal Setup
Consider a feedforward network with $L$ layers. Layer $\ell$ computes:

$$z^{(\ell)} = W^{(\ell)} a^{(\ell-1)} + b^{(\ell)}, \qquad a^{(\ell)} = \phi\bigl(z^{(\ell)}\bigr)$$

where $\phi$ is the activation function applied elementwise, $W^{(\ell)}$ is the weight matrix, $b^{(\ell)}$ is the bias vector, $z^{(\ell)}$ is the pre-activation, and $a^{(\ell)}$ is the post-activation.
The gradient of the loss with respect to $W^{(\ell)}$ depends on a product of factors $\operatorname{diag}\bigl(\phi'(z^{(k)})\bigr)\,W^{(k)}$, one for each later layer $k > \ell$. If $\phi'$ is small, gradients vanish. If $\phi'$ is large, gradients explode.
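As a concrete illustration, here is a minimal NumPy sketch (names and sizes are illustrative) of one layer's forward computation and the per-layer backprop factor $\operatorname{diag}(\phi'(z))\,W$:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One layer's forward step: z = W a_prev + b, a = phi(z).
W = rng.normal(scale=0.1, size=(16, 16))
b = np.zeros(16)
a_prev = rng.normal(size=16)

z = W @ a_prev + b
a = sigmoid(z)

# Backprop factor for this layer: diag(phi'(z)) @ W.
# Gradients through L layers multiply L such factors together.
J = np.diag(sigmoid(z) * (1 - sigmoid(z))) @ W
print(np.linalg.norm(J, 2))  # spectral norm well below 1 here -> attenuation per layer
```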
Core Definitions
Sigmoid
$\sigma(x) = \dfrac{1}{1 + e^{-x}}$. Output range: $(0, 1)$. Derivative: $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$. Maximum derivative is $1/4$ at $x = 0$.
Hyperbolic Tangent
$\tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$. Output range: $(-1, 1)$. Zero-centered. Derivative: $\tanh'(x) = 1 - \tanh^{2}(x)$. Maximum derivative is $1$ at $x = 0$.
ReLU
$\operatorname{ReLU}(x) = \max(0, x)$. Output range: $[0, \infty)$. Derivative: $1$ for $x > 0$, $0$ for $x < 0$, undefined at $x = 0$ (conventionally set to $0$).
Leaky ReLU
$\operatorname{LeakyReLU}(x) = \max(\alpha x, x)$, where $\alpha$ is a small positive constant, typically $\alpha = 0.01$. Derivative: $1$ for $x > 0$, $\alpha$ for $x < 0$.
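The four classical activations above, with their derivatives, in a minimal NumPy sketch (function names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)               # maximum 1/4 at x = 0

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2       # maximum 1 at x = 0

def relu(x):
    return np.maximum(0.0, x)

def d_relu(x):
    return (x > 0).astype(float)       # convention: derivative 0 at x = 0

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def d_leaky_relu(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)  # never zero: no dead neurons
```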
GELU
$\operatorname{GELU}(x) = x\,\Phi(x)$, where $\Phi$ is the standard normal CDF. Smooth function that weights inputs by their percentile under a Gaussian. Used in BERT, GPT-2, and most modern transformers.
Exact form: $\operatorname{GELU}(x) = x\,\Phi(x) = \tfrac{x}{2}\bigl(1 + \operatorname{erf}(x/\sqrt{2})\bigr)$, with $\Phi$ the Gaussian CDF. Fast approximation (Hendrycks & Gimpel 2016): $\operatorname{GELU}(x) \approx \tfrac{x}{2}\bigl(1 + \tanh\bigl(\sqrt{2/\pi}\,(x + 0.044715\,x^{3})\bigr)\bigr)$. Frameworks sometimes default to the approximation for speed; numerical outputs differ at the third decimal place.
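A quick check of the two forms, assuming SciPy is available for $\operatorname{erf}$ (function names are illustrative):

```python
import numpy as np
from scipy.special import erf

def gelu_exact(x):
    # x * Phi(x), with Phi the standard normal CDF
    return 0.5 * x * (1.0 + erf(x / np.sqrt(2.0)))

def gelu_tanh(x):
    # Hendrycks & Gimpel (2016) tanh approximation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.linspace(-6.0, 6.0, 10001)
# Maximum absolute difference over this grid: small, around the third
# decimal place or below, as the text states.
print(np.max(np.abs(gelu_exact(x) - gelu_tanh(x))))
```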
SiLU / Swish
$\operatorname{SiLU}(x) = x\,\sigma(x)$, where $\sigma$ is the sigmoid function. Smooth, non-monotone (slightly negative for $x < 0$). Used in architectures like EfficientNet and LLaMA.
SwiGLU
Gated linear unit variant (Shazeer 2020) built from SiLU and an elementwise-product gate. The feedforward block becomes $\operatorname{FFN}(x) = W_2\bigl(\operatorname{SiLU}(W_g x) \odot W_1 x\bigr)$, which adds a third weight matrix relative to a plain MLP of the same hidden width (a $1.5\times$ parameter increase), so implementations typically shrink the gated width by a factor of $2/3$ to match the FLOPs of a standard FFN. Used in LLaMA 1/2/3, Mistral, Gemma, and PaLM. A plain activation function page is incomplete without it, since it is the FFN nonlinearity in most modern open-weight LLMs.
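A minimal PyTorch sketch of such a block; the module name and the $2/3$ width scaling follow the convention described above and are illustrative, not any specific model's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Sketch of a SwiGLU feedforward block in the style of Shazeer (2020)."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        # Shrink the gated width by 2/3 so the three-matrix block roughly
        # matches the parameters/FLOPs of a plain FFN with hidden width d_ff.
        hidden = int(2 * d_ff / 3)
        self.w_gate = nn.Linear(d_model, hidden, bias=False)  # gate branch
        self.w_up = nn.Linear(d_model, hidden, bias=False)    # value branch
        self.w_down = nn.Linear(hidden, d_model, bias=False)  # projection back

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # FFN(x) = W2( SiLU(W_g x) * (W_1 x) )
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

# Usage sketch:
ffn = SwiGLUFFN(d_model=512, d_ff=2048)
print(ffn(torch.randn(4, 512)).shape)  # torch.Size([4, 512])
```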
Main Theorems
Nonlinearity Is Necessary for Universal Approximation
Statement
If $\phi(x) = x$ (the identity function), then a feedforward network of any depth computes an affine function $f(x) = Ax + c$ for some matrix $A$ and vector $c$. The network cannot approximate arbitrary continuous functions.
Intuition
Each layer computes $x \mapsto W^{(\ell)} x + b^{(\ell)}$. Composing such maps gives $x \mapsto \bigl(W^{(L)} \cdots W^{(1)}\bigr)x + c$ for an accumulated bias vector $c$, which is a single affine map. Depth without nonlinearity adds no representational power.
Proof Sketch
By induction on the depth $L$. Base case: a single affine layer is affine. Inductive step: if $f_{L-1}(x) = Ax + c$, then $f_L(x) = W^{(L)}(Ax + c) + b^{(L)} = \bigl(W^{(L)}A\bigr)x + \bigl(W^{(L)}c + b^{(L)}\bigr)$, which is affine.
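The collapse is easy to verify numerically; a short NumPy sketch with biases omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 4)) for _ in range(10)]
x = rng.normal(size=4)

# Ten "layers" with identity activation...
deep = x
for W in Ws:
    deep = W @ deep

# ...equal a single matrix applied once: A = W10 @ ... @ W1.
A = np.linalg.multi_dot(Ws[::-1])
print(np.allclose(deep, A @ x))  # True
```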
Why It Matters
This result explains why activation functions exist at all. The universal approximation theorem (Cybenko 1989, Hornik 1991) shows that a single hidden layer with a non-polynomial activation can approximate any continuous function on a compact set. Without nonlinearity, no amount of depth helps.
Failure Mode
The result says nonlinearity is necessary but says nothing about which nonlinearity is best. All non-polynomial activations yield universal approximation in theory. The practical differences are about gradient flow and optimization, not expressiveness.
Gradient Flow Analysis
Sigmoid and Tanh: Vanishing Gradients
For a full treatment of gradient pathologies in deep networks, see gradient flow and vanishing gradients. The core mechanism is the product structure of the chain rule.
During backpropagation through $L$ layers, the gradient is a product of per-layer Jacobians, each containing the activation derivative and the weight matrix:

$$\frac{\partial \mathcal{L}}{\partial a^{(0)}} = \left[\prod_{\ell=1}^{L} \bigl(W^{(\ell)}\bigr)^{\top} \operatorname{diag}\bigl(\phi'(z^{(\ell)})\bigr)\right] \frac{\partial \mathcal{L}}{\partial a^{(L)}}$$
For sigmoid, $\sigma'(x) \le 1/4$ everywhere, so each layer's activation derivative contributes a factor of at most $1/4$. The full gradient also depends on the weight matrices $W^{(\ell)}$, which can amplify or further attenuate the signal. With $L$ layers and well-conditioned weights, the activation-derivative factor alone multiplies the gradient by at most $(1/4)^{L}$, dominating any reasonable amplification from the weights. Careful initialization (Xavier/Glorot) targets unit-variance signals but cannot rescue deep sigmoid networks because the $1/4$ cap is multiplicative.
Tanh is better: $\tanh'(0) = 1$, so near-zero pre-activations pass gradients without attenuation. But for large $|x|$, $\tanh'(x) \approx 0$ and the same saturation problem occurs.
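A short numerical comparison of the best case (all pre-activations at zero, activation derivatives only; the depth is chosen for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

depth = 20
z = 0.0                                  # best case for gradient flow
d_sig = sigmoid(z) * (1 - sigmoid(z))    # 0.25
d_tanh = 1.0 - np.tanh(z) ** 2           # 1.0

print(d_sig ** depth)   # ~9.1e-13: sigmoid attenuation from derivatives alone
print(d_tanh ** depth)  # 1.0: tanh passes gradients unattenuated at z = 0
```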
ReLU: Constant Gradients, But Dead Neurons
ReLU has derivative exactly $1$ for positive inputs. No saturation, no vanishing gradient for active neurons. This is why ReLU enabled training of deep networks (Krizhevsky et al. 2012).
The problem: if $z < 0$, the gradient is exactly $0$. A neuron whose pre-activation is negative for all training examples receives zero gradient updates and never recovers. This is the dying ReLU problem.
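A small sketch of the failure mode, with a deliberately pathological bias (all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))   # hypothetical training inputs
w = rng.normal(size=8)
b = -50.0                        # pathological bias: z < 0 for every example

z = X @ w + b
print((z > 0).any())             # False -> ReLU gradient is 0 on every example;
                                 # the neuron never updates and stays dead
print(np.where(z > 0, 1.0, 0.01).min())  # Leaky ReLU would still pass 0.01
```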
Leaky ReLU: No Dead Neurons
By setting the slope to $\alpha > 0$ for $x < 0$, Leaky ReLU ensures every neuron receives a nonzero gradient. In practice, $\alpha = 0.01$ is standard. Parametric ReLU (PReLU) learns $\alpha$ from data.
Gradient Flow Properties
The practical differences among activation functions come down to three measurable quantities: the magnitude of the derivative, the range of inputs that produce near-zero gradients, and how the gradient distribution evolves across layers during training.
Derivative magnitudes
| Activation | Maximum derivative | Dead zone |
|---|---|---|
| Sigmoid | $1/4$ at $x = 0$ | $x \ll 0$ or $x \gg 0$ |
| Tanh | $1$ at $x = 0$ | $\lvert x \rvert \gg 0$ |
| ReLU | $1$ for $x > 0$ | $x < 0$ (permanently) |
| Leaky ReLU ($\alpha = 0.01$) | $1$ for $x > 0$ | none ($\alpha$ for $x < 0$) |
| GELU | $\approx 1.13$ at $x \approx 1.41$ | asymptotically for $x \to -\infty$ |
| SiLU | $\approx 1.10$ at $x \approx 2.40$ | asymptotically for $x \to -\infty$ |
The sigmoid maximum of $1/4$ is the core problem for deep networks. Over $L$ layers, the activation-derivative factor in the gradient is bounded by $(1/4)^{L}$, and the full gradient additionally accumulates products of weight-matrix Jacobians. This is the vanishing gradient problem in quantitative form. Tanh has maximum derivative $1$, so gradients do not shrink from activation derivatives alone when pre-activations are near zero; the problem only appears for saturated neurons.
ReLU dying neurons: a permanent failure mode
A ReLU neuron "dies" when its pre-activation is negative for every training example. Once the neuron is dead, the gradient through it is zero, so the weights never update, and the neuron stays dead. This can affect a significant fraction of neurons depending on initialization and learning rate.
Batch normalization mitigates dying ReLUs by normalizing pre-activations, keeping most neurons in the active ($z > 0$) region. Weight initialization (He initialization, which scales weights by $\sqrt{2/n_{\text{in}}}$) is specifically designed for ReLU: it keeps the pre-activation variance stable across layers at initialization, keeping neurons alive at the start of training. These two interventions together largely solve the dying ReLU problem in practice, without requiring a switch to Leaky ReLU.
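A minimal sketch of why the $\sqrt{2/n_{\text{in}}}$ scaling matters, comparing He and Xavier scales through a stack of ReLU layers (sizes and depth are illustrative; biases omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 512
x = rng.normal(size=(4096, n))

for name, scale in [("He", np.sqrt(2.0 / n)),       # He: std = sqrt(2 / n_in)
                    ("Xavier", np.sqrt(1.0 / n))]:  # Glorot for a square layer
    a = x
    for _ in range(20):
        W = rng.normal(scale=scale, size=(n, n))
        a = np.maximum(a @ W, 0.0)                  # ReLU layer
    # He keeps the activation scale roughly constant; the Xavier signal
    # halves in second moment per layer and collapses toward zero.
    print(name, a.std())
```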
GELU and SiLU: smooth approximations that exceed 1
Both GELU and SiLU have peak derivatives slightly above $1$ (around $1.13$ and $1.10$ respectively). This is a feature, not a bug: the small gradient amplification helps offset attenuation from earlier layers. Both functions are smooth, which makes the gradient landscape more regular and better conditioned for second-order methods.
The GELU derivative is $\operatorname{GELU}'(x) = \Phi(x) + x\,\varphi(x)$, where $\varphi$ is the standard normal density. The SiLU derivative is $\operatorname{SiLU}'(x) = \sigma(x)\bigl(1 + x\,(1 - \sigma(x))\bigr)$. Neither has a closed-form saturation threshold; both decay gradually toward zero as $x \to -\infty$, so dying neurons are possible but far less abrupt than with ReLU.
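The peak values quoted above can be checked numerically, assuming SciPy for the normal CDF and density:

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-5.0, 5.0, 100001)
gelu_grad = norm.cdf(x) + x * norm.pdf(x)        # GELU'(x) = Phi(x) + x * phi(x)
s = 1.0 / (1.0 + np.exp(-x))
silu_grad = s * (1.0 + x * (1.0 - s))            # SiLU'(x) = sigma(x)(1 + x(1 - sigma(x)))

print(x[np.argmax(gelu_grad)], gelu_grad.max())  # ~1.41, ~1.13
print(x[np.argmax(silu_grad)], silu_grad.max())  # ~2.40, ~1.10
```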
Practical choice
For convolutional networks and residual architectures: ReLU remains a solid default, with Leaky ReLU or ELU when dead neurons are a concern. For transformers and large language models: GELU and SiLU are standard, with SwiGLU dominating recent open-weight models. The performance differences among GELU, SiLU, and SwiGLU are small; the choice is mostly convention within an architecture family.
Canonical Examples
Sigmoid saturation in a 5-layer network
Consider a network with sigmoid activations and all pre-activations initialized at $z = 0$. Then $\sigma'(0) = 1/4$. The gradient through 5 layers is attenuated by a factor of $(1/4)^5 = 1/1024 \approx 10^{-3}$. Training will stall in the early layers.
With ReLU and positive pre-activations, the gradient factor is $1^5 = 1$. No attenuation.
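The example can be verified with autograd; the sketch below (assuming PyTorch) shifts inputs by $0.5$ at each layer so that every pre-activation is exactly zero, isolating the activation-derivative factor:

```python
import torch

# Five sigmoid layers with pre-activations pinned at 0 (the best case,
# where sigma'(0) = 1/4). Starting at x = 0.5, each h - 0.5 is exactly 0.
x = torch.tensor([0.5], requires_grad=True)
h = x
for _ in range(5):
    h = torch.sigmoid(h - 0.5)
h.backward()
print(x.grad)  # tensor([0.0010]) = (1/4)**5 = 1/1024
```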
Common Confusions
ReLU is not differentiable at zero, but this does not matter in practice
ReLU has a kink at $x = 0$ where the derivative is undefined. In practice, the probability that any pre-activation is exactly zero is zero (under continuous distributions). Implementations set $\operatorname{ReLU}'(0) = 0$ by convention. This has no measurable effect on training.
All common activations yield universal approximation
The universal approximation theorem holds for any non-polynomial continuous activation. ReLU, sigmoid, tanh, GELU, and SiLU all have this property. The choice of activation does not affect what functions a network can represent in the infinite-width limit. It affects what functions gradient-based optimization will find in finite time with finite width.
GELU is not just smooth ReLU
GELU is sometimes described as a smooth approximation to ReLU. This is misleading. GELU weights each input by its probability under a standard Gaussian: $\operatorname{GELU}(x) = x\,\Phi(x)$, where $\Phi(x) = P(Z \le x)$ for $Z \sim \mathcal{N}(0, 1)$. For large positive $x$, $\Phi(x) \to 1$ and $\operatorname{GELU}(x) \approx x$. For large negative $x$, $\Phi(x) \to 0$ and $\operatorname{GELU}(x) \approx 0$. The transition is smooth and data-dependent, unlike the hard threshold of ReLU.
Summary
- Without nonlinear activations, depth is useless: the network computes an affine map
- Sigmoid and tanh saturate for large $\lvert x \rvert$, causing vanishing gradients in deep networks
- ReLU solved vanishing gradients for positive inputs but introduced dead neurons
- Leaky ReLU, GELU, and SiLU avoid dead neurons while maintaining good gradient flow
- Activation choice affects optimization dynamics, not theoretical expressiveness
- Modern transformers use GELU, SiLU, or SwiGLU; SwiGLU is the FFN nonlinearity in LLaMA, Mistral, Gemma, and PaLM
Exercises
Problem
Compute $\sigma'(x)$ for the sigmoid function and show that $\sigma'(x) = \sigma(x)\,(1 - \sigma(x)) \le 1/4$.
Problem
A 10-layer network uses sigmoid activations. Assume all pre-activations are at $z = 0$ (best case for gradient flow). What is the maximum gradient attenuation factor through all 10 layers from activation derivatives alone?
Problem
Show that for any $\alpha > 0$, Leaky ReLU with slope $\alpha$ for negative inputs has no dead neurons. Precisely: show that for any weight configuration, $\lvert \phi'(z) \rvert \ge \min(\alpha, 1) > 0$ for all $z \neq 0$.
References
Canonical:
- Goodfellow, Bengio, Courville, Deep Learning (2016), Chapter 6.3 (activation functions and gradient flow)
- Cybenko, "Approximation by Superpositions of a Sigmoidal Function" (1989)
- Hornik, "Approximation Capabilities of Multilayer Feedforward Networks" (1991)
- Krizhevsky, Sutskever, Hinton, "ImageNet Classification with Deep CNNs" (NeurIPS 2012) — large-scale ReLU validation
Current:
- Hendrycks & Gimpel, "Gaussian Error Linear Units (GELUs)" (2016), arXiv:1606.08415
- Ramachandran, Zoph, Le, "Searching for Activation Functions" (2017) (Swish/SiLU), arXiv:1710.05941
- Shazeer, "GLU Variants Improve Transformer" (2020), arXiv:2002.05202 (SwiGLU)
- Nair & Hinton, "Rectified Linear Units Improve Restricted Boltzmann Machines" (2010)
- He et al., "Delving Deep into Rectifiers" (ICCV 2015) — PReLU and He initialization for ReLU networks
Last reviewed: April 26, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Differentiation in Rⁿ · layer 0 · tier 1
- Convex Optimization Basics · layer 1 · tier 1
Derived topics
- Batch Normalization · layer 2 · tier 1
- Feedforward Networks and Backpropagation · layer 2 · tier 1
- Weight Initialization · layer 2 · tier 1
- Kolmogorov-Arnold Networks (KANs) · layer 4 · tier 2