ML Methods
Activation Functions
Nonlinear activation functions in neural networks: sigmoid, tanh, ReLU, Leaky ReLU, GELU, and SiLU. Their gradients, saturation behavior, and impact on trainability.
Why This Matters
Without nonlinear activations, a feedforward network of any depth computes only an affine function. A composition of affine maps is affine. The activation function is what gives neural networks their representational power.
The choice of activation directly controls gradient flow during backpropagation. Sigmoid and tanh saturate for large inputs, driving gradients toward zero and stalling learning in deep networks. ReLU solved this for positive inputs but introduced dead neurons. Modern activations like GELU and SiLU provide smooth alternatives with better empirical performance in transformers. Proper weight initialization and batch normalization also help control gradient magnitude.
Formal Setup
Consider a feedforward network with $L$ layers. Layer $\ell$ computes:

$$z^{(\ell)} = W^{(\ell)} a^{(\ell-1)} + b^{(\ell)}, \qquad a^{(\ell)} = \phi\bigl(z^{(\ell)}\bigr)$$

where $\phi$ is the activation function applied elementwise, $W^{(\ell)}$ is the weight matrix, $b^{(\ell)}$ is the bias vector, $z^{(\ell)}$ is the pre-activation, and $a^{(\ell)}$ is the post-activation.
The gradient of the loss with respect to $W^{(\ell)}$ depends on a product of factors $\operatorname{diag}\bigl(\phi'(z^{(k)})\bigr)\,W^{(k)}$, one for each later layer $k > \ell$. If $\phi'$ is small, gradients vanish. If $\phi'$ is large, gradients explode.
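As a concrete illustration, here is a minimal NumPy sketch (names and sizes are illustrative) of one layer's forward computation and the per-layer backprop factor $\operatorname{diag}(\phi'(z))\,W$:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One layer's forward step: z = W a_prev + b, a = phi(z).
W = rng.normal(scale=0.1, size=(16, 16))
b = np.zeros(16)
a_prev = rng.normal(size=16)

z = W @ a_prev + b
a = sigmoid(z)

# Backprop factor for this layer: diag(phi'(z)) @ W.
# Gradients through L layers multiply L such factors together.
J = np.diag(sigmoid(z) * (1 - sigmoid(z))) @ W
print(np.linalg.norm(J, 2))  # spectral norm well below 1 here -> attenuation per layer
```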
Core Definitions
Sigmoid
$\sigma(x) = \dfrac{1}{1 + e^{-x}}$. Output range: $(0, 1)$. Derivative: $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$. Maximum derivative is $1/4$ at $x = 0$.
Hyperbolic Tangent
$\tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$. Output range: $(-1, 1)$. Zero-centered. Derivative: $\tanh'(x) = 1 - \tanh^{2}(x)$. Maximum derivative is $1$ at $x = 0$.
ReLU
$\operatorname{ReLU}(x) = \max(0, x)$. Output range: $[0, \infty)$. Derivative: $1$ for $x > 0$, $0$ for $x < 0$, undefined at $x = 0$ (conventionally set to $0$).
Leaky ReLU
$\operatorname{LeakyReLU}(x) = \max(\alpha x, x)$, where $\alpha$ is a small positive constant, typically $\alpha = 0.01$. Derivative: $1$ for $x > 0$, $\alpha$ for $x < 0$.
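The four classical activations above, with their derivatives, in a minimal NumPy sketch (function names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)               # maximum 1/4 at x = 0

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2       # maximum 1 at x = 0

def relu(x):
    return np.maximum(0.0, x)

def d_relu(x):
    return (x > 0).astype(float)       # convention: derivative 0 at x = 0

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def d_leaky_relu(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)  # never zero: no dead neurons
```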
GELU
$\operatorname{GELU}(x) = x\,\Phi(x)$, where $\Phi$ is the standard normal CDF. Smooth function that weights inputs by their percentile under a Gaussian. Used in BERT, GPT-2, and most modern transformers.
Exact form: $\operatorname{GELU}(x) = x\,\Phi(x) = \tfrac{x}{2}\bigl(1 + \operatorname{erf}(x/\sqrt{2})\bigr)$, with $\Phi$ the Gaussian CDF. Fast approximation (Hendrycks & Gimpel 2016): $\operatorname{GELU}(x) \approx \tfrac{x}{2}\bigl(1 + \tanh\bigl(\sqrt{2/\pi}\,(x + 0.044715\,x^{3})\bigr)\bigr)$. Frameworks sometimes default to the approximation for speed; numerical outputs differ at the third decimal place.
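A quick check of the two forms, assuming SciPy is available for $\operatorname{erf}$ (function names are illustrative):

```python
import numpy as np
from scipy.special import erf

def gelu_exact(x):
    # x * Phi(x), with Phi the standard normal CDF
    return 0.5 * x * (1.0 + erf(x / np.sqrt(2.0)))

def gelu_tanh(x):
    # Hendrycks & Gimpel (2016) tanh approximation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.linspace(-6.0, 6.0, 10001)
# Maximum absolute difference over this grid: small, around the third
# decimal place or below, as the text states.
print(np.max(np.abs(gelu_exact(x) - gelu_tanh(x))))
```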
SiLU / Swish
$\operatorname{SiLU}(x) = x\,\sigma(x)$, where $\sigma$ is the sigmoid function. Smooth, non-monotone (slightly negative for $x < 0$). Used in architectures like EfficientNet and LLaMA.
SwiGLU
Gated linear unit variant (Shazeer 2020) built from SiLU and an elementwise-product gate. The feedforward block becomes $\operatorname{FFN}(x) = W_2\bigl(\operatorname{SiLU}(W_g x) \odot W_1 x\bigr)$, which adds a third weight matrix relative to a plain MLP of the same hidden width (a $1.5\times$ parameter increase), so implementations typically shrink the gated width by a factor of $2/3$ to match the FLOPs of a standard FFN. Used in LLaMA 1/2/3, Mistral, Gemma, and PaLM. A plain activation function page is incomplete without it, since it is the FFN nonlinearity in most modern open-weight LLMs.
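A minimal PyTorch sketch of such a block; the module name and the $2/3$ width scaling follow the convention described above and are illustrative, not any specific model's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Sketch of a SwiGLU feedforward block in the style of Shazeer (2020)."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        # Shrink the gated width by 2/3 so the three-matrix block roughly
        # matches the parameters/FLOPs of a plain FFN with hidden width d_ff.
        hidden = int(2 * d_ff / 3)
        self.w_gate = nn.Linear(d_model, hidden, bias=False)  # gate branch
        self.w_up = nn.Linear(d_model, hidden, bias=False)    # value branch
        self.w_down = nn.Linear(hidden, d_model, bias=False)  # projection back

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # FFN(x) = W2( SiLU(W_g x) * (W_1 x) )
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

# Usage sketch:
ffn = SwiGLUFFN(d_model=512, d_ff=2048)
print(ffn(torch.randn(4, 512)).shape)  # torch.Size([4, 512])
```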
Main Theorems
Nonlinearity Is Necessary for Universal Approximation
Statement
If $\phi(x) = x$ (the identity function), then a feedforward network of any depth computes an affine function $f(x) = Ax + c$ for some matrix $A$ and vector $c$. The network cannot approximate arbitrary continuous functions.
Intuition
Each layer computes $x \mapsto W^{(\ell)} x + b^{(\ell)}$. Composing such maps gives $x \mapsto \bigl(W^{(L)} \cdots W^{(1)}\bigr)x + c$ for an accumulated bias vector $c$, which is a single affine map. Depth without nonlinearity adds no representational power.
Proof Sketch
By induction on the depth $L$. Base case: a single affine layer is affine. Inductive step: if $f_{L-1}(x) = Ax + c$, then $f_L(x) = W^{(L)}(Ax + c) + b^{(L)} = \bigl(W^{(L)}A\bigr)x + \bigl(W^{(L)}c + b^{(L)}\bigr)$, which is affine.
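The collapse is easy to verify numerically; a short NumPy sketch with biases omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 4)) for _ in range(10)]
x = rng.normal(size=4)

# Ten "layers" with identity activation...
deep = x
for W in Ws:
    deep = W @ deep

# ...equal a single matrix applied once: A = W10 @ ... @ W1.
A = np.linalg.multi_dot(Ws[::-1])
print(np.allclose(deep, A @ x))  # True
```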
Why It Matters
This result explains why activation functions exist at all. The universal approximation theorem (Cybenko 1989, Hornik 1991) shows that a single hidden layer with a non-polynomial activation can approximate any continuous function on a compact set. Without nonlinearity, no amount of depth helps.
Failure Mode
The result says nonlinearity is necessary but says nothing about which nonlinearity is best. All non-polynomial activations yield universal approximation in theory. The practical differences are about gradient flow and optimization, not expressiveness.
Gradient Flow Analysis
Sigmoid and Tanh: Vanishing Gradients
For a full treatment of gradient pathologies in deep networks, see gradient flow and vanishing gradients. The core mechanism is the product structure of the chain rule.
During backpropagation through $L$ layers, the gradient is a product of per-layer Jacobians, each containing the activation derivative and the weight matrix:

$$\frac{\partial \mathcal{L}}{\partial a^{(0)}} = \left[\prod_{\ell=1}^{L} \bigl(W^{(\ell)}\bigr)^{\top} \operatorname{diag}\bigl(\phi'(z^{(\ell)})\bigr)\right] \frac{\partial \mathcal{L}}{\partial a^{(L)}}$$
For sigmoid, $\sigma'(x) \le 1/4$ everywhere, so each layer's activation derivative contributes a factor of at most $1/4$. The full gradient also depends on the weight matrices $W^{(\ell)}$, which can amplify or further attenuate the signal. With $L$ layers and well-conditioned weights, the activation-derivative factor alone multiplies the gradient by at most $(1/4)^{L}$, dominating any reasonable amplification from the weights. Careful initialization (Xavier/Glorot) targets unit-variance signals but cannot rescue deep sigmoid networks because the $1/4$ cap is multiplicative.
Tanh is better: $\tanh'(0) = 1$, so near-zero pre-activations pass gradients without attenuation. But for large $|x|$, $\tanh'(x) \approx 0$ and the same saturation problem occurs.
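A short numerical comparison of the best case (all pre-activations at zero, activation derivatives only; the depth is chosen for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

depth = 20
z = 0.0                                  # best case for gradient flow
d_sig = sigmoid(z) * (1 - sigmoid(z))    # 0.25
d_tanh = 1.0 - np.tanh(z) ** 2           # 1.0

print(d_sig ** depth)   # ~9.1e-13: sigmoid attenuation from derivatives alone
print(d_tanh ** depth)  # 1.0: tanh passes gradients unattenuated at z = 0
```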
ReLU: Constant Gradients, But Dead Neurons
ReLU has derivative exactly $1$ for positive inputs. No saturation, no vanishing gradient for active neurons. This is why ReLU enabled training of deep networks (Krizhevsky et al. 2012).
The problem: if $z < 0$, the gradient is exactly $0$. A neuron whose pre-activation is negative for all training examples receives zero gradient updates and never recovers. This is the dying ReLU problem.
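A small sketch of the failure mode, with a deliberately pathological bias (all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))   # hypothetical training inputs
w = rng.normal(size=8)
b = -50.0                        # pathological bias: z < 0 for every example

z = X @ w + b
print((z > 0).any())             # False -> ReLU gradient is 0 on every example;
                                 # the neuron never updates and stays dead
print(np.where(z > 0, 1.0, 0.01).min())  # Leaky ReLU would still pass 0.01
```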
Leaky ReLU: No Dead Neurons
By setting the slope to $\alpha > 0$ for $x < 0$, Leaky ReLU ensures every neuron receives a nonzero gradient. In practice, $\alpha = 0.01$ is standard. Parametric ReLU (PReLU) learns $\alpha$ from data.
Gradient Flow Properties
The practical differences among activation functions come down to three measurable quantities: the magnitude of the derivative, the range of inputs that produce near-zero gradients, and how the gradient distribution evolves across layers during training.
Derivative magnitudes
| Activation | Maximum derivative | Dead zone |
|---|---|---|
| Sigmoid | $1/4$ at $x = 0$ | $x \ll 0$ or $x \gg 0$ |
| Tanh | $1$ at $x = 0$ | $\lvert x \rvert \gg 0$ |
| ReLU | $1$ for $x > 0$ | $x < 0$ (permanently) |
| Leaky ReLU ($\alpha = 0.01$) | $1$ for $x > 0$ | none ($\alpha$ for $x < 0$) |
| GELU | $\approx 1.13$ at $x \approx 1.41$ | asymptotically for $x \to -\infty$ |
| SiLU | $\approx 1.10$ at $x \approx 2.40$ | asymptotically for $x \to -\infty$ |
The sigmoid maximum of $1/4$ is the core problem for deep networks. Over $L$ layers, the activation-derivative factor in the gradient is bounded by $(1/4)^{L}$, and the full gradient additionally accumulates products of weight-matrix Jacobians. This is the vanishing gradient problem in quantitative form. Tanh has maximum derivative $1$, so gradients do not shrink from activation derivatives alone when pre-activations are near zero; the problem only appears for saturated neurons.
ReLU dying neurons: a permanent failure mode
A ReLU neuron "dies" when its pre-activation is negative for every training example. Once the neuron is dead, the gradient through it is zero, so the weights never update, and the neuron stays dead. This can affect a significant fraction of neurons depending on initialization and learning rate.
Batch normalization mitigates dying ReLUs by normalizing pre-activations, keeping most neurons in the active ($z > 0$) region. Weight initialization (He initialization, which scales weights by $\sqrt{2/n_{\text{in}}}$) is specifically designed for ReLU: it keeps the pre-activation variance stable across layers at initialization, keeping neurons alive at the start of training. These two interventions together largely solve the dying ReLU problem in practice, without requiring a switch to Leaky ReLU.
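A minimal sketch of why the $\sqrt{2/n_{\text{in}}}$ scaling matters, comparing He and Xavier scales through a stack of ReLU layers (sizes and depth are illustrative; biases omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 512
x = rng.normal(size=(4096, n))

for name, scale in [("He", np.sqrt(2.0 / n)),       # He: std = sqrt(2 / n_in)
                    ("Xavier", np.sqrt(1.0 / n))]:  # Glorot for a square layer
    a = x
    for _ in range(20):
        W = rng.normal(scale=scale, size=(n, n))
        a = np.maximum(a @ W, 0.0)                  # ReLU layer
    # He keeps the activation scale roughly constant; the Xavier signal
    # halves in second moment per layer and collapses toward zero.
    print(name, a.std())
```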
GELU and SiLU: smooth approximations that exceed 1
Both GELU and SiLU have peak derivatives slightly above $1$ (around $1.13$ and $1.10$ respectively). This is a feature, not a bug: the small gradient amplification helps offset attenuation from earlier layers. Both functions are smooth, which makes the gradient landscape more regular and better conditioned for second-order methods.
The GELU derivative is $\operatorname{GELU}'(x) = \Phi(x) + x\,\varphi(x)$, where $\varphi$ is the standard normal density. The SiLU derivative is $\operatorname{SiLU}'(x) = \sigma(x)\bigl(1 + x\,(1 - \sigma(x))\bigr)$. Neither has a closed-form saturation threshold; both decay gradually toward zero as $x \to -\infty$, so dying neurons are possible but far less abrupt than with ReLU.
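The peak values quoted above can be checked numerically, assuming SciPy for the normal CDF and density:

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-5.0, 5.0, 100001)
gelu_grad = norm.cdf(x) + x * norm.pdf(x)        # GELU'(x) = Phi(x) + x * phi(x)
s = 1.0 / (1.0 + np.exp(-x))
silu_grad = s * (1.0 + x * (1.0 - s))            # SiLU'(x) = sigma(x)(1 + x(1 - sigma(x)))

print(x[np.argmax(gelu_grad)], gelu_grad.max())  # ~1.41, ~1.13
print(x[np.argmax(silu_grad)], silu_grad.max())  # ~2.40, ~1.10
```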
Practical choice
For convolutional networks and residual architectures: ReLU remains a solid default, with Leaky ReLU or ELU when dead neurons are a concern. For transformers and large language models: GELU and SiLU are standard, with SwiGLU dominating recent open-weight models. The performance differences among GELU, SiLU, and SwiGLU are small; the choice is mostly convention within an architecture family.
Canonical Examples
Sigmoid saturation in a 5-layer network
Consider a network with sigmoid activations and all pre-activations initialized at $z = 0$. Then $\sigma'(0) = 1/4$. The gradient through 5 layers is attenuated by a factor of $(1/4)^5 = 1/1024 \approx 10^{-3}$. Training will stall in the early layers.
With ReLU and positive pre-activations, the gradient factor is $1^5 = 1$. No attenuation.
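The example can be verified with autograd; the sketch below (assuming PyTorch) shifts inputs by $0.5$ at each layer so that every pre-activation is exactly zero, isolating the activation-derivative factor:

```python
import torch

# Five sigmoid layers with pre-activations pinned at 0 (the best case,
# where sigma'(0) = 1/4). Starting at x = 0.5, each h - 0.5 is exactly 0.
x = torch.tensor([0.5], requires_grad=True)
h = x
for _ in range(5):
    h = torch.sigmoid(h - 0.5)
h.backward()
print(x.grad)  # tensor([0.0010]) = (1/4)**5 = 1/1024
```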
Common Confusions
ReLU is not differentiable at zero, but this does not matter in practice
ReLU has a kink at $x = 0$ where the derivative is undefined. In practice, the probability that any pre-activation is exactly zero is zero (under continuous distributions). Implementations set $\operatorname{ReLU}'(0) = 0$ by convention. This has no measurable effect on training.
All common activations yield universal approximation
The universal approximation theorem holds for any non-polynomial continuous activation. ReLU, sigmoid, tanh, GELU, and SiLU all have this property. The choice of activation does not affect what functions a network can represent in the infinite-width limit. It affects what functions gradient-based optimization will find in finite time with finite width.
GELU is not just smooth ReLU
GELU is sometimes described as a smooth approximation to ReLU. This is misleading. GELU weights each input by its probability under a standard Gaussian: $\operatorname{GELU}(x) = x\,\Phi(x)$, where $\Phi(x) = P(Z \le x)$ for $Z \sim \mathcal{N}(0, 1)$. For large positive $x$, $\Phi(x) \to 1$ and $\operatorname{GELU}(x) \approx x$. For large negative $x$, $\Phi(x) \to 0$ and $\operatorname{GELU}(x) \approx 0$. The transition is smooth and data-dependent, unlike the hard threshold of ReLU.
Summary
- Without nonlinear activations, depth is useless: the network computes an affine map
- Sigmoid and tanh saturate for large $\lvert x \rvert$, causing vanishing gradients in deep networks
- ReLU solved vanishing gradients for positive inputs but introduced dead neurons
- Leaky ReLU, GELU, and SiLU avoid dead neurons while maintaining good gradient flow
- Activation choice affects optimization dynamics, not theoretical expressiveness
- Modern transformers use GELU, SiLU, or SwiGLU; SwiGLU is the FFN nonlinearity in LLaMA, Mistral, Gemma, and PaLM
Exercises
Problem
Compute $\sigma'(x)$ for the sigmoid function and show that $\sigma'(x) = \sigma(x)\,(1 - \sigma(x)) \le 1/4$.
Problem
A 10-layer network uses sigmoid activations. Assume all pre-activations are at $z = 0$ (best case for gradient flow). What is the maximum gradient attenuation factor through all 10 layers from activation derivatives alone?
Problem
Show that for any $\alpha > 0$, Leaky ReLU with slope $\alpha$ for negative inputs has no dead neurons. Precisely: show that for any weight configuration, $\lvert \phi'(z) \rvert \ge \min(\alpha, 1) > 0$ for all $z \neq 0$.
References
Canonical:
- Goodfellow, Bengio, Courville, Deep Learning (2016), Chapter 6.3 (activation functions and gradient flow)
- Cybenko, "Approximation by Superpositions of a Sigmoidal Function" (1989)
- Hornik, "Approximation Capabilities of Multilayer Feedforward Networks" (1991)
- Krizhevsky, Sutskever, Hinton, "ImageNet Classification with Deep CNNs" (NeurIPS 2012) — large-scale ReLU validation
Current:
- Hendrycks & Gimpel, "Gaussian Error Linear Units (GELUs)" (2016), arXiv:1606.08415
- Ramachandran, Zoph, Le, "Searching for Activation Functions" (2017) (Swish/SiLU), arXiv:1710.05941
- Shazeer, "GLU Variants Improve Transformer" (2020), arXiv:2002.05202 (SwiGLU)
- Nair & Hinton, "Rectified Linear Units Improve Restricted Boltzmann Machines" (2010)
- He et al., "Delving Deep into Rectifiers" (ICCV 2015) — PReLU and He initialization for ReLU networks
Last reviewed: April 26, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Differentiation in Rⁿ · layer 0 · tier 1
- Convex Optimization Basics · layer 1 · tier 1
Derived topics
- Batch Normalization · layer 2 · tier 1
- Feedforward Networks and Backpropagation · layer 2 · tier 1
- Weight Initialization · layer 2 · tier 1
- Kolmogorov-Arnold Networks (KANs) · layer 4 · tier 2