Foundations
Continuity in Rⁿ
Epsilon-delta continuity, uniform continuity, and Lipschitz continuity in Euclidean space. Lipschitz constants control how fast function values change and appear throughout optimization and generalization theory.
Why This Matters
Continuity is the minimal regularity condition you usually want on a loss function: without it, small changes in parameters need not produce small changes in predictions, and the extreme value theorem — which guarantees a minimizer on a nonempty compact set — does not apply. Continuity does not, by itself, make gradient methods well-behaved: gradient descent additionally needs differentiability or subgradient structure, smoothness or step-size control, some convexity / stationarity assumption, lower boundedness, and frequently coercivity or compactness of sublevel sets. So treat continuity as the entry-level regularity that makes existence-of-minimizers arguments possible, not as a sufficient condition for any particular optimizer to converge.

Lipschitz continuity is the variant that appears most in ML theory. Generalization bounds for neural networks often depend on the Lipschitz constant of the network. Wasserstein GANs enforce a Lipschitz constraint explicitly.
The Greek letters $\varepsilon$ and $\delta$ are the canonical pair in every continuity definition below.
Core Definitions
Continuity at a Point
A function $f: \mathbb{R}^n \to \mathbb{R}^m$ (where $\mathbb{R}^n$ carries the standard Euclidean metric) is continuous at $x_0$ if and only if for every $\varepsilon > 0$, there exists $\delta > 0$ such that:
$$\|x - x_0\| < \delta \implies \|f(x) - f(x_0)\| < \varepsilon.$$
The choice of $\delta$ can depend on both $\varepsilon$ and the point $x_0$.
Uniform Continuity
$f$ is uniformly continuous on a set $S \subseteq \mathbb{R}^n$ if and only if for every $\varepsilon > 0$, there exists $\delta > 0$ such that for all $x, y \in S$:
$$\|x - y\| < \delta \implies \|f(x) - f(y)\| < \varepsilon.$$
The $\delta$ depends only on $\varepsilon$, not on the specific points $x, y$.
Lipschitz Continuity
$f$ is $L$-Lipschitz on a set $S \subseteq \mathbb{R}^n$ if and only if there exists $L \geq 0$ such that for all $x, y \in S$:
$$\|f(x) - f(y)\| \leq L \, \|x - y\|.$$
The smallest such $L$ is the Lipschitz constant of $f$ on $S$.
The hierarchy is strict: Lipschitz implies uniformly continuous implies continuous. The converses fail. The function $\sqrt{x}$ on $[0,1]$ is uniformly continuous but not Lipschitz (its derivative blows up at 0). The function $x^2$ on $\mathbb{R}$ is continuous but not uniformly continuous.
Composition and Algebraic Properties
Continuous functions compose: if $f$ is continuous at $x_0$ and $g$ is continuous at $f(x_0)$, then $g \circ f$ is continuous at $x_0$. Sums, products, and quotients (where the denominator is nonzero) of continuous functions are continuous.
For Lipschitz functions, the composition rule is quantitative. If $f$ is $L_1$-Lipschitz and $g$ is $L_2$-Lipschitz, then $g \circ f$ is $L_1 L_2$-Lipschitz. This multiplicative blowup is why deep networks can have large Lipschitz constants: each layer multiplies.
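As a quick numerical illustration (a minimal NumPy sketch; the matrix shapes and random seed are arbitrary choices, not from the text), the Lipschitz constant of a composition of two linear maps never exceeds the product of their operator norms:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 10))   # first linear map, L1 = ||A||_2
B = rng.standard_normal((3, 5))    # second linear map, L2 = ||B||_2

L1 = np.linalg.norm(A, 2)          # operator norm = largest singular value
L2 = np.linalg.norm(B, 2)
L_comp = np.linalg.norm(B @ A, 2)  # Lipschitz constant of the composition x -> B(Ax)

print(f"L1*L2 = {L1 * L2:.3f}, Lip(B o A) = {L_comp:.3f}")
assert L_comp <= L1 * L2 + 1e-9    # multiplicative bound from the composition rule
```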
Main Theorems
Extreme Value Theorem
Statement
If $f: K \to \mathbb{R}$ is continuous and $K \subseteq \mathbb{R}^n$ is compact and nonempty, then $f$ attains its maximum and minimum on $K$. That is, there exist $x_{\min}, x_{\max} \in K$ such that:
$$f(x_{\min}) \leq f(x) \leq f(x_{\max}) \quad \text{for all } x \in K.$$
Intuition
A continuous function on a closed and bounded set cannot "escape to infinity" or "approach but never reach" its supremum. Compactness traps sequences and continuity preserves limits.
Proof Sketch
Since $f$ is continuous and $K$ is compact, $f(K)$ is compact in $\mathbb{R}$ (the continuous image of a compact set is compact). A compact subset of $\mathbb{R}$ is closed and bounded, so it contains its supremum and infimum.
Why It Matters
This theorem guarantees that optimization problems over compact sets have solutions. When you minimize a continuous loss over a closed and bounded parameter space, a minimizer exists. Without compactness, minimizers may not exist: $\inf_{x \in \mathbb{R}} e^{-x} = 0$, but no $x$ achieves it.
Failure Mode
Fails without compactness. On the open interval $(0,1)$, the function $f(x) = x$ is continuous but has no maximum. Fails without continuity: an indicator function attains its max despite being discontinuous, but a discontinuous function in general need not (for example, $f(x) = x$ for $x \in [0,1)$ with $f(1) = 0$ has supremum $1$ on $[0,1]$ but never attains it).
Heine-Cantor Theorem
Statement
If $f: K \to \mathbb{R}^m$ is continuous and $K \subseteq \mathbb{R}^n$ is compact, then $f$ is uniformly continuous on $K$.
Intuition
On a compact set, continuity cannot degrade from point to point. The worst-case $\delta$ over all points in $K$ is still positive because $K$ has no "escape to infinity" where the modulus of continuity might shrink to zero.
Proof Sketch
Suppose $f$ is not uniformly continuous. Then there exist $\varepsilon > 0$ and sequences $x_k, y_k \in K$ with $\|x_k - y_k\| \to 0$ but $\|f(x_k) - f(y_k)\| \geq \varepsilon$. By compactness, extract a convergent subsequence $x_{k_j} \to x^*$. Then $y_{k_j} \to x^*$ as well. By continuity at $x^*$, $\|f(x_{k_j}) - f(y_{k_j})\| \to 0$, contradicting $\|f(x_{k_j}) - f(y_{k_j})\| \geq \varepsilon$.
Why It Matters
This is why bounded parameter spaces simplify analysis. A continuous loss function on a compact parameter set is automatically uniformly continuous, which makes approximation arguments (like discretizing the parameter space) valid.
Failure Mode
Fails on non-compact domains. $f(x) = x^2$ is continuous on $\mathbb{R}$ but not uniformly continuous: for large $x$, a small change in $x$ produces a large change in $x^2$.
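A minimal numeric sketch of this failure (the fixed $\delta$ and the sample points are illustrative assumptions): for a fixed $\delta$, the change of $x^2$ across an interval of width $\delta$ grows without bound as $x$ grows, so no single $\delta$ can serve every point for a given $\varepsilon$.

```python
import numpy as np

delta = 0.01                       # candidate delta, fixed in advance
xs = np.array([1.0, 10.0, 100.0, 1000.0])
jumps = (xs + delta) ** 2 - xs**2  # |f(x + delta) - f(x)| for f(x) = x^2

for x, jump in zip(xs, jumps):
    print(f"x = {x:7.1f}: |f(x+delta) - f(x)| = {jump:.4f}")
# the jump ~ 2*x*delta grows linearly in x, so uniform continuity fails on all of R
```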
Examples
Why $\sqrt{x}$ is uniformly continuous on $[0,1]$ but not Lipschitz
On $[0,1]$, $\sqrt{x}$ is continuous and the domain is compact, so by Heine–Cantor it is uniformly continuous. To see it is not Lipschitz, note that the derivative $\frac{1}{2\sqrt{x}}$ blows up as $x \to 0^+$. Take $x = 1/n^2$ and $y = 0$: $|x - y| = 1/n^2$ but $|\sqrt{x} - \sqrt{y}| = 1/n$, so the ratio equals $n$ and grows without bound. No single $L$ works.
The modulus of continuity is $|\sqrt{x} - \sqrt{y}| \leq \sqrt{|x - y|}$. This is Hölder continuity of exponent $1/2$, a strictly weaker condition than Lipschitz (Hölder with exponent 1).
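The following sketch (NumPy; the sample points are chosen for illustration) checks both claims numerically with $y = 0$: the Lipschitz ratio $|\sqrt{x} - \sqrt{0}|/|x - 0|$ blows up near $0$, while the Hölder ratio $|\sqrt{x} - \sqrt{0}|/\sqrt{|x - 0|}$ stays at $1$.

```python
import numpy as np

xs = np.array([1e-2, 1e-4, 1e-6, 1e-8])   # points approaching 0 inside [0, 1]
lipschitz_ratio = np.sqrt(xs) / xs         # |sqrt(x) - sqrt(0)| / |x - 0|        -> unbounded
holder_ratio = np.sqrt(xs) / np.sqrt(xs)   # |sqrt(x) - sqrt(0)| / |x - 0|^{1/2}  -> stays 1

for x, lr, hr in zip(xs, lipschitz_ratio, holder_ratio):
    print(f"x = {x:.0e}: Lipschitz ratio = {lr:10.1f}, Holder ratio = {hr:.1f}")
```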
Lipschitz constant of a linear layer
Let $f(x) = Wx + b$ with $W \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$. For any $x, y$, $\|f(x) - f(y)\| = \|W(x - y)\| \leq \|W\|_2 \, \|x - y\|$, where $\|W\|_2$ is the operator norm (largest singular value). The Lipschitz constant of an affine map is exactly its operator norm; the constant $b$ contributes nothing because it cancels in the difference.
For a deep network $f = f_k \circ \cdots \circ f_1$ with each $f_i$ a linear layer $W_i$ followed by a 1-Lipschitz activation (ReLU, tanh, and sigmoid are all 1-Lipschitz), the composition rule gives $\mathrm{Lip}(f) \leq \prod_{i=1}^{k} \|W_i\|_2$. Spectral normalization enforces $\|W_i\|_2 \leq 1$ to keep the end-to-end Lipschitz constant bounded; this is one mechanism used to enforce the Lipschitz constraint in Wasserstein GAN training.
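A minimal sketch of the mechanism (NumPy; the layer sizes, the number of power-iteration steps, and the target norm of 1 are illustrative assumptions, not a reproduction of any particular implementation): estimate each layer's spectral norm by power iteration, divide the weights by it, and the product bound on the end-to-end Lipschitz constant drops to roughly 1.

```python
import numpy as np

def spectral_norm(W, n_iters=50):
    """Estimate the largest singular value of W by power iteration."""
    u = np.random.default_rng(0).standard_normal(W.shape[1])
    for _ in range(n_iters):
        v = W @ u
        u = W.T @ v
        u /= np.linalg.norm(u)
    return np.linalg.norm(W @ u)

rng = np.random.default_rng(1)
weights = [rng.standard_normal((64, 32)), rng.standard_normal((32, 64))]

# Product of per-layer spectral norms upper-bounds the network's Lipschitz
# constant when the activations between layers are 1-Lipschitz.
raw_bound = np.prod([spectral_norm(W) for W in weights])

# Spectral normalization: rescale each layer to spectral norm ~1.
normalized = [W / spectral_norm(W) for W in weights]
norm_bound = np.prod([spectral_norm(W) for W in normalized])

print(f"before normalization: Lipschitz bound <= {raw_bound:.2f}")
print(f"after  normalization: Lipschitz bound <= {norm_bound:.2f}")   # ~1.0
```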
Why Lipschitz Constants Matter in ML
Three places where the Lipschitz constant directly controls a quantity of interest:
- Gradient descent step size. For an $L$-smooth function (one whose gradient is $L$-Lipschitz), gradient descent with step size $\eta \leq 1/L$ is monotone non-increasing in the loss. Step sizes above $2/L$ can diverge even on convex problems; see the sketch after this list. See SGD convergence.
- Generalization bounds. Many uniform-convergence bounds for neural networks scale with the product of layer-wise Lipschitz constants (operator norms of weight matrices). This connects directly to the Rademacher complexity of the function class: smaller Lipschitz constants give tighter generalization bounds.
- Robustness. A network with Lipschitz constant $L$ has the property that an input perturbation of size $\varepsilon$ changes the output by at most $L\varepsilon$. Adversarial examples exploit large local Lipschitz constants; certified robustness methods enforce small ones explicitly.
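The first point can be seen on a one-dimensional quadratic (a minimal sketch; the function $f(x) = \tfrac{L}{2}x^2$ with $L = 4$ and the step sizes are illustrative choices): with step size $1/L$ the iterates contract, while a step size above $2/L$ makes them diverge.

```python
# Gradient descent on f(x) = (L/2) * x^2, whose gradient L*x is L-Lipschitz.
L = 4.0

def run_gd(step, x0=1.0, n_steps=20):
    x = x0
    for _ in range(n_steps):
        x -= step * L * x          # gradient step: x <- x - step * f'(x)
    return x

print(f"step = 1/L   -> x_20 = {run_gd(1.0 / L):.3e}")   # collapses to 0
print(f"step = 2.5/L -> x_20 = {run_gd(2.5 / L):.3e}")   # |1 - 2.5|^20 blows up
```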
The price of small Lipschitz constants is expressiveness. Universal approximation requires unbounded Lipschitz constants in general; trading off the Lipschitz constant against approximation error is one of the central design tensions in modern deep learning theory.
Common Confusions
Lipschitz constant depends on the norm
The Lipschitz constant of a function depends on which norm you use. A function that is 1-Lipschitz in the $\ell_2$ norm may have a different Lipschitz constant in the $\ell_\infty$ norm. In ML, the $\ell_2$ norm is the default unless stated otherwise.
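A small sketch of the norm dependence (the map and the pair of norms are illustrative assumptions): for the linear map $f(x) = x_1 + x_2$ on $\mathbb{R}^2$, the Lipschitz constant is $\sqrt{2}$ with respect to the $\ell_2$ input norm but $2$ with respect to the $\ell_\infty$ input norm.

```python
import numpy as np

w = np.array([1.0, 1.0])           # f(x) = w . x = x1 + x2

# For a linear functional, the Lipschitz constant w.r.t. an input norm is the
# dual norm of w: l2 input norm -> ||w||_2, l_inf input norm -> ||w||_1.
lip_l2 = np.linalg.norm(w, 2)      # sqrt(2)
lip_linf = np.linalg.norm(w, 1)    # 2

# Spot-check with the worst-case unit directions for each norm.
u2 = w / np.linalg.norm(w, 2)      # unit l2 vector maximizing |f|
uinf = np.sign(w)                  # unit l_inf vector maximizing |f|
print(lip_l2, abs(w @ u2))         # 1.414..., 1.414...
print(lip_linf, abs(w @ uinf))     # 2.0, 2.0
```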
Differentiable does not imply Lipschitz
A function can be differentiable everywhere without being Lipschitz. $f(x) = x^2$ on $\mathbb{R}$ is smooth but not Lipschitz because its derivative $2x$ is unbounded.
The right "bounded derivative gives Lipschitz" statement also depends on the codomain and the domain geometry:
- Scalar-valued on a convex domain. If $\|\nabla f(x)\| \leq L$ for all $x$, then $f$ is $L$-Lipschitz (mean value theorem along line segments).
- Vector-valued on a convex domain. The relevant object is the operator norm of the Jacobian: if $\|J_f(x)\|_2 \leq L$ for all $x$, then $f$ is $L$-Lipschitz. Bounding individual partials is not enough; you need the operator-norm bound (see the sketch after this list).
- Non-convex domain. Both statements can fail: even with a bounded gradient, a "U"-shaped domain can force a long path between two nearby points and make the global Lipschitz constant much larger than the local gradient bound.
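A quick sketch of the second point (the dimension and the map are illustrative assumptions): a linear map whose partial derivatives are all bounded by 1 can still have a Jacobian operator norm, hence Lipschitz constant, as large as the dimension.

```python
import numpy as np

n = 100
J = np.ones((n, n))                        # Jacobian of f(x) = (sum(x), ..., sum(x))

entrywise_bound = np.max(np.abs(J))        # every partial derivative equals 1
operator_norm = np.linalg.norm(J, 2)       # largest singular value = n = 100

print(f"max |dF_i/dx_j| = {entrywise_bound}, ||J||_2 = {operator_norm:.1f}")
# The l2 Lipschitz constant is the operator norm, n, not the entrywise
# bound, 1: bounding individual partials is not enough.
```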
Exercises
Problem
Prove that $f(x) = \|x\|$ is 1-Lipschitz on $\mathbb{R}^n$.
Problem
Let $f: \mathbb{R}^n \to \mathbb{R}$ be differentiable with $\|\nabla f(x)\| \leq L$ for all $x$. Prove that $f$ is $L$-Lipschitz.
References
Canonical:
- Rudin, Principles of Mathematical Analysis (1976), Chapters 4 and 7
- Apostol, Mathematical Analysis (1974), Chapter 4
- Folland, Real Analysis (1999), Chapter 4 (continuity and topology)
Current:
- Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Section 26.1 (Lipschitz conditions in generalization)
- Vershynin, High-Dimensional Probability (2018), Section 5.2.2 (Lipschitz functions and concentration)
- Deisenroth, Faisal, Ong, Mathematics for Machine Learning (2020), Section 5.1 (continuity in the context of differentiation)
- Arjovsky, Chintala, Bottou, "Wasserstein GAN" (2017), arXiv:1701.07875 (Lipschitz constraint enforced on the critic)
- Miyato et al., "Spectral Normalization for Generative Adversarial Networks" (2018), arXiv:1802.05957 (constraining per-layer Lipschitz constants via spectral norm)
- Virmaux & Scaman, "Lipschitz Regularity of Deep Neural Networks" (2018), NeurIPS (bounding end-to-end Lipschitz constants of deep networks)
Last reviewed: April 26, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Metric Spaces, Convergence, and Completeness (layer 0A · tier 1)
Derived topics
- Differentiation in Rⁿ (layer 0A · tier 1)
- Taylor Expansion (layer 0A · tier 1)
- Classical ODEs: Existence, Stability, and Numerical Methods (layer 1 · tier 1)
- Convex Optimization Basics (layer 1 · tier 1)