
ML Methods

Loss Functions Catalog

A systematic catalog of loss functions for classification, regression, and representation learning: cross-entropy, MSE, hinge, Huber, focal, KL divergence, and contrastive loss.

Core · Tier 1 · Stable · Supporting · ~50 min


Why This Matters

The loss function defines what "good" means for your model. Two models with identical architectures trained with different loss functions will learn different things. In many practical settings, switching the loss function improves performance more than switching the architecture. The choice of loss encodes your assumptions about the problem: noise distribution, outlier sensitivity, class balance, and what errors cost. See cross-entropy loss deep dive for a detailed treatment of the most common classification loss.

From the convex optimization perspective, convexity of the loss in the parameters matters: it determines whether local minima are global, and whether subgradient methods converge. Most standard loss functions (cross-entropy, MSE, Huber, hinge) are convex in the predictions, but the composed loss through a neural network is non-convex in the weights.

Mental Model

A loss function \ell(\hat{y}, y) measures the cost of predicting \hat{y} when the truth is y. Different losses penalize different types of errors. MSE penalizes large errors quadratically, making it the natural choice for linear regression under Gaussian noise. MAE penalizes all errors linearly (robust to outliers). Cross-entropy penalizes confident wrong classification predictions severely. The right loss depends on which errors matter in your application.

Classification Losses

Definition

Cross-Entropy Loss

For a classification problem with K classes, the cross-entropy loss for a single example with true label y (one-hot encoded) and predicted probabilities p is:

L_{\text{CE}} = -\sum_{k=1}^{K} y_k \log p_k

For binary classification with y \in \{0, 1\} and predicted probability p:

L_{\text{BCE}} = -y \log p - (1 - y) \log(1 - p)

Cross-entropy has a critical property: as p_k \to 0 for the true class, the loss goes to infinity. This severe penalty for confident wrong predictions drives the model to assign high probability to the correct class.
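The two formulas above can be checked numerically. A minimal NumPy sketch (illustrative; the eps clipping constant is an assumption added to guard against log(0)):

```python
import numpy as np

def cross_entropy(y_onehot, p, eps=1e-12):
    """Multi-class cross-entropy for one example: -sum_k y_k * log(p_k)."""
    p = np.clip(p, eps, 1.0)  # guard against log(0)
    return -np.sum(y_onehot * np.log(p))

y = np.array([0.0, 0.0, 1.0])                         # true class is index 2
print(cross_entropy(y, np.array([0.05, 0.05, 0.9])))  # ~0.105: confident and correct
print(cross_entropy(y, np.array([0.90, 0.09, 0.01]))) # ~4.605: confident and wrong
```

The loss explodes as the true-class probability approaches zero, exactly as described above.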

Definition

Hinge Loss

For binary classification with y \in \{-1, +1\} and raw prediction f(x) \in \mathbb{R}:

L_{\text{hinge}} = \max(0, 1 - y \cdot f(x))

The loss is zero when y \cdot f(x) \geq 1 (a correct prediction with margin at least 1). This is the loss used by support vector machines.

Hinge loss does not require probability outputs and is not differentiable at y \cdot f(x) = 1. In practice, subgradient methods handle the non-differentiability.
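A small sketch of the margin behavior (illustrative values, not from the source):

```python
import numpy as np

def hinge_loss(y, f):
    """Hinge loss max(0, 1 - y*f) for label y in {-1, +1} and raw score f."""
    return np.maximum(0.0, 1.0 - y * f)

print(hinge_loss(+1, 2.0))   # 0.0: correct with margin >= 1, no penalty
print(hinge_loss(+1, 0.5))   # 0.5: correct but inside the margin
print(hinge_loss(+1, -1.0))  # 2.0: wrong side of the boundary
```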

Definition

Focal Loss

For binary classification with true-class probability p_t (the model's predicted probability for the true class):

L_{\text{focal}} = -(1 - p_t)^\gamma \log p_t

where \gamma \geq 0 is a focusing parameter. When \gamma = 0, this reduces to cross-entropy.

Focal loss down-weights easy examples (where p_t is high). For \gamma = 2, an example with p_t = 0.9 gets weight (0.1)^2 = 0.01, while an example with p_t = 0.1 gets weight (0.9)^2 = 0.81. This concentrates learning on hard examples, which is critical for class-imbalanced problems like object detection, where 99%+ of candidates are background.
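The down-weighting effect can be tabulated directly (a sketch; the values follow the formula above):

```python
import numpy as np

def focal_loss(p_t, gamma):
    """Focal loss -(1 - p_t)^gamma * log(p_t), with p_t the true-class probability."""
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

for p_t in (0.9, 0.5, 0.1):
    # gamma=0 recovers cross-entropy; gamma=2 shrinks the easy example's loss ~100x
    print(p_t, focal_loss(p_t, 0.0), focal_loss(p_t, 2.0))
```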

Regression Losses

Definition

Mean Squared Error

For a regression problem with prediction \hat{y} and target y:

L_{\text{MSE}} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2

MSE is the maximum likelihood estimator under a Gaussian noise model: y = f(x) + \epsilon with \epsilon \sim \mathcal{N}(0, \sigma^2).

Definition

Huber Loss

For a threshold \delta > 0:

L_\delta(r) = \begin{cases} \frac{1}{2}r^2 & \text{if } |r| \leq \delta \\ \delta|r| - \frac{1}{2}\delta^2 & \text{if } |r| > \delta \end{cases}

where r = y - \hat{y} is the residual. Huber loss is quadratic for small errors and linear for large errors.

Huber loss combines the benefits of MSE (smooth, statistically efficient under Gaussian noise) and MAE (robust to outliers). The parameter \delta controls the transition: when \delta is large, Huber approaches MSE; when \delta is small, it approaches a scaled MAE.
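A sketch comparing Huber against a squared-error penalty on the same residuals (0.5*r^2 is used so the quadratic branches match; the values are illustrative):

```python
import numpy as np

def huber(r, delta=1.0):
    """Huber loss: quadratic for |r| <= delta, linear beyond."""
    r = np.asarray(r, dtype=float)
    return np.where(np.abs(r) <= delta,
                    0.5 * r**2,
                    delta * np.abs(r) - 0.5 * delta**2)

residuals = np.array([0.5, 1.0, 10.0])
print(huber(residuals, delta=1.0))  # 0.125, 0.5, 9.5
print(0.5 * residuals**2)           # 0.125, 0.5, 50.0
```

The outlier at r = 10 contributes 9.5 under Huber but 50 under the squared penalty: the small-residual branch is identical, only the tail changes.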

Divergence-Based Losses

Definition

KL Divergence Loss

The Kullback–Leibler divergence from distribution q to distribution p is:

D_{\text{KL}}(p \| q) = \sum_{k} p_k \log \frac{p_k}{q_k}

KL divergence is non-negative (D_{\text{KL}} \geq 0 by Gibbs' inequality) and equals zero if and only if p = q. It is not symmetric: D_{\text{KL}}(p \| q) \neq D_{\text{KL}}(q \| p) in general.

KL divergence is used in knowledge distillation (matching a student's output distribution to a teacher's), variational autoencoders (regularizing the latent distribution toward a prior), and reinforcement learning from human feedback (penalizing deviation from a reference policy).
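A sketch verifying non-negativity and asymmetry on small discrete distributions (the eps clip is an assumption added to avoid division by zero):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) = sum_k p_k * log(p_k / q_k) for discrete distributions."""
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
print(kl_divergence(p, q))  # ~0.085: positive
print(kl_divergence(q, p))  # ~0.092: a different value, KL is not symmetric
print(kl_divergence(p, p))  # ~0.0: zero iff the distributions match
```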

Definition

Contrastive Loss

For a pair of examples (x_i, x_j) with label y_{ij} \in \{0, 1\} indicating whether they are similar:

L_{\text{contrastive}} = y_{ij} \cdot d(x_i, x_j)^2 + (1 - y_{ij}) \cdot \max(0, m - d(x_i, x_j))^2

where d is a distance function and m is a margin. Similar pairs are pulled together; dissimilar pairs are pushed apart until they are at least distance m apart.
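A sketch with Euclidean distance as d and a hypothetical margin m = 1 (illustrative points, not from the source):

```python
import numpy as np

def contrastive_loss(x_i, x_j, similar, margin=1.0):
    """Pull similar pairs together (d^2); push dissimilar pairs out to the margin."""
    d = np.linalg.norm(x_i - x_j)
    return d**2 if similar else max(0.0, margin - d) ** 2

a, b = np.array([0.0]), np.array([0.5])        # distance 0.5
print(contrastive_loss(a, b, similar=True))    # 0.25: similar pair penalized by d^2
print(contrastive_loss(a, b, similar=False))   # 0.25: dissimilar pair inside the margin
far = np.array([5.0])                          # distance 5 > margin
print(contrastive_loss(a, far, similar=False)) # 0.0: already pushed past the margin
```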

Main Theorems

Theorem

Cross-Entropy Minimization Equals Maximum Likelihood

Statement

For a model parameterized by \theta that outputs class probabilities p_\theta(y|x), minimizing the average cross-entropy loss on a dataset \{(x_i, y_i)\}_{i=1}^n is equivalent to maximizing the average log-likelihood:

\arg\min_\theta \frac{1}{n}\sum_{i=1}^n L_{\text{CE}}(p_\theta(x_i), y_i) = \arg\max_\theta \frac{1}{n}\sum_{i=1}^n \log p_\theta(y_i | x_i)

Intuition

Cross-entropy loss for a one-hot target is just -\log p_\theta(y_i | x_i): the negative log-probability of the true class. Summing over examples gives the negative log-likelihood. Minimizing the negative is maximizing the positive.

Proof Sketch

Expand the cross-entropy: L_{\text{CE}} = -\sum_k y_k \log p_k. For a one-hot label where y_c = 1 and y_k = 0 for k \neq c, this simplifies to -\log p_c = -\log p_\theta(y_i | x_i). Sum over the dataset and negate.
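The collapse to -\log p_c can be confirmed numerically (a sketch using a 3-class example):

```python
import numpy as np

# For a one-hot label, the full sum -sum_k y_k log p_k keeps only the true-class term.
p = np.array([0.1, 0.2, 0.7])
y = np.array([0.0, 0.0, 1.0])   # true class c = 2
full = -np.sum(y * np.log(p))
collapsed = -np.log(p[2])
print(full, collapsed)          # both ~0.357: identical by construction
```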

Why It Matters

This equivalence connects two perspectives: the information-theoretic view (cross-entropy measures how many extra bits your model needs) and the statistical view (maximum likelihood is the optimal estimator under regularity conditions). It justifies using cross-entropy as the default classification loss.

Failure Mode

The equivalence holds only when the model outputs valid probability distributions (non-negative, sum to 1). If the model is miscalibrated (probabilities do not reflect true frequencies), cross-entropy still works for discrimination but the probabilistic interpretation breaks down.

Proposition

Huber Loss Bounded Influence Function

Statement

The influence function of the Huber loss estimator is bounded: for any observation y,

|\psi_\delta(y)| \leq \delta

where \psi_\delta(y) = \partial L_\delta / \partial \hat{y}. In contrast, the influence function of MSE is unbounded: |\psi_{\text{MSE}}(y)| = |y - \hat{y}|, which grows without limit.

Intuition

MSE's gradient is proportional to the residual, so a single outlier with residual 1000 exerts 1000x more influence than a typical point. Huber's gradient is capped at \delta, so no single point can dominate the gradient regardless of how far it is from the prediction.

Proof Sketch

The gradient of Huber loss is \psi_\delta(r) = r for |r| \leq \delta and \psi_\delta(r) = \delta \cdot \text{sign}(r) for |r| > \delta. The maximum absolute value is \delta, achieved for all |r| \geq \delta.
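The capped gradient is easy to see numerically (a sketch; the residual values are illustrative):

```python
import numpy as np

def huber_grad(r, delta=1.0):
    """psi_delta(r): equal to r inside the band, capped at +/- delta outside."""
    r = np.asarray(r, dtype=float)
    return np.where(np.abs(r) <= delta, r, delta * np.sign(r))

residuals = np.array([0.5, 2.0, 1000.0])
print(huber_grad(residuals, delta=1.0))  # 0.5, 1.0, 1.0: the outlier is capped at delta
print(residuals)                         # an MSE-style gradient grows with the residual
```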

Why It Matters

In real datasets, outliers are common (mislabeled examples, sensor errors, data entry mistakes). Huber loss provides a principled way to limit their influence without requiring explicit outlier removal. The parameter \delta controls the tradeoff: smaller \delta means more robustness but less statistical efficiency under Gaussian noise.

Failure Mode

Huber loss is robust to outliers in the target y, not in the input x. A leverage point (an outlier in input space) can still distort the fit. For robustness to both, you need methods from robust regression (e.g., M-estimators with bounded leverage).

Gradient Expressions

For gradient descent and backpropagation, what matters is the gradient of the loss with respect to the prediction \hat{y} (or the raw logit z before the activation).

Cross-entropy (binary, sigmoid output): let z be the pre-sigmoid logit and p = \sigma(z). Then:

\frac{\partial L_{\text{BCE}}}{\partial z} = p - y

The gradient is simply the prediction error. This clean form is why cross-entropy pairs naturally with sigmoid: the sigmoid's derivative cancels against the log's derivative.
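The p - y form can be verified with a central finite difference (a sketch; the test point z = 0.8 is an arbitrary choice):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_from_logit(z, y):
    """Binary cross-entropy expressed in terms of the pre-sigmoid logit z."""
    p = sigmoid(z)
    return -y * np.log(p) - (1.0 - y) * np.log(1.0 - p)

z, y, h = 0.8, 1.0, 1e-6
numeric = (bce_from_logit(z + h, y) - bce_from_logit(z - h, y)) / (2.0 * h)
analytic = sigmoid(z) - y
print(numeric, analytic)  # both ~ -0.310: the gradient is just the prediction error
```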

MSE:

\frac{\partial L_{\text{MSE}}}{\partial \hat{y}} = \frac{2}{n}(\hat{y} - y)

Gradient is proportional to the residual; large residuals dominate.

Huber loss:

\frac{\partial L_\delta}{\partial \hat{y}} = \begin{cases} \hat{y} - y & \text{if } |\hat{y} - y| \leq \delta \\ \delta \cdot \text{sign}(\hat{y} - y) & \text{if } |\hat{y} - y| > \delta \end{cases}

Gradient is capped at \pm\delta, so outliers exert bounded influence on the update.

MAE:

\frac{\partial L_{\text{MAE}}}{\partial \hat{y}} = \text{sign}(\hat{y} - y)

Gradient is constant magnitude regardless of residual size. This makes MAE robust but can cause slow convergence near the optimum (no gradient decay).

Hinge loss:

\frac{\partial L_{\text{hinge}}}{\partial f(x)} = \begin{cases} -y & \text{if } y \cdot f(x) < 1 \\ 0 & \text{if } y \cdot f(x) \geq 1 \end{cases}

Zero gradient when the margin condition is satisfied; the loss only responds to margin violations.

KL divergence (when q is being optimized to match p):

\frac{\partial D_{\text{KL}}(p \| q)}{\partial q_k} = -\frac{p_k}{q_k}

This is why minimizing forward KL is mass-covering (mean-seeking): the gradient blows up as q_k \to 0 in a region where p_k > 0, forcing q to cover all modes of p.

Loss and gradient side by side

Every row of the table below collapses two shapes into one name. The loss curve says how much a residual or margin costs; the gradient curve says what the optimizer will do about it. Varying \delta shows Huber interpolating between MSE (quadratic near zero) and MAE (linear in the tails).

Loss Function Comparison Table

| Loss | Convex? | Smooth? | Robust to outliers? | Probabilistic | Typical use |
| --- | --- | --- | --- | --- | --- |
| Cross-entropy | Yes | Yes | No | Yes | Classification |
| Binary cross-entropy | Yes | Yes | No | Yes | Binary classification |
| MSE | Yes | Yes | No | Yes (Gaussian) | Regression, linear models |
| MAE | Yes | No (at 0) | Yes | Yes (Laplace) | Regression, robust fitting |
| Huber | Yes | Yes | Yes (bounded) | Approximate | Regression with outliers |
| Hinge | Yes | No (at 1) | Partially | No | SVM, max-margin classifiers |
| Focal | No (in p_t) | Yes | Partially | Yes | Imbalanced classification |
| KL divergence | Yes (in q) | Yes | No | Yes | Distillation, VAEs, RLHF |
| Contrastive | No | Partially | Partially | No | Metric learning |

Notes on the table:

  • "Smooth" means differentiable everywhere. MAE and hinge have kink points requiring subgradients.
  • "Robust to outliers" means the influence function is bounded. MSE and cross-entropy have unbounded influence.
  • Focal loss is convex in the prediction logit z for some \gamma values but not all; treat it as non-convex in general.
  • "Probabilistic" means the loss corresponds to a negative log-likelihood for some distribution.

When-to-Use Decision Guide

Classification:

  • Start with cross-entropy. It is the MLE loss for the categorical distribution and pairs cleanly with softmax.
  • Switch to focal loss when positive examples are fewer than 1% of the data (object detection, rare event detection).
  • Use hinge loss only when you specifically want maximum-margin geometry (SVM-based models or structured prediction).

Regression:

  • Start with MSE when you expect Gaussian noise and have no clear outliers.
  • Switch to Huber when residuals have heavy tails or when mislabeled examples are present. Tune \delta to the scale of typical residuals.
  • Use MAE when the median (not mean) is the target quantity, or when the noise is Laplace-distributed.

Distribution matching / generative modeling:

  • Use KL divergence for knowledge distillation (forward KL: mean-seeking, covers all modes of the teacher) or for RLHF penalty terms.
  • Use reverse KL in variational inference (reverse KL: mode-seeking, latches onto one mode of the posterior).

Metric / representation learning:

  • Use contrastive loss or triplet loss for embedding-space problems where class boundaries are not pre-defined.
  • InfoNCE (used in CLIP, SimCLR) is a contrastive loss that has been shown to maximize a lower bound on mutual information.

Why Loss Choice Matters More Than Architecture

For a fixed architecture, the loss function determines what the model optimizes. Concrete examples:

  • Object detection with cross-entropy treats all misclassifications equally. With focal loss, the model focuses on hard negatives and achieves significantly higher mAP.
  • Regression with MSE on heavy-tailed data produces estimates pulled toward outliers. Switching to Huber or MAE can reduce test error by 20%+ without changing the model.
  • Knowledge distillation with hard labels (cross-entropy on argmax) loses information. Soft labels with KL divergence preserve the teacher's inter-class relationships.

Common Confusions

Watch Out

Cross-entropy and log loss are the same thing

In the binary case, cross-entropy loss and log loss (logistic loss) are identical: -y \log p - (1-y)\log(1-p). In the multi-class case, "log loss" typically refers to the same formula as multi-class cross-entropy. The terms are interchangeable.

Watch Out

KL divergence is not a distance

D_{\text{KL}}(p \| q) \neq D_{\text{KL}}(q \| p), and KL divergence does not satisfy the triangle inequality. It is a divergence, not a metric. The direction matters: D_{\text{KL}}(p \| q) penalizes places where p > 0 but q \approx 0 (mean-seeking when optimizing q, since q must cover all of p's mass), while D_{\text{KL}}(q \| p) does the reverse (mode-seeking).

Watch Out

Hinge loss does not produce probability estimates

Unlike cross-entropy, hinge loss does not require or produce probability outputs. An SVM's raw output f(x) is a signed distance from the decision boundary, not a probability. To get probabilities from an SVM, you need Platt scaling as a post-processing step.

Summary

  • Cross-entropy = negative log-likelihood for classification; the default choice
  • MSE assumes Gaussian noise; use Huber or MAE when outliers are present
  • Focal loss addresses class imbalance by down-weighting easy examples
  • Hinge loss creates maximum-margin classifiers (SVMs)
  • KL divergence measures distributional mismatch; critical for distillation and VAEs
  • Contrastive loss learns representations by comparing pairs
  • The choice of loss encodes assumptions about noise, class balance, and error costs

Exercises

ExerciseCore

Problem

Compute the cross-entropy loss for a 3-class problem where the true label is class 2 (zero-indexed) and the model predicts p = [0.1, 0.2, 0.7].

ExerciseCore

Problem

For Huber loss with \delta = 1, compute the loss for residuals r = 0.5, r = 1, and r = 10. Compare with MSE for the same residuals.

ExerciseAdvanced

Problem

Show that focal loss with \gamma = 0 reduces to cross-entropy, and explain why increasing \gamma concentrates the loss on hard examples. Compute the ratio of focal loss at p_t = 0.1 to focal loss at p_t = 0.9 for \gamma = 0 and \gamma = 2.

References

Canonical:

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapter 4.3: cross-entropy, logistic regression, softmax; Chapter 7.1: SVM and hinge loss derivation
  • Huber, "Robust Estimation of a Location Parameter" (1964), Annals of Mathematical Statistics: original Huber loss paper and the bounded-influence philosophy
  • Hastie, Tibshirani & Friedman, Elements of Statistical Learning (2009), Chapter 10: loss functions for boosting; Chapter 11: MSE, MAE, and their robustness properties

Current:

  • Lin et al., "Focal Loss for Dense Object Detection" (2017), ICCV. arXiv:1708.02002: introduces focal loss and the class-imbalance argument
  • Khosla et al., "Supervised Contrastive Learning" (2020), NeurIPS. arXiv:2004.11362: extends contrastive loss to supervised setting
  • Hadsell, Chopra & LeCun, "Dimensionality Reduction by Learning an Invariant Mapping" (2006), CVPR: original contrastive loss formulation
  • Kullback & Leibler, "On Information and Sufficiency" (1951): original KL divergence paper and its connection to maximum likelihood

Last reviewed: April 14, 2026
