

Multi-Class and Multi-Label Classification

Multi-class (exactly one label, softmax + cross-entropy) vs multi-label (multiple labels, sigmoid + binary cross-entropy). One-vs-rest, one-vs-one, hierarchical classification, and evaluation metrics.

Core · Tier 2 · Stable · Supporting · ~40 min


Why This Matters

Binary classification is the special case. Real problems often involve multiple classes (image recognition with 1000 categories) or multiple labels per example (a movie that is both "comedy" and "romance"). The choice of output activation, loss function, and evaluation metric must match the problem structure. Using softmax for a multi-label problem is a common and costly mistake.

Setup Checklist

| Problem shape | Output layer | Training loss | Inference rule | Typical metrics |
| --- | --- | --- | --- | --- |
| Exactly one label from $K$ classes | softmax | categorical cross-entropy | argmax | accuracy, macro/micro F1, top-$k$ accuracy |
| Zero, one, or many labels from $K$ labels | independent sigmoid per label | binary cross-entropy summed over labels | threshold or ranking rule per label | macro/micro F1, exact match, Hamming loss, average precision |

Multi-Class Classification

Definition

Multi-Class Classification

Each input $x$ has exactly one correct label $y \in \{1, \ldots, K\}$. The model outputs a probability distribution over $K$ classes: $p_k(x) = P(Y = k \mid X = x)$ with $\sum_{k=1}^K p_k(x) = 1$.

Softmax

The softmax function converts a vector of logits $z \in \mathbb{R}^K$ into a valid probability distribution:

$$p_k = \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}}$$

Properties: all outputs are positive, they sum to 1, and increasing one logit decreases all other probabilities. This mutual exclusivity is correct for multi-class but wrong for multi-label.
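These properties are easy to verify in code. The sketch below is a minimal numerically stable softmax (subtracting the maximum logit before exponentiating; the shift cancels in the ratio), with a boosted-logit example illustrating the mutual exclusivity:

```python
import numpy as np

def softmax(z):
    # Subtracting the max logit avoids overflow in exp; the shift
    # cancels in the ratio, so the probabilities are unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
p = softmax(logits)

# Raising one logit raises its probability and lowers every other one.
p_boosted = softmax(logits + np.array([1.0, 0.0, 0.0]))
```

Here `p` is positive and sums to 1, and `p_boosted` shows the competition between classes: boosting the first logit shrinks the probabilities of the other two.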

Loss: Categorical Cross-Entropy

$$\ell = -\log p_c$$

where $c$ is the true class. This is the negative log-likelihood under a categorical distribution.

Proposition

Softmax Cross-Entropy Gradient

Statement

The gradient of the categorical cross-entropy loss with respect to the logits $z_k$ has a remarkably simple form:

$$\frac{\partial \ell}{\partial z_k} = p_k - \mathbb{1}[k = c]$$

where $p_k = \text{softmax}(z)_k$ and $c$ is the true class index.

Intuition

The gradient is the difference between the predicted probability and the target (which is 1 for the true class, 0 for others). If the model assigns probability 0.8 to the correct class, the gradient for that logit is $0.8 - 1 = -0.2$, pushing it up. All other logits have gradients equal to their predicted probabilities (positive, pushing them down). This is the same "prediction minus target" form as in linear and logistic regression.

Proof Sketch

For $k = c$: $\partial \ell / \partial z_c = -\partial \log p_c / \partial z_c = -(1 - p_c) = p_c - 1$. For $k \neq c$: $\partial \ell / \partial z_k = -\partial \log p_c / \partial z_k = p_k$. Both cases are captured by $p_k - \mathbb{1}[k = c]$.
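The identity can be confirmed numerically with a finite-difference check. This is a minimal numpy sketch; the logits, class index, and tolerance are arbitrary choices for illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(z, c):
    # Categorical cross-entropy: negative log-probability of the true class.
    return -np.log(softmax(z)[c])

z = np.array([1.5, -0.3, 0.8])
c = 0

# Analytic gradient: p_k - 1[k = c]
analytic = softmax(z)
analytic[c] -= 1.0

# Central finite differences for comparison.
eps = 1e-6
numeric = np.array([
    (loss(z + eps * np.eye(3)[k], c) - loss(z - eps * np.eye(3)[k], c)) / (2 * eps)
    for k in range(3)
])
```

The true-class gradient is negative (pushing its logit up under gradient descent) and the others are positive, matching the intuition above.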

Why It Matters

This clean gradient is why softmax + cross-entropy is the universal choice for multi-class classification. The gradient never saturates (unlike MSE with sigmoid) and is trivially cheap to compute.

Failure Mode

When classes are hierarchical (e.g., "golden retriever" is a "dog" is an "animal"), flat softmax treats all misclassifications as equally bad. Predicting "labrador" when the answer is "golden retriever" gets the same loss as predicting "airplane". Hierarchical loss functions address this.

Multi-Label Classification

Definition

Multi-Label Classification

Each input $x$ has a label set $Y \subseteq \{1, \ldots, K\}$ that can contain zero, one, or multiple labels. The model outputs independent probabilities $p_k(x) = P(\text{label } k \text{ present} \mid X = x)$ for each label $k$. These probabilities do not need to sum to 1.

Sigmoid Per Label

Apply an independent sigmoid to each logit:

$$p_k = \sigma(z_k) = \frac{1}{1 + e^{-z_k}}$$

Each label's probability is computed independently. The model can output $p_1 = 0.9$ and $p_2 = 0.8$ simultaneously (both labels likely present).

Loss: Binary Cross-Entropy Per Label

$$\ell = -\frac{1}{K}\sum_{k=1}^K \left[ y_k \log p_k + (1 - y_k) \log(1 - p_k) \right]$$

This is the mean of $K$ independent binary classification losses; summing instead of averaging changes only the overall scale of the gradient.
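A minimal numpy sketch of the multi-label head, assuming three labels with the first two present (the logits are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def multilabel_bce(z, y):
    # Mean of K independent binary cross-entropy terms.
    p = sigmoid(z)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

z = np.array([2.0, 1.5, -3.0])  # one logit per label
y = np.array([1.0, 1.0, 0.0])   # labels 1 and 2 both present

p = sigmoid(z)
loss = multilabel_bce(z, y)
```

Unlike softmax, both of the first two probabilities can exceed 0.5 at the same time, and the probabilities do not sum to 1.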

Proposition

Multi-Label Independence Assumption

Statement

The standard multi-label loss with independent sigmoids assumes conditional independence of labels given the input:

$$P(Y_1, \ldots, Y_K \mid X = x) = \prod_{k=1}^K P(Y_k \mid X = x)$$

This factorization means the model cannot capture label correlations (e.g., "comedy" and "romance" co-occurring more often than "comedy" and "horror") without the input features encoding this information.

Intuition

Each label has its own binary classifier. These classifiers share the feature extractor (backbone network) but make independent final predictions. Label correlations can only be captured through shared features, not through the output layer.

Proof Sketch

The likelihood under independent Bernoulli outputs factorizes: $\prod_k p_k^{y_k}(1-p_k)^{1-y_k}$. Taking the negative log gives the sum of binary cross-entropies. This is the MLE objective under the independence assumption.
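The factorization can be checked numerically: the negative log of the Bernoulli product equals the sum of per-label binary cross-entropy terms. A small sketch with made-up probabilities:

```python
import numpy as np

p = np.array([0.9, 0.7, 0.2])  # hypothetical per-label probabilities
y = np.array([1, 1, 0])        # observed label set

# Negative log-likelihood of the product of independent Bernoullis.
nll = -np.log(np.prod(p ** y * (1 - p) ** (1 - y)))

# Sum of the K binary cross-entropy terms.
bce_sum = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
```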

Why It Matters

If label correlations are strong, the independence assumption is limiting. Methods like classifier chains, graph neural networks on the label space, or transformer decoders that attend to previously predicted labels can capture these dependencies.

Failure Mode

With many correlated labels and small datasets, the independence assumption can lead to incoherent predictions (e.g., predicting "pregnant" and "male" simultaneously). Joint modeling of labels helps, but at higher computational cost.

Reduction Strategies

One-vs-Rest (OvR)

Train $K$ binary classifiers, one per class. Classifier $k$ predicts "class $k$" vs "not class $k$". At inference, select the class with the highest confidence.

  • Pros: simple, parallelizable, works with any binary classifier
  • Cons: class imbalance (each binary classifier sees one class as positive and the other $K-1$ as negative), assumes independence between classifiers

One-vs-One (OvO)

Train $K(K-1)/2$ binary classifiers, one for each pair of classes. Classifier $(j, k)$ distinguishes class $j$ from class $k$. At inference, use majority voting.

  • Pros: each classifier sees balanced data, each is a simpler problem
  • Cons: $O(K^2)$ classifiers, expensive for large $K$

For neural networks, OvR and OvO are rarely used. A single network with a softmax output layer handles multi-class classification directly and shares features across classes.
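The two inference rules can be sketched with hypothetical classifier outputs. The per-class confidences and pairwise decisions below are made up purely for illustration:

```python
import numpy as np

# One-vs-rest: pick the class whose binary classifier is most confident.
ovr_scores = np.array([0.2, 0.9, 0.4])  # hypothetical per-class confidences
ovr_pred = int(np.argmax(ovr_scores))

# One-vs-one: majority vote over K(K-1)/2 pairwise classifiers.
# decide[(j, k)] is the winner of the classifier trained on classes j vs k.
decide = {(0, 1): 0, (0, 2): 2, (1, 2): 2}  # hypothetical decisions
votes = np.zeros(3, dtype=int)
for winner in decide.values():
    votes[winner] += 1
ovo_pred = int(np.argmax(votes))
```

Here OvR selects class 1, while OvO selects class 2 because it wins both of its pairwise contests.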

Hierarchical Classification

When classes have a tree structure (e.g., taxonomy), predictions should respect the hierarchy. If the model predicts "golden retriever", it implicitly predicts "dog" and "animal".

Approaches:

  • Flat classification with hierarchy-aware loss (penalize less for nearby misclassifications in the tree)
  • Top-down classification: predict coarse class first, then refine
  • Per-level classifiers: one softmax per level of the hierarchy
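The top-down approach can be sketched as a two-stage decoder. The taxonomy and probabilities below are hypothetical:

```python
# Hypothetical two-level taxonomy and model outputs for one input.
p_coarse = {"animal": 0.8, "vehicle": 0.2}
p_fine = {
    "animal": {"dog": 0.7, "cat": 0.3},
    "vehicle": {"car": 0.6, "plane": 0.4},
}

# Top-down: commit to the most likely coarse class, then refine within it.
coarse = max(p_coarse, key=p_coarse.get)
fine = max(p_fine[coarse], key=p_fine[coarse].get)
```

Note the trade-off: an early wrong coarse decision cannot be recovered at the fine level, but the predicted path is always consistent with the hierarchy.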

Evaluation Metrics

Definition

Macro, Micro, and Weighted Averaging

For per-class metrics (precision $P_k$, recall $R_k$, F1 $F_k$):

  • Macro average: $\frac{1}{K} \sum_{k=1}^K F_k$. Treats all classes equally.
  • Micro average: compute TP, FP, FN globally across all classes, then compute F1. Dominated by frequent classes.
  • Weighted average: $\sum_{k=1}^K w_k F_k$ where $w_k$ is the fraction of samples in class $k$. Accounts for class imbalance.

Use macro averaging when all classes matter equally (even rare ones). Use micro averaging when overall accuracy matters more than per-class performance. Use weighted averaging as a compromise.
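The macro/micro gap is easy to see with two classes, one frequent and one rare. The confusion counts below are made up: class A is frequent and easy, class B is rare and handled poorly:

```python
import numpy as np

def f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical confusion counts per class.
tp = np.array([90, 1])
fp = np.array([10, 1])
fn = np.array([10, 9])

macro_f1 = np.mean([f1(tp[k], fp[k], fn[k]) for k in range(2)])
micro_f1 = f1(tp.sum(), fp.sum(), fn.sum())  # pool counts first, then compute F1
```

Class A scores F1 = 0.9 and class B only 1/6, so the macro average (~0.53) exposes the weak rare class while the micro average (~0.86) hides it behind the frequent one.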

Thresholding and Decision Rules in Multi-Label Problems

The training objective gives probabilities, not a final label set. Turning those probabilities into decisions is a separate design choice.

| Decision rule | What it does | When it makes sense |
| --- | --- | --- |
| Global threshold (for example 0.5) | predict label $k$ when $p_k \ge \tau$ | simple baseline when labels are similarly calibrated |
| Per-label thresholds | choose a different threshold for each label | label frequencies and calibration differ substantially |
| Top-$m$ labels | always keep the $m$ highest-scoring labels | annotation policy expects a fixed number of labels |
| Ranked retrieval metrics | evaluate the ordering directly | recommendation or tagging systems where ranking matters more than a hard cutoff |

The important point is that threshold selection belongs with evaluation and calibration, not with the definition of sigmoid itself. A well-trained multi-label model can still look weak if the thresholding rule is naive.
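One common recipe is to tune a per-label threshold on validation data by sweeping candidates and keeping the best F1. The scores and ground truth below are made up for a single rare label; in practice this is repeated for each label:

```python
import numpy as np

def f1_at_threshold(scores, y, t):
    pred = scores >= t
    tp = np.sum(pred & (y == 1))
    if tp == 0:
        return 0.0
    precision = tp / pred.sum()
    recall = tp / (y == 1).sum()
    return 2 * precision * recall / (precision + recall)

# Hypothetical validation scores and ground truth for one label.
scores = np.array([0.9, 0.4, 0.35, 0.3, 0.1, 0.05])
y      = np.array([1,   1,   0,    1,   0,   0])

candidates = np.arange(1, 20) / 20  # 0.05, 0.10, ..., 0.95
best_t = max(candidates, key=lambda t: f1_at_threshold(scores, y, t))
```

In this example the tuned threshold lands well below 0.5 and beats the default, illustrating why rare or poorly calibrated labels often want lower cutoffs.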

Common Confusions

Watch Out

Softmax is wrong for multi-label classification

Softmax forces probabilities to sum to 1, creating competition between labels. If one label's probability goes up, others must go down. For multi-label problems where multiple labels can be simultaneously correct, use independent sigmoids. This is the single most common mistake in classification setup.

Watch Out

Accuracy is misleading for multi-class with imbalanced classes

A model that always predicts the majority class achieves accuracy equal to the majority class frequency. With 90% class A and 10% class B, a trivial model gets 90% accuracy. Use per-class metrics (precision, recall) and appropriate averaging.

Watch Out

0.5 is not a sacred threshold in multi-label classification

The sigmoid output is a probability estimate, not an automatic decision rule. If labels have different prevalences, costs, or calibration quality, the best threshold for one label may be very different from the best threshold for another. Treat thresholding as a validation-set choice, not as something fixed forever by the activation function.

Exercises

ExerciseCore

Problem

You have a document classification problem where each document belongs to exactly one of 50 categories. Which output activation and loss function should you use? What if each document can belong to multiple categories simultaneously?

ExerciseAdvanced

Problem

You train a multi-class classifier with 100 classes using OvR (one-vs-rest). Each binary classifier achieves 95% accuracy. Can you conclude that the overall multi-class accuracy is 95%? Why or why not?

References

Canonical:

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapter 4.3
  • Goodfellow, Bengio, Courville, Deep Learning (2016), Section 6.2.2

Current:

  • Tsoumakas & Katakis, "Multi-Label Classification: An Overview" (2007), IJDWM
  • Zhang & Zhou, "A Review on Multi-Label Learning Algorithms" (2014), IEEE TKDE


Last reviewed: April 23, 2026
