

Multi-Class and Multi-Label Classification

Multi-class (exactly one label, softmax + cross-entropy) vs multi-label (multiple labels, sigmoid + binary cross-entropy). One-vs-rest, one-vs-one, hierarchical classification, and evaluation metrics.

Core · Tier 2 · Stable · Supporting · ~40 min


Why This Matters

Binary classification is the special case. Real problems often involve multiple classes (image recognition with 1000 categories) or multiple labels per example (a movie that is both "comedy" and "romance"). The choice of output activation, loss function, and evaluation metric must match the problem structure. Using softmax for a multi-label problem is a common and costly mistake.

Setup Checklist

| Problem shape | Output layer | Training loss | Inference rule | Typical metrics |
| --- | --- | --- | --- | --- |
| Exactly one label from $K$ classes | softmax | categorical cross-entropy | argmax | accuracy, macro/micro F1, top-$k$ accuracy |
| Zero, one, or many labels from $K$ labels | independent sigmoid per label | binary cross-entropy summed over labels | threshold or ranking rule per label | macro/micro F1, exact match, Hamming loss, average precision |

Multi-Class Classification

Definition

Multi-Class Classification

Each input $x$ has exactly one correct label $y \in \{1, \ldots, K\}$. The model outputs a probability distribution over $K$ classes: $p_k(x) = P(Y = k \mid X = x)$ with $\sum_{k=1}^K p_k(x) = 1$.

Softmax

The softmax function converts a vector of logits $z \in \mathbb{R}^K$ into a valid probability distribution:

$$p_k = \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}}$$

Properties: all outputs are positive, they sum to 1, and increasing one logit decreases all other probabilities. This mutual exclusivity is correct for multi-class but wrong for multi-label.
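These properties are easy to verify in code. The sketch below is a minimal numerically stable softmax (subtracting the maximum logit before exponentiating; the shift cancels in the ratio), with a boosted-logit example illustrating the mutual exclusivity:

```python
import numpy as np

def softmax(z):
    # Subtracting the max logit avoids overflow in exp; the shift
    # cancels in the ratio, so the probabilities are unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
p = softmax(logits)

# Raising one logit raises its probability and lowers every other one.
p_boosted = softmax(logits + np.array([1.0, 0.0, 0.0]))
```

Here `p` is positive and sums to 1, and `p_boosted` shows the competition between classes: boosting the first logit shrinks the probabilities of the other two.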

Loss: Categorical Cross-Entropy

$$\ell = -\log p_c$$

where $c$ is the true class. This is the negative log-likelihood under a categorical distribution.

Proposition

Softmax Cross-Entropy Gradient

Statement

The gradient of the categorical cross-entropy loss with respect to the logits $z_k$ has a remarkably simple form:

$$\frac{\partial \ell}{\partial z_k} = p_k - \mathbb{1}[k = c]$$

where $p_k = \text{softmax}(z)_k$ and $c$ is the true class index.

Intuition

The gradient is the difference between the predicted probability and the target (which is 1 for the true class, 0 for others). If the model assigns probability 0.8 to the correct class, the gradient for that logit is $0.8 - 1 = -0.2$, pushing it up. All other logits have gradients equal to their predicted probabilities (positive, pushing them down). This is the same "prediction minus target" form as in linear and logistic regression.

Proof Sketch

For $k = c$: $\partial \ell / \partial z_c = -\partial \log p_c / \partial z_c = -(1 - p_c) = p_c - 1$. For $k \neq c$: $\partial \ell / \partial z_k = -\partial \log p_c / \partial z_k = p_k$. Both cases are captured by $p_k - \mathbb{1}[k = c]$.
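The identity can be confirmed numerically with a finite-difference check. This is a minimal numpy sketch; the logits, class index, and tolerance are arbitrary choices for illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(z, c):
    # Categorical cross-entropy: negative log-probability of the true class.
    return -np.log(softmax(z)[c])

z = np.array([1.5, -0.3, 0.8])
c = 0

# Analytic gradient: p_k - 1[k = c]
analytic = softmax(z)
analytic[c] -= 1.0

# Central finite differences for comparison.
eps = 1e-6
numeric = np.array([
    (loss(z + eps * np.eye(3)[k], c) - loss(z - eps * np.eye(3)[k], c)) / (2 * eps)
    for k in range(3)
])
```

The true-class gradient is negative (pushing its logit up under gradient descent) and the others are positive, matching the intuition above.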

Why It Matters

This clean gradient is why softmax + cross-entropy is the universal choice for multi-class classification. The gradient never saturates (unlike MSE with sigmoid) and is trivially cheap to compute.

Failure Mode

When classes are hierarchical (e.g., "golden retriever" is a "dog" is an "animal"), flat softmax treats all misclassifications as equally bad. Predicting "labrador" when the answer is "golden retriever" gets the same loss as predicting "airplane". Hierarchical loss functions address this.

Multi-Label Classification

Definition

Multi-Label Classification

Each input $x$ has a label set $Y \subseteq \{1, \ldots, K\}$ that can contain zero, one, or multiple labels. The model outputs independent probabilities $p_k(x) = P(\text{label } k \text{ present} \mid X = x)$ for each label $k$. These probabilities do not need to sum to 1.

Sigmoid Per Label

Apply an independent sigmoid to each logit:

$$p_k = \sigma(z_k) = \frac{1}{1 + e^{-z_k}}$$

Each label's probability is computed independently. The model can output $p_1 = 0.9$ and $p_2 = 0.8$ simultaneously (both labels likely present).

Loss: Binary Cross-Entropy Per Label

$$\ell = -\frac{1}{K}\sum_{k=1}^K \left[ y_k \log p_k + (1 - y_k) \log(1 - p_k) \right]$$

This is the mean of $K$ independent binary classification losses; summing instead of averaging changes only the overall scale of the gradient.
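A minimal numpy sketch of the multi-label head, assuming three labels with the first two present (the logits are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def multilabel_bce(z, y):
    # Mean of K independent binary cross-entropy terms.
    p = sigmoid(z)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

z = np.array([2.0, 1.5, -3.0])  # one logit per label
y = np.array([1.0, 1.0, 0.0])   # labels 1 and 2 both present

p = sigmoid(z)
loss = multilabel_bce(z, y)
```

Unlike softmax, both of the first two probabilities can exceed 0.5 at the same time, and the probabilities do not sum to 1.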

Proposition

Multi-Label Independence Assumption

Statement

The standard multi-label loss with independent sigmoids assumes conditional independence of labels given the input:

$$P(Y_1, \ldots, Y_K \mid X = x) = \prod_{k=1}^K P(Y_k \mid X = x)$$

This factorization means the model cannot capture label correlations (e.g., "comedy" and "romance" co-occurring more often than "comedy" and "horror") without the input features encoding this information.

Intuition

Each label has its own binary classifier. These classifiers share the feature extractor (backbone network) but make independent final predictions. Label correlations can only be captured through shared features, not through the output layer.

Proof Sketch

The likelihood under independent Bernoulli outputs factorizes: $\prod_k p_k^{y_k}(1-p_k)^{1-y_k}$. Taking the negative log gives the sum of binary cross-entropies. This is the MLE objective under the independence assumption.
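The factorization can be checked numerically: the negative log of the Bernoulli product equals the sum of per-label binary cross-entropy terms. A small sketch with made-up probabilities:

```python
import numpy as np

p = np.array([0.9, 0.7, 0.2])  # hypothetical per-label probabilities
y = np.array([1, 1, 0])        # observed label set

# Negative log-likelihood of the product of independent Bernoullis.
nll = -np.log(np.prod(p ** y * (1 - p) ** (1 - y)))

# Sum of the K binary cross-entropy terms.
bce_sum = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
```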

Why It Matters

If label correlations are strong, the independence assumption is limiting. Methods like classifier chains, graph neural networks on the label space, or transformer decoders that attend to previously predicted labels can capture these dependencies.

Failure Mode

With many correlated labels and small datasets, the independence assumption can lead to incoherent predictions (e.g., predicting "pregnant" and "male" simultaneously). Joint modeling of labels helps, but at higher computational cost.

Reduction Strategies

One-vs-Rest (OvR)

Train $K$ binary classifiers, one per class. Classifier $k$ predicts "class $k$" vs "not class $k$". At inference, select the class with the highest confidence.

  • Pros: simple, parallelizable, works with any binary classifier
  • Cons: class imbalance (each binary classifier sees one class as positive and the other $K-1$ as negative), assumes independence between classifiers

One-vs-One (OvO)

Train $K(K-1)/2$ binary classifiers, one for each pair of classes. Classifier $(j, k)$ distinguishes class $j$ from class $k$. At inference, use majority voting.

  • Pros: each classifier sees balanced data, each is a simpler problem
  • Cons: $O(K^2)$ classifiers, expensive for large $K$

For neural networks, OvR and OvO are rarely used. A single network with a softmax output layer handles multi-class classification directly and shares features across classes.
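The two inference rules can be sketched with hypothetical classifier outputs. The per-class confidences and pairwise decisions below are made up purely for illustration:

```python
import numpy as np

# One-vs-rest: pick the class whose binary classifier is most confident.
ovr_scores = np.array([0.2, 0.9, 0.4])  # hypothetical per-class confidences
ovr_pred = int(np.argmax(ovr_scores))

# One-vs-one: majority vote over K(K-1)/2 pairwise classifiers.
# decide[(j, k)] is the winner of the classifier trained on classes j vs k.
decide = {(0, 1): 0, (0, 2): 2, (1, 2): 2}  # hypothetical decisions
votes = np.zeros(3, dtype=int)
for winner in decide.values():
    votes[winner] += 1
ovo_pred = int(np.argmax(votes))
```

Here OvR selects class 1, while OvO selects class 2 because it wins both of its pairwise contests.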

Hierarchical Classification

When classes have a tree structure (e.g., taxonomy), predictions should respect the hierarchy. If the model predicts "golden retriever", it implicitly predicts "dog" and "animal".

Approaches:

  • Flat classification with hierarchy-aware loss (penalize less for nearby misclassifications in the tree)
  • Top-down classification: predict coarse class first, then refine
  • Per-level classifiers: one softmax per level of the hierarchy
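The top-down approach can be sketched as a two-stage decoder. The taxonomy and probabilities below are hypothetical:

```python
# Hypothetical two-level taxonomy and model outputs for one input.
p_coarse = {"animal": 0.8, "vehicle": 0.2}
p_fine = {
    "animal": {"dog": 0.7, "cat": 0.3},
    "vehicle": {"car": 0.6, "plane": 0.4},
}

# Top-down: commit to the most likely coarse class, then refine within it.
coarse = max(p_coarse, key=p_coarse.get)
fine = max(p_fine[coarse], key=p_fine[coarse].get)
```

Note the trade-off: an early wrong coarse decision cannot be recovered at the fine level, but the predicted path is always consistent with the hierarchy.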

Evaluation Metrics

Definition

Macro, Micro, and Weighted Averaging

For per-class metrics (precision $P_k$, recall $R_k$, F1 $F_k$):

  • Macro average: $\frac{1}{K} \sum_{k=1}^K F_k$. Treats all classes equally.
  • Micro average: compute TP, FP, FN globally across all classes, then compute F1. Dominated by frequent classes.
  • Weighted average: $\sum_{k=1}^K w_k F_k$ where $w_k$ is the fraction of samples in class $k$. Accounts for class imbalance.

Use macro averaging when all classes matter equally (even rare ones). Use micro averaging when overall accuracy matters more than per-class performance. Use weighted averaging as a compromise.
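The macro/micro gap is easy to see with two classes, one frequent and one rare. The confusion counts below are made up: class A is frequent and easy, class B is rare and handled poorly:

```python
import numpy as np

def f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical confusion counts per class.
tp = np.array([90, 1])
fp = np.array([10, 1])
fn = np.array([10, 9])

macro_f1 = np.mean([f1(tp[k], fp[k], fn[k]) for k in range(2)])
micro_f1 = f1(tp.sum(), fp.sum(), fn.sum())  # pool counts first, then compute F1
```

Class A scores F1 = 0.9 and class B only 1/6, so the macro average (~0.53) exposes the weak rare class while the micro average (~0.86) hides it behind the frequent one.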

Thresholding and Decision Rules in Multi-Label Problems

The training objective gives probabilities, not a final label set. Turning those probabilities into decisions is a separate design choice.

| Decision rule | What it does | When it makes sense |
| --- | --- | --- |
| Global threshold (for example 0.5) | predict label $k$ when $p_k \ge \tau$ | simple baseline when labels are similarly calibrated |
| Per-label thresholds | choose a different threshold for each label | label frequencies and calibration differ substantially |
| Top-$m$ labels | always keep the $m$ highest-scoring labels | annotation policy expects a fixed number of labels |
| Ranked retrieval metrics | evaluate the ordering directly | recommendation or tagging systems where ranking matters more than a hard cutoff |

The important point is that threshold selection belongs with evaluation and calibration, not with the definition of sigmoid itself. A well-trained multi-label model can still look weak if the thresholding rule is naive.
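One common recipe is to tune a per-label threshold on validation data by sweeping candidates and keeping the best F1. The scores and ground truth below are made up for a single rare label; in practice this is repeated for each label:

```python
import numpy as np

def f1_at_threshold(scores, y, t):
    pred = scores >= t
    tp = np.sum(pred & (y == 1))
    if tp == 0:
        return 0.0
    precision = tp / pred.sum()
    recall = tp / (y == 1).sum()
    return 2 * precision * recall / (precision + recall)

# Hypothetical validation scores and ground truth for one label.
scores = np.array([0.9, 0.4, 0.35, 0.3, 0.1, 0.05])
y      = np.array([1,   1,   0,    1,   0,   0])

candidates = np.arange(1, 20) / 20  # 0.05, 0.10, ..., 0.95
best_t = max(candidates, key=lambda t: f1_at_threshold(scores, y, t))
```

In this example the tuned threshold lands well below 0.5 and beats the default, illustrating why rare or poorly calibrated labels often want lower cutoffs.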

Common Confusions

Watch Out

Softmax is wrong for multi-label classification

Softmax forces probabilities to sum to 1, creating competition between labels. If one label's probability goes up, others must go down. For multi-label problems where multiple labels can be simultaneously correct, use independent sigmoids. This is the single most common mistake in classification setup.

Watch Out

Accuracy is misleading for multi-class with imbalanced classes

A model that always predicts the majority class achieves accuracy equal to the majority class frequency. With 90% class A and 10% class B, a trivial model gets 90% accuracy. Use per-class metrics (precision, recall) and appropriate averaging.

Watch Out

0.5 is not a sacred threshold in multi-label classification

The sigmoid output is a probability estimate, not an automatic decision rule. If labels have different prevalences, costs, or calibration quality, the best threshold for one label may be very different from the best threshold for another. Treat thresholding as a validation-set choice, not as something fixed forever by the activation function.

Exercises

ExerciseCore

Problem

You have a document classification problem where each document belongs to exactly one of 50 categories. Which output activation and loss function should you use? What if each document can belong to multiple categories simultaneously?

ExerciseAdvanced

Problem

You train a multi-class classifier with 100 classes using OvR (one-vs-rest). Each binary classifier achieves 95% accuracy. Can you conclude that the overall multi-class accuracy is 95%? Why or why not?

References

Canonical:

  • Bishop, Pattern Recognition and Machine Learning (2006), Chapter 4.3
  • Goodfellow, Bengio, Courville, Deep Learning (2016), Section 6.2.2

Current:

  • Tsoumakas & Katakis, "Multi-Label Classification: An Overview" (2007), IJDWM
  • Zhang & Zhou, "A Review on Multi-Label Learning Algorithms" (2014), IEEE TKDE


Last reviewed: April 23, 2026
