Multi-Class and Multi-Label Classification
Multi-class (exactly one label, softmax + cross-entropy) vs multi-label (multiple labels, sigmoid + binary cross-entropy). One-vs-rest, one-vs-one, hierarchical classification, and evaluation metrics.
Prerequisites
Why This Matters
Binary classification is the special case. Real problems often involve multiple classes (image recognition with 1000 categories) or multiple labels per example (a movie that is both "comedy" and "romance"). The choice of output activation, loss function, and evaluation metric must match the problem structure. Using softmax for a multi-label problem is a common and costly mistake.
Setup Checklist
| Problem shape | Output layer | Training loss | Inference rule | Typical metrics |
|---|---|---|---|---|
| Exactly one label from $K$ classes | softmax | categorical cross-entropy | argmax | accuracy, macro/micro F1, top-$k$ accuracy |
| Zero, one, or many labels from $L$ labels | independent sigmoid per label | binary cross-entropy summed over labels | threshold or ranking rule per label | macro/micro F1, exact match, Hamming loss, average precision |
Multi-Class Classification
Each input $x$ has exactly one correct label $y \in \{1, \dots, K\}$. The model outputs a probability distribution over the $K$ classes: $p_i = P(y = i \mid x)$ with $\sum_{i=1}^{K} p_i = 1$.
Softmax
The softmax function converts a vector of logits $z \in \mathbb{R}^K$ into a valid probability distribution:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$
Properties: all outputs are positive, they sum to 1, and increasing one logit decreases all other probabilities. This mutual exclusivity is correct for multi-class but wrong for multi-label.
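A minimal sketch of these properties in NumPy, using the standard max-subtraction trick for numerical stability (the function name and toy logits are illustrative, not from the original):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: subtract the max logit before exponentiating."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

p = softmax([2.0, 1.0, 0.1])
# All outputs positive and summing to 1.
assert np.all(p > 0) and np.isclose(p.sum(), 1.0)

# Raising one logit lowers every other probability (mutual exclusivity).
q = softmax([3.0, 1.0, 0.1])
assert q[0] > p[0] and q[1] < p[1] and q[2] < p[2]
```

Subtracting the max logit changes nothing mathematically (the constant cancels in the ratio) but prevents overflow for large logits.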
Loss: Categorical Cross-Entropy
$$\mathcal{L} = -\log p_y$$

where $y$ is the true class. This is the negative log-likelihood under a categorical distribution.
Softmax Cross-Entropy Gradient
Statement
The gradient of the categorical cross-entropy loss with respect to the logits has a remarkably simple form:

$$\frac{\partial \mathcal{L}}{\partial z_i} = p_i - \mathbb{1}[i = y]$$

where $p_i = \mathrm{softmax}(z)_i$ and $y$ is the true class index.
Intuition
The gradient is the difference between the predicted probability and the target (which is 1 for the true class, 0 for others). If the model assigns probability 0.8 to the correct class, the gradient for that logit is $0.8 - 1 = -0.2$, pushing it up. All other logits have gradients equal to their predicted probabilities (positive, pushing them down). This is the same "prediction minus target" form as in linear and logistic regression.
Proof Sketch
For $i = y$: $\frac{\partial \mathcal{L}}{\partial z_y} = -\frac{1}{p_y} \cdot \frac{\partial p_y}{\partial z_y} = -\frac{1}{p_y} \cdot p_y(1 - p_y) = p_y - 1$. For $i \neq y$: $\frac{\partial \mathcal{L}}{\partial z_i} = -\frac{1}{p_y} \cdot (-p_y p_i) = p_i$. Both cases are captured by $\frac{\partial \mathcal{L}}{\partial z_i} = p_i - \mathbb{1}[i = y]$.
Why It Matters
This clean gradient is why softmax + cross-entropy is the universal choice for multi-class classification. The gradient never saturates (unlike MSE with sigmoid) and is trivially cheap to compute.
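The "prediction minus target" form can be checked numerically. A sketch, assuming a small 3-class example with an arbitrary logit vector and true class:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(z, y):
    """Categorical cross-entropy: negative log-probability of the true class."""
    return -np.log(softmax(z)[y])

z = np.array([1.5, -0.3, 0.8])
y = 1  # true class index (hypothetical example)

# Analytic gradient: p - one_hot(y)
p = softmax(z)
grad = p.copy()
grad[y] -= 1.0

# Numerical check by central finite differences
eps = 1e-6
num = np.array([
    (loss(z + eps * np.eye(3)[i], y) - loss(z - eps * np.eye(3)[i], y)) / (2 * eps)
    for i in range(3)
])
assert np.allclose(grad, num, atol=1e-6)
```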
Failure Mode
When classes are hierarchical (e.g., "golden retriever" is a "dog" is an "animal"), flat softmax treats all misclassifications as equally bad. Predicting "labrador" when the answer is "golden retriever" gets the same loss as predicting "airplane". Hierarchical loss functions address this.
Multi-Label Classification
Each input $x$ has a label set $Y \subseteq \{1, \dots, L\}$ that can contain zero, one, or multiple labels. The model outputs an independent probability $p_\ell = P(\ell \in Y \mid x)$ for each label $\ell$. These probabilities do not need to sum to 1.
Sigmoid Per Label
Apply an independent sigmoid to each logit:

$$p_\ell = \sigma(z_\ell) = \frac{1}{1 + e^{-z_\ell}}$$

Each label's probability is computed independently. The model can assign high probability to several labels at once, for example $p_1 = 0.9$ and $p_2 = 0.8$ (both labels likely present).
Loss: Binary Cross-Entropy Per Label
$$\mathcal{L} = -\sum_{\ell=1}^{L} \left[ y_\ell \log p_\ell + (1 - y_\ell) \log(1 - p_\ell) \right]$$

where $y_\ell \in \{0, 1\}$ indicates whether label $\ell$ is present. This is the sum of $L$ independent binary classification losses.
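A minimal sketch of this loss for three labels (logits and targets are hypothetical), showing that the per-label probabilities are not forced to sum to 1:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def multilabel_bce(z, y):
    """Sum of independent binary cross-entropies over L labels."""
    p = sigmoid(z)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

z = np.array([2.0, -1.0, 0.5])   # logits for 3 labels
y = np.array([1.0, 0.0, 1.0])    # e.g. "comedy" and "romance" both present

p = sigmoid(z)
# Independent probabilities: the total can exceed 1, unlike softmax.
assert p.sum() > 1.0

bce = multilabel_bce(z, y)
assert bce > 0 and np.isfinite(bce)
```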
Multi-Label Independence Assumption
Statement
The standard multi-label loss with independent sigmoids assumes conditional independence of labels given the input:

$$P(y_1, \dots, y_L \mid x) = \prod_{\ell=1}^{L} P(y_\ell \mid x)$$
This factorization means the model cannot capture label correlations (e.g., "comedy" and "romantic" co-occurring more often than "comedy" and "horror") without the input features encoding this information.
Intuition
Each label has its own binary classifier. These classifiers share the feature extractor (backbone network) but make independent final predictions. Label correlations can only be captured through shared features, not through the output layer.
Proof Sketch
The likelihood under independent Bernoulli outputs factorizes: $P(y_1, \dots, y_L \mid x) = \prod_{\ell=1}^{L} p_\ell^{y_\ell} (1 - p_\ell)^{1 - y_\ell}$. Taking the negative log gives the sum of binary cross-entropies. This is the MLE objective under the independence assumption.
Why It Matters
If label correlations are strong, the independence assumption is limiting. Methods like classifier chains, graph neural networks on the label space, or transformer decoders that attend to previously predicted labels can capture these dependencies.
Failure Mode
With many correlated labels and small datasets, the independence assumption can lead to incoherent predictions (e.g., predicting "pregnant" and "male" simultaneously). Joint modeling of labels helps, but at higher computational cost.
Reduction Strategies
One-vs-Rest (OvR)
Train $K$ binary classifiers, one per class. Classifier $k$ predicts "class $k$" vs "not class $k$". At inference, select the class whose classifier reports the highest confidence.
- Pros: simple, parallelizable, works with any binary classifier
- Cons: class imbalance (each binary classifier sees positives from 1 class vs negatives from the other $K - 1$ classes), assumes independence between classifiers
One-vs-One (OvO)
Train $K(K-1)/2$ binary classifiers, one for each pair of classes. Classifier $(i, j)$ distinguishes class $i$ from class $j$. At inference, use majority voting over all pairwise predictions.
- Pros: each classifier sees balanced data, each is a simpler problem
- Cons: $O(K^2)$ classifiers, expensive for large $K$
For neural networks, OvR and OvO are rarely used. A single network with a softmax output layer handles multi-class classification directly and shares features across classes.
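The two decision rules above can be sketched in a few lines. This assumes the binary classifiers are already trained; the confidence scores and pairwise outcomes below are hypothetical toy values:

```python
import numpy as np
from itertools import combinations
from collections import Counter

def ovr_predict(scores):
    """OvR: pick the class whose binary classifier is most confident."""
    return int(np.argmax(scores))

def ovo_predict(pairwise_winner, n_classes):
    """OvO: majority vote over all K(K-1)/2 pairwise classifiers.

    pairwise_winner(i, j) returns the winning class for the pair (i, j).
    """
    votes = Counter(pairwise_winner(i, j)
                    for i, j in combinations(range(n_classes), 2))
    return votes.most_common(1)[0][0]

# Toy example with K = 3 classes.
assert ovr_predict([0.2, 0.9, 0.4]) == 1

winners = {(0, 1): 1, (0, 2): 2, (1, 2): 1}  # class 1 wins both of its pairings
assert ovo_predict(lambda i, j: winners[(i, j)], 3) == 1
```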
Hierarchical Classification
When classes have a tree structure (e.g., taxonomy), predictions should respect the hierarchy. If the model predicts "golden retriever", it implicitly predicts "dog" and "animal".
Approaches:
- Flat classification with hierarchy-aware loss (penalize less for nearby misclassifications in the tree)
- Top-down classification: predict coarse class first, then refine
- Per-level classifiers: one softmax per level of the hierarchy
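The top-down approach can be sketched with a toy class tree; the hierarchy and the per-node scorer below are hypothetical, not from any specific dataset:

```python
# Top-down hierarchical prediction: classify at each level of the tree,
# then descend into the chosen subtree until a leaf class is reached.
tree = {
    "root": ["animal", "vehicle"],
    "animal": ["dog", "cat"],
    "dog": ["golden retriever", "labrador"],
}

def top_down_predict(choose, node="root"):
    """choose(children) returns the selected child among a node's children
    (in practice, the argmax of a classifier trained at that node)."""
    while node in tree:
        node = choose(tree[node])
    return node

# Toy scorer that always picks the first child at every level.
leaf = top_down_predict(lambda children: children[0])
assert leaf == "golden retriever"
```

A practical benefit of this structure is that an early coarse mistake ("vehicle" instead of "animal") is visible at the top of the path, whereas a flat softmax gives no signal about where in the hierarchy the error occurred.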
Evaluation Metrics
Macro, Micro, and Weighted Averaging
For per-class metrics (precision $P_k$, recall $R_k$, F1 score $F_k$):
- Macro average: $F_{\text{macro}} = \frac{1}{K} \sum_{k=1}^{K} F_k$. Treats all classes equally.
- Micro average: compute TP, FP, FN globally across all classes, then compute F1 from those totals. Dominated by frequent classes.
- Weighted average: $F_{\text{weighted}} = \sum_{k=1}^{K} w_k F_k$ where $w_k$ is the fraction of samples in class $k$. Accounts for class imbalance.
Use macro averaging when all classes matter equally (even rare ones). Use micro averaging when overall accuracy matters more than per-class performance. Use weighted averaging as a compromise.
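A small numeric sketch of the macro/micro gap, using made-up per-class counts for an imbalanced two-class problem (one frequent easy class, one rare hard class):

```python
import numpy as np

def f1(tp, fp, fn):
    """F1 from raw counts: 2*TP / (2*TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Hypothetical counts: class A is frequent and easy, class B rare and hard.
counts = {"A": dict(tp=90, fp=8, fn=2), "B": dict(tp=2, fp=2, fn=8)}

macro = np.mean([f1(**c) for c in counts.values()])
micro = f1(
    tp=sum(c["tp"] for c in counts.values()),
    fp=sum(c["fp"] for c in counts.values()),
    fn=sum(c["fn"] for c in counts.values()),
)

# Micro is dominated by the frequent class; macro exposes the rare-class failure.
assert micro > macro
```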
Thresholding and Decision Rules in Multi-Label Problems
The training objective gives probabilities, not a final label set. Turning those probabilities into decisions is a separate design choice.
| Decision rule | What it does | When it makes sense |
|---|---|---|
| Global threshold $t$ (for example 0.5) | predict label $\ell$ when $p_\ell > t$ | simple baseline when labels are similarly calibrated |
| Per-label thresholds | choose a different threshold for each label | label frequencies and calibration differ substantially |
| Top-$k$ labels | always predict the $k$ highest-probability labels | useful when annotation policy expects a fixed number of labels |
| Ranked retrieval metrics | evaluate the ordering directly | recommendation or tagging systems where ranking matters more than a hard cutoff |
The important point is that threshold selection belongs with evaluation and calibration, not with the definition of sigmoid itself. A well-trained multi-label model can still look weak if the thresholding rule is naive.
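Per-label threshold tuning can be sketched as a simple grid search over validation F1. The synthetic data below simulates a rare, poorly calibrated label; the helper names and the generating process are illustrative assumptions:

```python
import numpy as np

def f1_binary(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_threshold(probs, y_true, grid=np.linspace(0.05, 0.95, 19)):
    """Pick the threshold that maximizes validation F1 for one label."""
    return max(grid, key=lambda t: f1_binary(y_true, (probs >= t).astype(int)))

rng = np.random.default_rng(0)
# Synthetic validation set: a rare label (~10% prevalence) whose
# probabilities sit well below the naive 0.5 cutoff.
y = (rng.random(500) < 0.1).astype(int)
probs = np.clip(0.3 * y + rng.normal(0.15, 0.1, 500), 0, 1)

t = best_threshold(probs, y)
f_tuned = f1_binary(y, (probs >= t).astype(int))
f_default = f1_binary(y, (probs >= 0.5).astype(int))
assert f_tuned >= f_default
```

The tuned threshold is chosen on validation data per label; the same sweep at 0.5 would miss most positives for this label even though the ranking of its probabilities is informative.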
Common Confusions
Softmax is wrong for multi-label classification
Softmax forces probabilities to sum to 1, creating competition between labels. If one label's probability goes up, others must go down. For multi-label problems where multiple labels can be simultaneously correct, use independent sigmoids. This is the single most common mistake in classification setup.
Accuracy is misleading for multi-class with imbalanced classes
A model that always predicts the majority class achieves accuracy equal to the majority class frequency. With 90% class A and 10% class B, a trivial model gets 90% accuracy. Use per-class metrics (precision, recall) and appropriate averaging.
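The 90/10 example from the paragraph above, worked out directly:

```python
import numpy as np

# 90% of samples are class 0, 10% are class 1.
y_true = np.array([0] * 90 + [1] * 10)
y_majority = np.zeros(100, dtype=int)   # trivial model: always predict class 0

accuracy = np.mean(y_majority == y_true)
assert np.isclose(accuracy, 0.9)        # looks strong, but...

# Per-class recall exposes the total failure on the minority class.
recall_0 = np.mean(y_majority[y_true == 0] == 0)
recall_1 = np.mean(y_majority[y_true == 1] == 1)
assert recall_0 == 1.0 and recall_1 == 0.0
```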
0.5 is not a sacred threshold in multi-label classification
The sigmoid output is a probability estimate, not an automatic decision rule. If labels have different prevalences, costs, or calibration quality, the best threshold for one label may be very different from the best threshold for another. Treat thresholding as a validation-set choice, not as something fixed forever by the activation function.
Exercises
Problem
You have a document classification problem where each document belongs to exactly one of 50 categories. Which output activation and loss function should you use? What if each document can belong to multiple categories simultaneously?
Problem
You train a multi-class classifier with 100 classes using OvR (one-vs-rest). Each binary classifier achieves 95% accuracy. Can you conclude that the overall multi-class accuracy is 95%? Why or why not?
References
Canonical:
- Bishop, Pattern Recognition and Machine Learning (2006), Chapter 4.3
- Goodfellow, Bengio, Courville, Deep Learning (2016), Section 6.2.2
Current:
- Tsoumakas & Katakis, "Multi-Label Classification: An Overview" (2007), IJDWM
- Zhang & Zhou, "A Review on Multi-Label Learning Algorithms" (2014), IEEE TKDE
Next Topics
- Cross-entropy loss deep dive: the full theory behind the loss function
- Confusion matrices and classification metrics: detailed evaluation for multi-class problems
Last reviewed: April 23, 2026