
Methodology

Confusion Matrix: MCC, Kappa, and Cost-Sensitive Evaluation

Beyond binary classification: multi-class confusion matrices, micro vs macro averaging, Matthews correlation coefficient, Cohen's kappa, cost-sensitive evaluation, and how to read a confusion matrix without getting the axes wrong.


Why This Matters

[Figure: a 2×2 confusion matrix with TP = 85, FN = 15, FP = 10, TN = 890. Precision = TP/(TP+FP) = 0.895; Recall = TP/(TP+FN) = 0.850; F1 = 2PR/(P+R) = 0.872; Accuracy = (TP+TN)/N = 0.975.] Accuracy is 97.5% but only 85% of actual positives are caught (recall). Precision and recall tell the real story.

Accuracy is a single number that hides critical failure modes. A model with 99% accuracy on a dataset where 99% of examples belong to one class has learned nothing. The confusion matrix is the complete record of what a classifier got right and wrong. Every useful classification metric (precision, recall, F1, MCC, kappa) is a function of the confusion matrix. If you cannot read a confusion matrix, you cannot evaluate a classifier.

Mental Model

A confusion matrix is a table. Rows correspond to true labels. Columns correspond to predicted labels. Entry $(i, j)$ counts how many examples with true label $i$ received predicted label $j$. Diagonal entries are correct predictions. Off-diagonal entries are errors. The pattern of errors tells you how the model fails, not just how often.

Convention warning: some libraries (notably scikit-learn) use rows = true, columns = predicted. Others transpose this. Always check.
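A quick way to verify the convention is a tiny example where you already know the labels (a minimal sketch using scikit-learn; the label arrays are invented for illustration):

```python
from sklearn.metrics import confusion_matrix

# scikit-learn convention: rows = true labels, columns = predictions.
y_true = [0, 0, 0, 1]   # three actual negatives, one actual positive
y_pred = [0, 1, 1, 1]   # the model flags three examples as positive

C = confusion_matrix(y_true, y_pred)
# Row 0 (true class 0): [1, 2] -> 1 correct, 2 wrongly predicted as class 1.
# Row 1 (true class 1): [0, 1] -> the single positive is caught.
print(C)
```

If your library put predictions in rows instead, the off-diagonal counts would be transposed, which is exactly what this check exposes.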

Formal Setup

Definition

Confusion matrix

For a $K$-class classification problem with $n$ examples, the confusion matrix $C$ has entries:

$$C_{ij} = |\{x : \text{true}(x) = i \text{ and } \text{pred}(x) = j\}|$$

The row sums $\sum_j C_{ij}$ give the number of examples in each true class. The column sums $\sum_i C_{ij}$ give the number of predictions for each class.

Per-Class Metrics

For class $k$ in a multi-class problem, define the one-vs-rest quantities:

  • True positives: $\text{TP}_k = C_{kk}$
  • False positives: $\text{FP}_k = \sum_{i \neq k} C_{ik}$ (column $k$ minus diagonal)
  • False negatives: $\text{FN}_k = \sum_{j \neq k} C_{kj}$ (row $k$ minus diagonal)
  • True negatives: $\text{TN}_k = n - \text{TP}_k - \text{FP}_k - \text{FN}_k$

Then:

$$\text{Precision}_k = \frac{\text{TP}_k}{\text{TP}_k + \text{FP}_k}, \quad \text{Recall}_k = \frac{\text{TP}_k}{\text{TP}_k + \text{FN}_k}$$
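The one-vs-rest quantities fall out of the matrix with a few NumPy reductions (a sketch on an invented 3-class matrix):

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows = true, columns = predicted).
C = np.array([[50,  3,  2],
              [ 4, 40,  6],
              [ 1,  5, 39]])
n = C.sum()

tp = np.diag(C)              # TP_k = C_kk
fp = C.sum(axis=0) - tp      # column k minus the diagonal
fn = C.sum(axis=1) - tp      # row k minus the diagonal
tn = n - tp - fp - fn        # everything else

precision = tp / (tp + fp)
recall = tp / (tp + fn)
```

Note that `tp + fn` recovers the row sums (true class sizes) and `tp + fp` the column sums (prediction counts), matching the definitions above.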

Micro vs Macro Averaging

Definition

Macro averaging

Compute the metric for each class independently, then average:

$$\text{Macro-Precision} = \frac{1}{K} \sum_{k=1}^{K} \text{Precision}_k$$

Macro averaging weights all classes equally regardless of their size. A rare class with 10 examples counts as much as a common class with 10,000 examples.

Definition

Micro averaging

Pool all TP, FP, FN across classes, then compute the metric once:

$$\text{Micro-Precision} = \frac{\sum_k \text{TP}_k}{\sum_k (\text{TP}_k + \text{FP}_k)}$$

For single-label classification, micro-precision = micro-recall = micro-F1 = accuracy. Micro averaging is dominated by the largest classes.

The choice matters when classes are imbalanced. Macro averaging surfaces failures on rare classes. Micro averaging reflects overall performance weighted by class frequency.
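The gap between the two averages shows up immediately on imbalanced data (a sketch with invented labels where the rare class is always missed):

```python
from sklearn.metrics import precision_score

# Invented imbalanced 3-class labels: class 2 is rare and never predicted.
y_true = [0]*90 + [1]*8 + [2]*2
y_pred = [0]*90 + [0]*4 + [1]*4 + [0]*2

macro = precision_score(y_true, y_pred, average='macro', zero_division=0)
micro = precision_score(y_true, y_pred, average='micro')
# micro = 0.94 (dominated by class 0); macro ~ 0.65 (exposes the rare-class failure)
```

`zero_division=0` assigns precision 0 to class 2, which receives no predictions at all; without it scikit-learn emits an undefined-metric warning.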

Matthews Correlation Coefficient

Proposition

MCC as Geometric Mean of Regression Coefficients

Statement

The Matthews Correlation Coefficient for binary classification is:

$$\text{MCC} = \frac{\text{TP} \cdot \text{TN} - \text{FP} \cdot \text{FN}}{\sqrt{(\text{TP}+\text{FP})(\text{TP}+\text{FN})(\text{TN}+\text{FP})(\text{TN}+\text{FN})}}$$

This equals the Pearson correlation between the true binary labels and the predicted binary labels. It is also the geometric mean of the two regression coefficients (predicting true from predicted, and predicted from true): $\text{MCC} = \sqrt{\text{Informedness} \times \text{Markedness}}$.

Intuition

MCC uses all four quadrants of the confusion matrix. It returns +1 for perfect prediction, 0 for random prediction, and -1 for total disagreement. Unlike F1, which ignores true negatives, MCC penalizes models that succeed on one class by failing on another.

Proof Sketch

Treat the $2 \times 2$ confusion matrix as a contingency table. The Pearson correlation for binary variables reduces to the phi coefficient, which is exactly the MCC formula. The geometric mean identity follows from Informedness $= \text{TPR} - \text{FPR}$ and Markedness $= \text{PPV} - \text{FOR}$, whose product equals $\text{MCC}^2$ by algebraic manipulation of the confusion matrix entries.

Why It Matters

MCC is the single best metric for binary classification on imbalanced datasets. Accuracy can be high despite useless predictions (predict the majority class always). F1 ignores true negatives. MCC accounts for all four cells and is invariant to which class is labeled positive.

Failure Mode

MCC is undefined when any row or column of the confusion matrix sums to zero (division by zero). This happens when the model predicts only one class or when one class is absent from the test set. In multi-class settings, the extension of MCC (the $R_K$ coefficient) is less commonly used and harder to interpret.
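Both the direct formula and the zero-denominator failure mode can be checked numerically (a sketch; the labels are invented, and scikit-learn's `matthews_corrcoef` adopts the convention of returning 0.0 when the denominator vanishes, e.g. for a constant predictor):

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

def mcc_from_counts(tp, tn, fp, fn):
    """Binary MCC straight from the four confusion-matrix cells."""
    num = tp * tn - fp * fn
    den = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return num / den if den > 0 else 0.0  # convention: 0 when undefined

# Majority-class baseline on a 99:1 split: 99% accuracy, zero MCC.
y_true = [0]*99 + [1]
y_pred = [0]*100
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
mcc = matthews_corrcoef(y_true, y_pred)   # 0.0: one column sum is zero
```

The helper `mcc_from_counts` is a hypothetical name introduced here for illustration; it returns 1.0 for perfect prediction and 0.0 in the degenerate case, mirroring the library behavior.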

Cohen's Kappa

Definition

Cohen's kappa

Cohen's kappa measures agreement between predicted and true labels corrected for chance:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

where $p_o$ is observed agreement (accuracy) and $p_e$ is expected agreement under independence:

$$p_e = \sum_{k=1}^{K} \frac{(\sum_j C_{kj})(\sum_i C_{ik})}{n^2}$$

$\kappa = 1$ means perfect agreement. $\kappa = 0$ means no better than random assignment that preserves the marginal distributions.

Kappa is useful when the class distribution is skewed and accuracy is misleading. It adjusts for the fact that even a random classifier achieves nonzero accuracy when classes are imbalanced.
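The definition can be exercised end to end on a small invented example and cross-checked against scikit-learn (a sketch):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Skewed binary problem: 90 negatives, 10 positives.
y_true = [0]*90 + [1]*10
y_pred = [0]*85 + [1]*5 + [0]*5 + [1]*5

# Manual computation from the confusion matrix (rows = true, cols = predicted).
C = np.array([[85, 5],
              [ 5, 5]])
n = C.sum()
p_o = np.trace(C) / n                          # observed agreement = 0.90
p_e = (C.sum(axis=1) @ C.sum(axis=0)) / n**2   # chance agreement = 0.82
kappa = (p_o - p_e) / (1 - p_e)                # (0.90 - 0.82)/0.18 = 4/9
```

Despite 90% raw agreement, kappa is only about 0.44 because most of that agreement is what the skewed marginals would produce by chance.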

Cost-Sensitive Evaluation

Not all errors are equally bad. In medical diagnosis, a false negative (missing a disease) is typically worse than a false positive (unnecessary follow-up). A cost matrix $W \in \mathbb{R}^{K \times K}$ assigns a cost $W_{ij}$ to predicting class $j$ when the true class is $i$. Correct predictions have $W_{ii} = 0$.

The total cost is:

$$\text{Cost} = \sum_{i,j} W_{ij} \cdot C_{ij}$$

Accuracy implicitly uses $W_{ij} = 1$ for $i \neq j$ and $W_{ii} = 0$. Making the cost matrix explicit forces you to state what errors matter.
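The total cost is a single element-wise product and sum (a sketch with an invented cost matrix that charges a false negative 50 times a false positive):

```python
import numpy as np

# Rows = true class, columns = predicted; index 0 = negative, 1 = positive.
C = np.array([[9700, 200],    # [true negatives, false positives]
              [  20,  80]])   # [false negatives, true positives]
W = np.array([[ 0,  1],       # a false positive costs 1
              [50,  0]])      # a false negative costs 50

total_cost = (W * C).sum()    # 200*1 + 20*50 = 1200
```

Under this cost matrix the 20 missed cases dominate the 200 false alarms, reversing the impression accuracy gives.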

Canonical Examples

Example

Imbalanced binary classification

A disease screening test on 10,000 patients (100 positive, 9,900 negative):

          Pred +   Pred -
True +        80       20
True -       200    9,700

Accuracy: $9{,}780/10{,}000 = 97.8\%$. Precision: $80/280 = 28.6\%$. Recall: $80/100 = 80\%$. MCC: $(80 \cdot 9700 - 200 \cdot 20)/\sqrt{280 \cdot 100 \cdot 9900 \cdot 9720} \approx 0.47$.

The model detects 80% of cases but most positive predictions are wrong. Accuracy alone makes this look good. MCC of 0.47 correctly indicates moderate performance.
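These numbers can be reproduced directly from the four counts (a quick check in NumPy):

```python
import numpy as np

# The screening example: recompute each reported metric from the counts.
tp, fn, fp, tn = 80, 20, 200, 9700
n = tp + fn + fp + tn

accuracy = (tp + tn) / n        # 0.978
precision = tp / (tp + fp)      # ~0.286
recall = tp / (tp + fn)         # 0.80
mcc = (tp*tn - fp*fn) / np.sqrt(float((tp + fp)*(tp + fn)*(tn + fp)*(tn + fn)))
```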

Common Confusions

Watch Out

Rows vs columns convention varies across libraries

scikit-learn: rows = true labels, columns = predictions. Some textbooks and R packages use the transpose. TensorFlow's tf.math.confusion_matrix follows scikit-learn. Always verify by checking a known example where you know the true labels and predictions.

Watch Out

F1 and accuracy can both be misleading on imbalanced data

F1 ignores true negatives entirely. A model that predicts "positive" for everything has recall = 1 and nonzero F1, even though it is useless. MCC is more reliable because it uses all four cells. On balanced datasets, F1 and MCC give similar rankings.
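The all-positive degenerate model makes the contrast concrete (a sketch with invented labels; as noted above, scikit-learn returns 0.0 for MCC when the denominator is zero):

```python
from sklearn.metrics import f1_score, matthews_corrcoef

# Degenerate "always positive" model on a 10:90 imbalanced set.
y_true = [1]*10 + [0]*90
y_pred = [1]*100

f1 = f1_score(y_true, y_pred)            # 2/11 ~ 0.18: nonzero despite uselessness
mcc = matthews_corrcoef(y_true, y_pred)  # 0.0: no discrimination at all
```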

Watch Out

Multi-class MCC is not the average of binary MCCs

Extending MCC to $K > 2$ classes requires the generalized formula using the entire $K \times K$ confusion matrix, not averaging $K$ one-vs-rest binary MCCs. The generalization treats the confusion matrix as a contingency table and computes a multivariate correlation.

Multi-class Extension and Weighted Averaging

For $K > 2$ classes, each of the binary metrics generalizes via three averaging strategies: macro, micro, and weighted. The first two are defined above; weighted averaging is a middle ground.

Definition

Weighted averaging

Compute the metric for each class, then take a weighted average where the weight for class kk is proportional to the number of true examples in that class:

$$\text{Weighted-Precision} = \frac{\sum_k |\{i : \text{true}(i) = k\}| \cdot \text{Precision}_k}{n}$$

Weighted averaging gives proportionally more influence to large classes while still distinguishing per-class performance. It is generally not equal to micro-averaging. In single-label multi-class classification, micro-precision = micro-recall = micro-F1 = accuracy, while weighted-precision averages per-class precisions by true-class frequency and need not equal accuracy. Weighted-recall, however, does coincide with accuracy in this setting (each example contributes to exactly one true class), so weighted-recall = micro-recall = accuracy while weighted-precision and weighted-F1 typically differ from their micro counterparts.
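The weighted-recall = accuracy identity, and the fact that weighted-precision escapes it, are easy to confirm numerically (a sketch with invented labels; class sizes 70/20/10):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Invented 3-class labels with unequal class sizes.
y_true = [0]*70 + [1]*20 + [2]*10
y_pred = [0]*60 + [1]*10 + [1]*15 + [2]*5 + [2]*8 + [0]*2

acc = accuracy_score(y_true, y_pred)                              # 0.83
w_recall = recall_score(y_true, y_pred, average='weighted')       # equals acc
w_precision = precision_score(y_true, y_pred, average='weighted') # does not
```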

The three strategies answer different questions. Macro asks: how does the model perform on average across classes, treating each class as equally important? Weighted asks: how does the model perform on average, weighting by class frequency? Micro asks: what fraction of all predictions are correct?

For logistic regression applied to a $K$-class problem via a softmax output layer, the per-class precision and recall measure the model's discrimination quality for each output node. Calibration and precision/recall are independent properties: calibration is a property of the predicted probabilities (do they match conditional frequencies?), while precision and recall are properties of the thresholded hard predictions (do they discriminate the classes?). A model can be perfectly calibrated yet have low precision/recall on a hard problem (the true posteriors are simply not concentrated near 0 or 1), and a model can have high precision/recall yet be badly miscalibrated (e.g., logits that rank examples correctly but assign probabilities that are systematically too sharp or too flat). For a thorough audit, evaluate calibration separately (reliability diagrams, ECE, Brier decompositions; see maximum likelihood estimation and proper scoring rules) from per-class precision and recall.

Multi-class MCC

The generalization of MCC to $K > 2$ classes uses the entire confusion matrix directly. With $t_k = \sum_j C_{kj}$ (number of true examples of class $k$), $p_k = \sum_i C_{ik}$ (number of predictions of class $k$), and $c = \sum_k C_{kk}$ (total correct), the $R_K$ coefficient (Gorodkin 2004) is:

$$R_K = \frac{c \cdot n - \sum_k t_k p_k}{\sqrt{n^2 - \sum_k p_k^2} \cdot \sqrt{n^2 - \sum_k t_k^2}}$$

For $K = 2$ this reduces to the binary MCC formula. A simpler alternative sometimes reported is the average of $K$ one-vs-rest binary MCCs; it is interpretable but does not equal $R_K$.

The key property that makes MCC attractive for multi-class settings is the same as in the binary case: it is invariant to permutation of class labels and degrades gracefully when one class dominates.
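The marginal-count form of $R_K$ can be verified against scikit-learn, whose `matthews_corrcoef` handles the multi-class case directly (a sketch on invented 3-class labels):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, matthews_corrcoef

# Invented 3-class labels: class 2 is classified perfectly, 0 and 1 partly confused.
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 0, 2, 2, 2]

C = confusion_matrix(y_true, y_pred)
n = C.sum()
t = C.sum(axis=1)          # true counts per class
p = C.sum(axis=0)          # predicted counts per class
c = np.trace(C)            # total correct predictions

r_k = (c * n - t @ p) / (np.sqrt(n**2 - (p**2).sum()) *
                         np.sqrt(n**2 - (t**2).sum()))
# matches matthews_corrcoef(y_true, y_pred)
```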

Summary

  • The confusion matrix is the complete classification record; every metric derives from it
  • Macro averaging weights classes equally; micro averaging weights by class frequency; weighted averaging is proportional to class size
  • MCC uses all four quadrants and is the best single metric for imbalanced binary classification
  • Cohen's kappa corrects for chance agreement
  • Cost matrices make error asymmetry explicit
  • For multi-class problems, the choice among macro/micro/weighted must match the question being answered
  • Always check whether your library puts true labels in rows or columns

Exercises

Exercise (Core)

Problem

A 3-class confusion matrix is:

          Pred A   Pred B   Pred C
True A        40        5        5
True B        10       30       10
True C         0        5       45

Compute the macro-averaged precision.

Exercise (Advanced)

Problem

Prove that for binary classification, micro-averaged precision equals accuracy.

References

Canonical:

  • Cohen, "A Coefficient of Agreement for Nominal Scales," Educational and Psychological Measurement 20(1), 1960
  • Matthews, "Comparison of the Predicted and Observed Secondary Structure of T4 Phage Lysozyme," BBA 405(2), 1975
  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Section 7.2 (metrics for classification)
  • Powers, "Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness and Correlation," Journal of Machine Learning Technologies 2(1), 2011

Current:

  • Chicco & Jurman, "The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy," BMC Genomics 21, 2020
  • Sokolova & Lapalme, "A systematic analysis of performance measures for classification tasks," Information Processing and Management 45(4), 2009


Last reviewed: April 26, 2026
