
Methodology

Confusion Matrix: MCC, Kappa, and Cost-Sensitive Evaluation

Beyond binary classification: multi-class confusion matrices, micro vs macro averaging, Matthews correlation coefficient, Cohen's kappa, cost-sensitive evaluation, and how to read a confusion matrix without getting the axes wrong.


Why This Matters

[Figure: a 2×2 confusion matrix with TP = 85, FN = 15, FP = 10, TN = 890. Precision = TP/(TP+FP) = 0.895; Recall = TP/(TP+FN) = 0.850; F1 = 2PR/(P+R) = 0.872; Accuracy = (TP+TN)/N = 0.975.] Accuracy is 97.5% but only 85% of actual positives are caught (recall). Precision and recall tell the real story.

Accuracy is a single number that hides critical failure modes. A model with 99% accuracy on a dataset where 99% of examples belong to one class has learned nothing. The confusion matrix is the complete record of what a classifier got right and wrong. Every useful classification metric (precision, recall, F1, MCC, kappa) is a function of the confusion matrix. If you cannot read a confusion matrix, you cannot evaluate a classifier.

Mental Model

A confusion matrix is a table. Rows correspond to true labels. Columns correspond to predicted labels. Entry $(i, j)$ counts how many examples with true label $i$ received predicted label $j$. Diagonal entries are correct predictions. Off-diagonal entries are errors. The pattern of errors tells you how the model fails, not just how often.

Convention warning: some libraries (notably scikit-learn) use rows = true, columns = predicted. Others transpose this. Always check.
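A quick way to verify the convention is a tiny example where you already know the labels (a minimal sketch using scikit-learn; the label arrays are invented for illustration):

```python
from sklearn.metrics import confusion_matrix

# scikit-learn convention: rows = true labels, columns = predictions.
y_true = [0, 0, 0, 1]   # three actual negatives, one actual positive
y_pred = [0, 1, 1, 1]   # the model flags three examples as positive

C = confusion_matrix(y_true, y_pred)
# Row 0 (true class 0): [1, 2] -> 1 correct, 2 wrongly predicted as class 1.
# Row 1 (true class 1): [0, 1] -> the single positive is caught.
print(C)
```

If your library put predictions in rows instead, the off-diagonal counts would be transposed, which is exactly what this check exposes.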

Formal Setup

Definition

Confusion matrix

For a $K$-class classification problem with $n$ examples, the confusion matrix $C$ has entries:

$$C_{ij} = |\{x : \text{true}(x) = i \text{ and } \text{pred}(x) = j\}|$$

The row sums $\sum_j C_{ij}$ give the number of examples in each true class. The column sums $\sum_i C_{ij}$ give the number of predictions for each class.

Per-Class Metrics

For class $k$ in a multi-class problem, define the one-vs-rest quantities:

  • True positives: $\text{TP}_k = C_{kk}$
  • False positives: $\text{FP}_k = \sum_{i \neq k} C_{ik}$ (column $k$ minus diagonal)
  • False negatives: $\text{FN}_k = \sum_{j \neq k} C_{kj}$ (row $k$ minus diagonal)
  • True negatives: $\text{TN}_k = n - \text{TP}_k - \text{FP}_k - \text{FN}_k$

Then:

$$\text{Precision}_k = \frac{\text{TP}_k}{\text{TP}_k + \text{FP}_k}, \quad \text{Recall}_k = \frac{\text{TP}_k}{\text{TP}_k + \text{FN}_k}$$
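The one-vs-rest quantities fall out of the matrix with a few NumPy reductions (a sketch on an invented 3-class matrix):

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows = true, columns = predicted).
C = np.array([[50,  3,  2],
              [ 4, 40,  6],
              [ 1,  5, 39]])
n = C.sum()

tp = np.diag(C)              # TP_k = C_kk
fp = C.sum(axis=0) - tp      # column k minus the diagonal
fn = C.sum(axis=1) - tp      # row k minus the diagonal
tn = n - tp - fp - fn        # everything else

precision = tp / (tp + fp)
recall = tp / (tp + fn)
```

Note that `tp + fn` recovers the row sums (true class sizes) and `tp + fp` the column sums (prediction counts), matching the definitions above.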

Micro vs Macro Averaging

Definition

Macro averaging

Compute the metric for each class independently, then average:

$$\text{Macro-Precision} = \frac{1}{K} \sum_{k=1}^{K} \text{Precision}_k$$

Macro averaging weights all classes equally regardless of their size. A rare class with 10 examples counts as much as a common class with 10,000 examples.

Definition

Micro averaging

Pool all TP, FP, FN across classes, then compute the metric once:

$$\text{Micro-Precision} = \frac{\sum_k \text{TP}_k}{\sum_k (\text{TP}_k + \text{FP}_k)}$$

For single-label classification, micro-precision = micro-recall = micro-F1 = accuracy. Micro averaging is dominated by the largest classes.

The choice matters when classes are imbalanced. Macro averaging surfaces failures on rare classes. Micro averaging reflects overall performance weighted by class frequency.
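The gap between the two averages shows up immediately on imbalanced data (a sketch with invented labels where the rare class is always missed):

```python
from sklearn.metrics import precision_score

# Invented imbalanced 3-class labels: class 2 is rare and never predicted.
y_true = [0]*90 + [1]*8 + [2]*2
y_pred = [0]*90 + [0]*4 + [1]*4 + [0]*2

macro = precision_score(y_true, y_pred, average='macro', zero_division=0)
micro = precision_score(y_true, y_pred, average='micro')
# micro = 0.94 (dominated by class 0); macro ~ 0.65 (exposes the rare-class failure)
```

`zero_division=0` assigns precision 0 to class 2, which receives no predictions at all; without it scikit-learn emits an undefined-metric warning.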

Matthews Correlation Coefficient

Proposition

MCC as Geometric Mean of Regression Coefficients

Statement

The Matthews Correlation Coefficient for binary classification is:

$$\text{MCC} = \frac{\text{TP} \cdot \text{TN} - \text{FP} \cdot \text{FN}}{\sqrt{(\text{TP}+\text{FP})(\text{TP}+\text{FN})(\text{TN}+\text{FP})(\text{TN}+\text{FN})}}$$

This equals the Pearson correlation between the true binary labels and the predicted binary labels. It is also the geometric mean of the two regression coefficients (predicting true from predicted, and predicted from true): $\text{MCC} = \sqrt{\text{Informedness} \times \text{Markedness}}$.

Intuition

MCC uses all four quadrants of the confusion matrix. It returns +1 for perfect prediction, 0 for random prediction, and -1 for total disagreement. Unlike F1, which ignores true negatives, MCC penalizes models that succeed on one class by failing on another.

Proof Sketch

Treat the $2 \times 2$ confusion matrix as a contingency table. The Pearson correlation for binary variables reduces to the phi coefficient, which is exactly the MCC formula. The geometric mean identity follows from Informedness $= \text{TPR} - \text{FPR}$ and Markedness $= \text{PPV} - \text{FOR}$, whose product equals $\text{MCC}^2$ by algebraic manipulation of the confusion matrix entries.

Why It Matters

MCC is the single best metric for binary classification on imbalanced datasets. Accuracy can be high despite useless predictions (predict the majority class always). F1 ignores true negatives. MCC accounts for all four cells and is invariant to which class is labeled positive.

Failure Mode

MCC is undefined when any row or column of the confusion matrix sums to zero (division by zero). This happens when the model predicts only one class or when one class is absent from the test set. In multi-class settings, the extension of MCC (the $R_K$ coefficient) is less commonly used and harder to interpret.
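Both the direct formula and the zero-denominator failure mode can be checked numerically (a sketch; the labels are invented, and scikit-learn's `matthews_corrcoef` adopts the convention of returning 0.0 when the denominator vanishes, e.g. for a constant predictor):

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

def mcc_from_counts(tp, tn, fp, fn):
    """Binary MCC straight from the four confusion-matrix cells."""
    num = tp * tn - fp * fn
    den = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return num / den if den > 0 else 0.0  # convention: 0 when undefined

# Majority-class baseline on a 99:1 split: 99% accuracy, zero MCC.
y_true = [0]*99 + [1]
y_pred = [0]*100
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
mcc = matthews_corrcoef(y_true, y_pred)   # 0.0: one column sum is zero
```

The helper `mcc_from_counts` is a hypothetical name introduced here for illustration; it returns 1.0 for perfect prediction and 0.0 in the degenerate case, mirroring the library behavior.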

Cohen's Kappa

Definition

Cohen's kappa

Cohen's kappa measures agreement between predicted and true labels corrected for chance:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

where $p_o$ is observed agreement (accuracy) and $p_e$ is expected agreement under independence:

$$p_e = \sum_{k=1}^{K} \frac{(\sum_j C_{kj})(\sum_i C_{ik})}{n^2}$$

$\kappa = 1$ means perfect agreement. $\kappa = 0$ means no better than random assignment that preserves the marginal distributions.

Kappa is useful when the class distribution is skewed and accuracy is misleading. It adjusts for the fact that even a random classifier achieves nonzero accuracy when classes are imbalanced.
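The definition can be exercised end to end on a small invented example and cross-checked against scikit-learn (a sketch):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Skewed binary problem: 90 negatives, 10 positives.
y_true = [0]*90 + [1]*10
y_pred = [0]*85 + [1]*5 + [0]*5 + [1]*5

# Manual computation from the confusion matrix (rows = true, cols = predicted).
C = np.array([[85, 5],
              [ 5, 5]])
n = C.sum()
p_o = np.trace(C) / n                          # observed agreement = 0.90
p_e = (C.sum(axis=1) @ C.sum(axis=0)) / n**2   # chance agreement = 0.82
kappa = (p_o - p_e) / (1 - p_e)                # (0.90 - 0.82)/0.18 = 4/9
```

Despite 90% raw agreement, kappa is only about 0.44 because most of that agreement is what the skewed marginals would produce by chance.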

Cost-Sensitive Evaluation

Not all errors are equally bad. In medical diagnosis, a false negative (missing a disease) is typically worse than a false positive (unnecessary follow-up). A cost matrix $W \in \mathbb{R}^{K \times K}$ assigns a cost $W_{ij}$ to predicting class $j$ when the true class is $i$. Correct predictions have $W_{ii} = 0$.

The total cost is:

$$\text{Cost} = \sum_{i,j} W_{ij} \cdot C_{ij}$$

Accuracy implicitly uses $W_{ij} = 1$ for $i \neq j$ and $W_{ii} = 0$. Making the cost matrix explicit forces you to state what errors matter.
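The total cost is a single element-wise product and sum (a sketch with an invented cost matrix that charges a false negative 50 times a false positive):

```python
import numpy as np

# Rows = true class, columns = predicted; index 0 = negative, 1 = positive.
C = np.array([[9700, 200],    # [true negatives, false positives]
              [  20,  80]])   # [false negatives, true positives]
W = np.array([[ 0,  1],       # a false positive costs 1
              [50,  0]])      # a false negative costs 50

total_cost = (W * C).sum()    # 200*1 + 20*50 = 1200
```

Under this cost matrix the 20 missed cases dominate the 200 false alarms, reversing the impression accuracy gives.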

Canonical Examples

Example

Imbalanced binary classification

A disease screening test on 10,000 patients (100 positive, 9,900 negative):

          Pred +   Pred -
True +        80       20
True -       200    9,700

Accuracy: $9{,}780/10{,}000 = 97.8\%$. Precision: $80/280 = 28.6\%$. Recall: $80/100 = 80\%$. MCC: $(80 \cdot 9700 - 200 \cdot 20)/\sqrt{280 \cdot 100 \cdot 9900 \cdot 9720} \approx 0.47$.

The model detects 80% of cases but most positive predictions are wrong. Accuracy alone makes this look good. MCC of 0.47 correctly indicates moderate performance.
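These numbers can be reproduced directly from the four counts (a quick check in NumPy):

```python
import numpy as np

# The screening example: recompute each reported metric from the counts.
tp, fn, fp, tn = 80, 20, 200, 9700
n = tp + fn + fp + tn

accuracy = (tp + tn) / n        # 0.978
precision = tp / (tp + fp)      # ~0.286
recall = tp / (tp + fn)         # 0.80
mcc = (tp*tn - fp*fn) / np.sqrt(float((tp + fp)*(tp + fn)*(tn + fp)*(tn + fn)))
```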

Common Confusions

Watch Out

Rows vs columns convention varies across libraries

scikit-learn: rows = true labels, columns = predictions. Some textbooks and R packages use the transpose. TensorFlow's tf.math.confusion_matrix follows scikit-learn. Always verify by checking a known example where you know the true labels and predictions.

Watch Out

F1 and accuracy can both be misleading on imbalanced data

F1 ignores true negatives entirely. A model that predicts "positive" for everything has recall = 1 and nonzero F1, even though it is useless. MCC is more reliable because it uses all four cells. On balanced datasets, F1 and MCC give similar rankings.
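The all-positive degenerate model makes the contrast concrete (a sketch with invented labels; as noted above, scikit-learn returns 0.0 for MCC when the denominator is zero):

```python
from sklearn.metrics import f1_score, matthews_corrcoef

# Degenerate "always positive" model on a 10:90 imbalanced set.
y_true = [1]*10 + [0]*90
y_pred = [1]*100

f1 = f1_score(y_true, y_pred)            # 2/11 ~ 0.18: nonzero despite uselessness
mcc = matthews_corrcoef(y_true, y_pred)  # 0.0: no discrimination at all
```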

Watch Out

Multi-class MCC is not the average of binary MCCs

Extending MCC to $K > 2$ classes requires the generalized formula using the entire $K \times K$ confusion matrix, not averaging $K$ one-vs-rest binary MCCs. The generalization treats the confusion matrix as a contingency table and computes a multivariate correlation.

Multi-class Extension and Weighted Averaging

For $K > 2$ classes, each of the binary metrics generalizes via three averaging strategies: macro, micro, and weighted. The first two are defined above; weighted averaging is a middle ground.

Definition

Weighted averaging

Compute the metric for each class, then take a weighted average where the weight for class kk is proportional to the number of true examples in that class:

$$\text{Weighted-Precision} = \frac{\sum_k |\{i : \text{true}(i) = k\}| \cdot \text{Precision}_k}{n}$$

Weighted averaging gives proportionally more influence to large classes while still distinguishing per-class performance. It is generally not equal to micro-averaging. In single-label multi-class classification, micro-precision = micro-recall = micro-F1 = accuracy, while weighted-precision averages per-class precisions by true-class frequency and need not equal accuracy. Weighted-recall, however, does coincide with accuracy in this setting (each example contributes to exactly one true class), so weighted-recall = micro-recall = accuracy while weighted-precision and weighted-F1 typically differ from their micro counterparts.
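The weighted-recall = accuracy identity, and the fact that weighted-precision escapes it, are easy to confirm numerically (a sketch with invented labels; class sizes 70/20/10):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Invented 3-class labels with unequal class sizes.
y_true = [0]*70 + [1]*20 + [2]*10
y_pred = [0]*60 + [1]*10 + [1]*15 + [2]*5 + [2]*8 + [0]*2

acc = accuracy_score(y_true, y_pred)                              # 0.83
w_recall = recall_score(y_true, y_pred, average='weighted')       # equals acc
w_precision = precision_score(y_true, y_pred, average='weighted') # does not
```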

The three strategies answer different questions. Macro asks: how does the model perform on average across classes, treating each class as equally important? Weighted asks: how does the model perform on average, weighting by class frequency? Micro asks: what fraction of all predictions are correct?

For logistic regression applied to a $K$-class problem via a softmax output layer, the per-class precision and recall measure the model's discrimination quality for each output node. Calibration and precision/recall are independent properties: calibration is a property of the predicted probabilities (do they match conditional frequencies?), while precision and recall are properties of the thresholded hard predictions (do they discriminate the classes?). A model can be perfectly calibrated yet have low precision/recall on a hard problem (the true posteriors are simply not concentrated near 0 or 1), and a model can have high precision/recall yet be badly miscalibrated (e.g., logits that rank examples correctly but assign probabilities that are systematically too sharp or too flat). For a thorough audit, evaluate calibration separately (reliability diagrams, ECE, Brier decompositions; see maximum likelihood estimation and proper scoring rules) from per-class precision and recall.

Multi-class MCC

The generalization of MCC to $K > 2$ classes uses the entire confusion matrix directly. With $t_k = \sum_j C_{kj}$ (number of true examples of class $k$), $p_k = \sum_i C_{ik}$ (number of predictions of class $k$), and $c = \sum_k C_{kk}$ (total correct), the $R_K$ coefficient (Gorodkin 2004) is:

$$R_K = \frac{c \cdot n - \sum_k t_k p_k}{\sqrt{n^2 - \sum_k p_k^2} \cdot \sqrt{n^2 - \sum_k t_k^2}}$$

For $K = 2$ this reduces to the binary MCC formula. A simpler alternative sometimes reported is the average of $K$ one-vs-rest binary MCCs; it is interpretable but does not equal $R_K$.

The key property that makes MCC attractive for multi-class settings is the same as in the binary case: it is invariant to permutation of class labels and degrades gracefully when one class dominates.
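The marginal-count form of $R_K$ can be verified against scikit-learn, whose `matthews_corrcoef` handles the multi-class case directly (a sketch on invented 3-class labels):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, matthews_corrcoef

# Invented 3-class labels: class 2 is classified perfectly, 0 and 1 partly confused.
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 0, 2, 2, 2]

C = confusion_matrix(y_true, y_pred)
n = C.sum()
t = C.sum(axis=1)          # true counts per class
p = C.sum(axis=0)          # predicted counts per class
c = np.trace(C)            # total correct predictions

r_k = (c * n - t @ p) / (np.sqrt(n**2 - (p**2).sum()) *
                         np.sqrt(n**2 - (t**2).sum()))
# matches matthews_corrcoef(y_true, y_pred)
```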

Summary

  • The confusion matrix is the complete classification record; every metric derives from it
  • Macro averaging weights classes equally; micro averaging weights by class frequency; weighted averaging is proportional to class size
  • MCC uses all four quadrants and is the best single metric for imbalanced binary classification
  • Cohen's kappa corrects for chance agreement
  • Cost matrices make error asymmetry explicit
  • For multi-class problems, the choice among macro/micro/weighted must match the question being answered
  • Always check whether your library puts true labels in rows or columns

Exercises

Exercise (Core)

Problem

A 3-class confusion matrix is:

          Pred A   Pred B   Pred C
True A        40        5        5
True B        10       30       10
True C         0        5       45

Compute the macro-averaged precision.

Exercise (Advanced)

Problem

Prove that for binary classification, micro-averaged precision equals accuracy.

References

Canonical:

  • Cohen, "A Coefficient of Agreement for Nominal Scales," Educational and Psychological Measurement 20(1), 1960
  • Matthews, "Comparison of the Predicted and Observed Secondary Structure of T4 Phage Lysozyme," BBA 405(2), 1975
  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Section 7.2 (metrics for classification)
  • Powers, "Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness and Correlation," Journal of Machine Learning Technologies 2(1), 2011

Current:

  • Chicco & Jurman, "The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy," BMC Genomics 21, 2020
  • Sokolova & Lapalme, "A systematic analysis of performance measures for classification tasks," Information Processing and Management 45(4), 2009


Last reviewed: April 26, 2026
