

Feature Importance and Interpretability

Methods for attributing model predictions to input features: permutation importance, SHAP values, LIME, partial dependence, and why none of these imply causality.


Why This Matters

You built a model that predicts well. Now someone asks: "which features matter?" This question arises in every applied ML project, from regulatory compliance (finance, healthcare) to debugging models to scientific discovery. The methods here answer variants of this question, but each defines "importance" differently. Confusing these definitions leads to incorrect conclusions about what drives model behavior and, worse, incorrect causal claims.

Permutation Importance

Definition

Permutation Importance

The permutation importance of feature $j$ is the decrease in model performance when the values of feature $j$ are randomly shuffled across the dataset, breaking the association between feature $j$ and the target:

$$\text{PI}_j = \text{Score}(f, X, y) - \text{Score}(f, X^{(\pi_j)}, y)$$

where $X^{(\pi_j)}$ is $X$ with column $j$ permuted and Score is a performance metric (accuracy, $R^2$, etc.).

Permutation importance has a clean interpretation: it measures how much the model relies on the marginal distribution of feature $j$ for its predictions. It works for any model and any metric. Compute it on the test set, not the training set, to avoid reflecting memorization.

Important subtlety: the definition above is marginal permutation importance. When features are correlated, shuffling one column can create impossible or very low-density feature combinations. The resulting score drop answers "how much does the model rely on this feature after I break the joint distribution?" not "how much unique information does this feature contribute while preserving dependence structure?" Grouped permutation or conditional permutation address different questions and can give materially different answers.
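
As a concrete sketch, the snippet below computes marginal permutation importance on a held-out test set using scikit-learn's permutation_importance. The dataset, model, and hyperparameters are illustrative placeholders, not a recommended configuration.

```python
# Sketch: marginal permutation importance on a held-out test set.
# Dataset, model, and hyperparameters are illustrative placeholders.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

# Shuffle each column several times and record the drop in R^2 on the test set.
result = permutation_importance(model, X_test, y_test, scoring="r2",
                                n_repeats=20, random_state=0)
for j in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {j}: {result.importances_mean[j]:.3f} "
          f"+/- {result.importances_std[j]:.3f}")
```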

SHAP Values

Definition

Shapley Value

For a prediction $f(x)$ with features $\{1, \ldots, d\}$, the Shapley value of feature $j$ is:

$$\phi_j(x) = \sum_{S \subseteq \{1,\ldots,d\} \setminus \{j\}} \frac{|S|!\,(d - |S| - 1)!}{d!} \left[v(S \cup \{j\}) - v(S)\right]$$

where $v(S)$ is the expected model output when only features in $S$ are observed and remaining features are marginalized out. The Shapley value attributes the prediction to each feature by averaging the marginal contribution of that feature across all possible orderings.
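
For a handful of features, the formula can be evaluated directly. The sketch below enumerates all coalitions for a toy model and uses an interventional-style value function that averages "missing" features over a background sample; the model and data are placeholders, and the final check relies on the known closed form for linear models.

```python
# Sketch: exact Shapley values by enumerating all 2^d coalitions (small d only).
# The value function marginalizes "missing" features over a background sample.
from itertools import combinations
from math import factorial
import numpy as np

def shapley_values(predict, x, background):
    d = x.shape[0]

    def v(S):
        # Expected prediction when features in S are fixed to x's values
        # and the remaining columns are taken from the background rows.
        Z = background.copy()
        Z[:, list(S)] = x[list(S)]
        return predict(Z).mean()

    phi = np.zeros(d)
    for j in range(d):
        others = [k for k in range(d) if k != j]
        for r in range(len(others) + 1):
            for S in combinations(others, r):
                w = factorial(len(S)) * factorial(d - len(S) - 1) / factorial(d)
                phi[j] += w * (v(S + (j,)) - v(S))
    return phi

# Toy check on a linear model, where phi_j should equal beta_j * (x_j - mean_j).
rng = np.random.default_rng(0)
beta = np.array([2.0, -1.0, 0.5])
background = rng.normal(size=(500, 3))
x = np.array([1.0, 2.0, -1.0])
predict = lambda Z: Z @ beta
print(shapley_values(predict, x, background))
print(beta * (x - background.mean(axis=0)))  # should match closely
```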

Theorem

Shapley Value Uniqueness

Statement

The Shapley value is the unique attribution method satisfying:

  1. Efficiency: $\sum_{j=1}^d \phi_j(x) = f(x) - \mathbb{E}[f(X)]$
  2. Symmetry: if features $j$ and $k$ contribute equally to all coalitions, $\phi_j = \phi_k$
  3. Linearity: for combined games $v = v_1 + v_2$, $\phi_j(v) = \phi_j(v_1) + \phi_j(v_2)$
  4. Null player: if feature $j$ never changes any coalition value, $\phi_j = 0$

No other attribution method satisfies all four axioms simultaneously.

Intuition

Shapley values decompose the total prediction (minus baseline) into additive contributions, one per feature, in the only way that is fair, consistent, and complete. "Fair" means symmetric features get equal credit; "complete" means all credit is allocated.

Proof Sketch

Existence: the formula defines values satisfying all axioms (verify each). Uniqueness: assume two solutions $\phi$ and $\phi'$ both satisfy the axioms. By linearity, it suffices to prove uniqueness for unanimity games $v_S(T) = 1$ if $S \subseteq T$, else $0$. For unanimity games, efficiency, symmetry, and the null-player axiom force $\phi_j = 1/|S|$ for $j \in S$ and $\phi_j = 0$ otherwise. Since any game is a linear combination of unanimity games, linearity extends uniqueness to all games.

Why It Matters

SHAP (SHapley Additive exPlanations) uses Shapley values for ML model explanations. The uniqueness theorem means that if you accept the four axioms as reasonable fairness requirements, there is exactly one way to attribute predictions. This gives SHAP a theoretical grounding that permutation importance and LIME lack.

Failure Mode

Computing exact Shapley values requires $2^d$ coalition evaluations. For models with hundreds of features, this is intractable. Practical SHAP implementations use approximations: KernelSHAP (sampling-based) or TreeSHAP (exact for tree models, $O(TLD^2)$ per prediction). These approximations can introduce errors, and KernelSHAP in particular may not converge with insufficient samples.

The catch is that the theorem is a statement about a fixed cooperative game $v(S)$. In machine learning, the hard part is deciding what $v(S)$ means when features are "missing." Interventional SHAP breaks feature dependence by integrating absent features against a product-like reference distribution; conditional SHAP conditions on the observed features and keeps dependence structure. Both satisfy the Shapley axioms for their own value functions, so they can disagree without either implementation being "incorrect."
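
One common workflow uses the shap package's TreeExplainer, which implements TreeSHAP for tree ensembles. The sketch below is indicative only: the model and data are placeholders, and argument names and return shapes vary somewhat across shap versions.

```python
# Sketch: TreeSHAP explanations for a tree ensemble (API details vary by shap version).
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=1000, n_features=6, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# feature_perturbation="interventional" breaks feature dependence against a
# background dataset; "tree_path_dependent" uses the trees' own cover statistics instead.
explainer = shap.TreeExplainer(model, data=X[:100],
                               feature_perturbation="interventional")
shap_values = explainer.shap_values(X[:5])   # one attribution per (row, feature)
print(shap_values.shape)                     # expected: (5, 6)
# Efficiency: shap_values[i].sum() + explainer.expected_value should
# recover model.predict for row i (up to approximation error).
```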

LIME

Proposition

LIME Local Fidelity

Statement

LIME finds an interpretable model $g$ that approximates $f$ locally:

$$g^* = \arg\min_{g \in G} \sum_{z' \in \mathcal{Z}} \pi_x(z') \left(f(z') - g(z')\right)^2 + \Omega(g)$$

where $\pi_x(z')$ is a kernel weighting proximity to $x$, $\mathcal{Z}$ is a set of perturbed samples around $x$, and $\Omega(g)$ is a complexity penalty. For linear $g$, the coefficients serve as local feature importances.

Intuition

LIME asks: "near this specific input, which features does the model rely on?" It answers by fitting a simple linear model to the model's behavior in a local neighborhood. The linear coefficients tell you the local importance of each feature for this particular prediction.

Proof Sketch

The objective is weighted least squares with regularization. For linear $g$, this has a closed-form solution: the ridge regression coefficients using the kernel-weighted design matrix of perturbed samples.
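
A minimal, self-contained sketch of this idea for tabular data: sample perturbations around $x$, weight them with an exponential kernel, and solve the weighted ridge problem in closed form. The perturbation scheme, kernel, and function name are simplified illustrative choices, not the lime package's defaults.

```python
# Sketch: a stripped-down LIME for tabular data. The perturbation scheme and
# kernel are simplified illustrative choices, not the lime package's defaults.
import numpy as np

def lime_local_coefficients(predict, x, X_ref, n_samples=2000,
                            kernel_width=0.75, alpha=1.0, seed=0):
    rng = np.random.default_rng(seed)
    scale = X_ref.std(axis=0) + 1e-12
    # Perturb each feature independently around x (ignores correlations, as noted above).
    Z = x + rng.normal(scale=scale, size=(n_samples, x.shape[0]))
    # Exponential kernel on scaled distance to x: nearby samples get more weight.
    dist = np.linalg.norm((Z - x) / scale, axis=1)
    w = np.exp(-(dist ** 2) / kernel_width ** 2)
    # Weighted ridge regression: closed-form solution of the LIME objective for linear g.
    D = np.column_stack([np.ones(n_samples), (Z - x) / scale])
    A = (D * w[:, None]).T @ D + alpha * np.eye(D.shape[1])
    b = (D * w[:, None]).T @ predict(Z)
    coef = np.linalg.solve(A, b)
    return coef[1:]   # local importance per (standardized) feature; coef[0] is the intercept

# Usage (model and data are placeholders):
# local_imp = lime_local_coefficients(model.predict, X_test[0], X_train)
```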

Why It Matters

LIME is model-agnostic and produces per-instance explanations. Unlike global methods (permutation importance), it captures local behavior: a feature might be important for one prediction and irrelevant for another.

Failure Mode

LIME depends heavily on the perturbation distribution and kernel width. For tabular data, perturbing features independently ignores feature correlations, producing unrealistic samples. For images, the superpixel segmentation determines what "features" LIME can identify. Different kernel widths produce different explanations for the same input. There is no principled way to choose these hyperparameters.

Partial Dependence Plots

A partial dependence plot (PDP) shows the marginal effect of a feature $x_j$ on the model output, averaging over all other features:

$$\hat{f}_j(x_j) = \frac{1}{n}\sum_{i=1}^{n} f(x_j, x_{-j}^{(i)})$$

PDPs are simple and global, but their interpretation as a marginal effect implicitly relies on $x_j$ being approximately independent of $X_{-j}$. The formula itself just averages predictions over the empirical marginal of $X_{-j}$ regardless of dependence. The problem under correlation is that this average evaluates the model at off-manifold feature combinations (e.g., height = 180 cm with age = 3), so the resulting curve can be misleading even though the calculation is unambiguous.

Individual Conditional Expectation (ICE) plots show the same curve per instance rather than averaged, revealing heterogeneity in feature effects.
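
Both quantities can be computed directly from the formula above, as in the sketch below; the model, feature index, and grid are placeholders. scikit-learn's PartialDependenceDisplay produces an equivalent plot, but the explicit loop makes the averaging (and the off-manifold substitution) visible.

```python
# Sketch: ICE curves and their average (the PDP) computed directly from the formula.
import numpy as np

def ice_and_pdp(predict, X, j, grid):
    """ICE: one curve per row of X; PDP: their pointwise average."""
    ice = np.empty((X.shape[0], len(grid)))
    for g, xj in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, j] = xj            # set feature j to the grid value for every row
        ice[:, g] = predict(X_mod)  # note: may create off-manifold feature combinations
    return ice, ice.mean(axis=0)

# Usage (model and X_test are placeholders):
# grid = np.linspace(X_test[:, 2].min(), X_test[:, 2].max(), 30)
# ice, pdp = ice_and_pdp(model.predict, X_test, j=2, grid=grid)
```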

Accumulated Local Effects

Accumulated Local Effects (ALE) plots are often a better default than PDPs when predictors are correlated. Instead of averaging model predictions over the full marginal distribution of the other variables, ALE averages local changes in the prediction within bins of $x_j$, then accumulates those local effects:

$$\hat f_{j,\mathrm{ALE}}(x) = \int_{z_0}^{x} \mathbb{E}\!\left[\frac{\partial f(z, X_{-j})}{\partial z} \,\middle|\, X_j = z\right] dz - c$$

in the population idealization, with finite-difference bin estimates in practice.

The key advantage is support-awareness: ALE only compares nearby values that occur where the data actually live, so it avoids many of the impossible feature combinations that make PDPs misleading under correlation. ALE is still not causal, however; it is a better descriptive plot of model behavior, not an intervention estimate.
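
A rough finite-difference sketch of first-order ALE for a numeric feature, under simplifying assumptions (quantile bins, within-bin differences, count-weighted centering); dedicated packages handle ties, sparse bins, and categorical features more carefully.

```python
# Sketch: first-order ALE via finite differences in quantile bins (simplified).
import numpy as np

def ale_1d(predict, X, j, n_bins=20):
    # Bin edges at quantiles, so each bin covers a region where data actually live.
    edges = np.quantile(X[:, j], np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, X[:, j], side="right") - 1, 0, n_bins - 1)

    local_effects = np.zeros(n_bins)
    for b in range(n_bins):
        rows = X[idx == b]
        if len(rows) == 0:
            continue
        lo, hi = rows.copy(), rows.copy()
        lo[:, j], hi[:, j] = edges[b], edges[b + 1]
        # Average prediction change across the bin, using only points inside the bin.
        local_effects[b] = (predict(hi) - predict(lo)).mean()

    ale = np.concatenate([[0.0], np.cumsum(local_effects)])   # accumulate local effects
    counts = np.bincount(idx, minlength=n_bins)
    centers = 0.5 * (ale[:-1] + ale[1:])
    ale -= np.average(centers, weights=counts)                 # center the curve on the data
    return edges, ale
```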

Built-in Importance

Some models provide importance measures directly:

  • Tree impurity importance: total decrease in Gini impurity or entropy from splits on feature $j$. Biased toward high-cardinality features (see the sketch after this list).
  • Coefficient magnitude: in linear models, $|\beta_j|$ (after feature standardization). Only meaningful when features have comparable scales.
  • Attention weights: in transformers, often misinterpreted as importance. Attention is a computational mechanism, not an explanation of which inputs caused the output. Jain and Wallace (2019) showed attention frequently does not correlate with gradient-based importance.
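
The high-cardinality bias of impurity importance is easy to demonstrate: add a pure-noise feature with many unique values and compare impurity importance to test-set permutation importance. The setup below is a contrived illustration, not a benchmark.

```python
# Sketch: impurity importance vs. permutation importance with a high-cardinality noise feature.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
signal = rng.integers(0, 2, size=n)              # genuinely informative binary feature
noise_id = rng.uniform(size=n)                   # pure noise with ~n unique values
y = (signal + (rng.uniform(size=n) < 0.2)) % 2   # target driven by `signal`, plus label noise
X = np.column_stack([signal, noise_id])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

print("impurity:", model.feature_importances_)            # often inflates the noise feature
perm = permutation_importance(model, X_te, y_te, n_repeats=20, random_state=0)
print("permutation (test):", perm.importances_mean)       # noise feature near zero
```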

The Causal Trap

Watch Out

Feature importance is not causal importance

All methods on this page answer: "what features does the model use?" None answer: "what features cause the outcome?" Permutation importance tells you the model relies on feature $j$. It does not tell you that intervening on feature $j$ would change the outcome in the real world. A model predicting hospital mortality might rely on "has palliative care order" because that feature correlates with severity, not because palliative care causes death. Causal claims require causal methodology (randomized experiments, instrumental variables, do-calculus), not feature importance methods.

Watch Out

Global importance can hide local behavior

A feature with low global permutation importance might be the most important feature for a specific subgroup. Average importance masks heterogeneity. SHAP and LIME provide per-instance attributions that can reveal this, but aggregating them back to global summaries loses the same information.

Watch Out

Correlated features split importance unpredictably

If features $j$ and $k$ are highly correlated, permutation importance will underestimate both (shuffling one does not remove the information because the other is still intact). SHAP distributes importance between them according to the Shapley axioms, but the split may not match intuition. Neither method tells you which correlated feature is "truly important" because that is a causal question.

Watch Out

SHAP is unique only after you choose the missing-feature semantics

The Shapley uniqueness theorem does not say there is one universally correct SHAP explanation for a prediction. It says that for a chosen game $v(S)$ there is one additive attribution satisfying the axioms. In ML, different choices of $v(S)$ encode different notions of "feature absent": break dependence (interventional), preserve dependence (conditional), or use a particular background dataset. These choices can change the explanation substantially, especially when features are strongly correlated.

Summary

  • Permutation importance: global, model-agnostic, measures performance degradation when a feature's association with the target is broken
  • SHAP: per-instance, grounded in Shapley axioms (the unique fair attribution), but exponentially expensive to compute exactly
  • LIME: per-instance, local linear approximation, but sensitive to perturbation distribution and kernel width
  • PDPs: global marginal effect, but can be badly misleading under correlated predictors
  • ALE: support-aware alternative to PDPs that is usually safer under feature dependence
  • None of these methods imply causality. Feature importance != causal importance.

Exercises

ExerciseCore

Problem

You compute permutation importance for a random forest with two highly correlated features ($\rho = 0.95$): temperature in Celsius and temperature in Fahrenheit. Both receive low importance scores. Explain why, and suggest an approach to address this.

ExerciseAdvanced

Problem

Prove that for a linear model $f(x) = \beta_0 + \sum_j \beta_j x_j$ with independent features, the SHAP value of feature $j$ for input $x$ is $\phi_j(x) = \beta_j(x_j - \mathbb{E}[X_j])$. Verify the efficiency axiom.

References

Canonical:

  • Shapley, A Value for N-Person Games (1953). The cooperative-game foundation behind SHAP-style additive attributions.
  • Breiman, Random Forests (Machine Learning 45, 2001), Section 10. Classical source for permutation-style variable importance in forests.
  • Ribeiro, Singh, and Guestrin, Why Should I Trust You? Explaining the Predictions of Any Classifier (KDD 2016), Section 3. The original LIME paper.
  • Goldstein, Kapelner, Bleich, and Pitkin, Peeking Inside the Black Box: Visualizing Statistical Learning with Plots of Individual Conditional Expectation (JCGS 24, 2015). The ICE-paper source.

Current:

  • Lundberg and Lee, A Unified Approach to Interpreting Model Predictions (NeurIPS 2017), Sections 2-3. The SHAP paper that connects additive feature attribution to Shapley values.
  • Fisher, Rudin, and Dominici, All Models are Wrong, but Many are Useful: Variable Importance for Black-Box, Proprietary, or Misspecified Prediction Models (JMLR 20, 2019). Clean modern framing of model reliance / permutation-style importance.
  • Aas, Jullum, and Løland, Explaining Individual Predictions When Features Are Dependent: More Accurate Approximations to Shapley Values (Artificial Intelligence 298, 2021). The key reference on conditional-vs-interventional SHAP under dependence.
  • Apley and Zhu, Visualizing the Effects of Predictor Variables in Black Box Supervised Learning Models (JRSS B 82, 2020). The ALE paper and the standard critique of PDP under dependence.
  • Jain and Wallace, Attention is not Explanation (NAACL 2019). Why attention weights should not be casually read as feature importance.
