

Principal Component Analysis

Dimensionality reduction via variance maximization: PCA as eigendecomposition of the covariance matrix, PCA as truncated SVD of the centered data matrix, reconstruction error, and when sample PCA works.

Core · Tier 1 · Stable · Core spine · ~55 min

Why This Matters

[Figure: data scatter with principal component axes PC1 (78%) and PC2 (15%), alongside a bar chart of variance explained per component (78%, 15%, 4%, 2%, 1%) with the cumulative curve.]

PCA is the most widely used dimensionality reduction technique in all of data science. It appears everywhere: preprocessing for ML pipelines, visualization (project to 2D), noise reduction, feature extraction, genomics (population structure), finance (factor models), and image compression. Understanding PCA requires understanding the connection between three mathematical objects: the covariance matrix, its eigendecomposition, and the SVD of the data matrix — and knowing exactly how they relate.

Mental Model

You have $n$ data points in $\mathbb{R}^d$ and want to find a low-dimensional subspace that captures as much of the variation in the data as possible. PCA finds the directions (principal components) along which the data varies most, ordered by decreasing variance. Projecting onto the top $k$ principal components gives the best rank-$k$ approximation to the centered data.

Setup and Notation

Let $X \in \mathbb{R}^{n \times d}$ be the data matrix with each row $x_i^T$ a data point. Assume the data is centered: $\bar{x} = \frac{1}{n}\sum_i x_i = 0$. If not, subtract the mean first.

Definition

Sample Covariance Matrix

The sample covariance matrix is:

$$\hat{\Sigma} = \frac{1}{n} X^T X \in \mathbb{R}^{d \times d}$$

This is symmetric positive semi-definite. Its eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d \geq 0$ are the variances along the principal directions.

Definition

Principal Components

The principal components are the eigenvectors $v_1, v_2, \ldots, v_d$ of $\hat{\Sigma}$, ordered by decreasing eigenvalue. The $k$-th principal component direction $v_k$ is the direction of the $k$-th largest variance in the data, subject to being orthogonal to $v_1, \ldots, v_{k-1}$.
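
As a sanity check on these definitions, here is a minimal NumPy sketch that centers a synthetic (illustrative) data matrix, forms the sample covariance, and extracts eigenvalue-sorted principal components:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))          # hypothetical data: n=500 points in R^5

Xc = X - X.mean(axis=0)                # center each feature
cov = Xc.T @ Xc / Xc.shape[0]          # sample covariance (1/n convention)

eigvals, eigvecs = np.linalg.eigh(cov) # symmetric eigendecomposition, ascending order
order = np.argsort(eigvals)[::-1]      # sort by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(eigvals)                         # variances along the principal directions
print(eigvecs[:, 0])                   # first principal component direction v1
```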

PCA as Variance Maximization

Theorem

PCA Maximizes Projected Variance

Statement

The first principal component $v_1$ solves:

$$v_1 = \arg\max_{\|v\|=1} \frac{1}{n}\sum_{i=1}^n (v^T x_i)^2 = \arg\max_{\|v\|=1} v^T \hat{\Sigma} v$$

The maximum value is $\lambda_1$, the largest eigenvalue of $\hat{\Sigma}$. More generally, the $k$-th principal component maximizes projected variance subject to orthogonality to the first $k-1$ components.

Intuition

Among all unit directions in $\mathbb{R}^d$, $v_1$ is the one along which the data has the most spread (variance). The second PC is the direction of most remaining spread after removing the $v_1$ component, and so on. Each eigenvalue tells you how much variance that direction captures.

Proof Sketch

We want $\max_{\|v\|=1} v^T \hat{\Sigma} v$. Form the Lagrangian $L = v^T \hat{\Sigma} v - \lambda(v^T v - 1)$. Setting $\nabla_v L = 0$ gives $\hat{\Sigma} v = \lambda v$, an eigenvalue equation. The objective at any eigenvector $v_k$ equals $\lambda_k$. So the maximum is $\lambda_1$, achieved at $v_1$. For subsequent PCs, add orthogonality constraints and use induction.
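
The theorem can be checked numerically: the projected variance at the top eigenvector should equal $\lambda_1$, and no random unit direction should beat it. A small sketch under the same conventions (synthetic data, $1/n$ covariance):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))  # correlated toy data
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / Xc.shape[0]

eigvals, eigvecs = np.linalg.eigh(cov)
v1, lam1 = eigvecs[:, -1], eigvals[-1]       # top eigenpair (eigh sorts ascending)

def projected_variance(v):
    return np.mean((Xc @ v) ** 2)            # (1/n) * sum_i (v^T x_i)^2

assert np.isclose(projected_variance(v1), lam1)

random_dirs = rng.normal(size=(1000, 5))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)
assert all(projected_variance(v) <= lam1 + 1e-12 for v in random_dirs)
```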

Why It Matters

This shows PCA has a clear statistical interpretation: it finds the directions that explain the most variance in the data. The eigenvalues quantify exactly how much variance each component captures, enabling the scree plot for choosing the number of components.

Failure Mode

PCA maximizes variance, not necessarily "importance." If the signal lives in a low-variance direction (e.g., a rare but meaningful feature), PCA will discard it in favor of high-variance noise. PCA also only captures linear structure; it cannot find nonlinear manifolds.

Worked Derivation: PCA as Reconstruction Error

The variance-maximization statement is the short route. The reconstruction route is the one that often feels harder: it starts from a squared-error projection objective, expands it with trace identities, and ends at the same eigenvector problem.

Use a unit direction $u \in \mathbb{R}^d$ with $u^\top u = 1$. The score vector $Xu \in \mathbb{R}^n$ gives one coordinate per example. The reconstruction back in the original feature space is:

$$\hat{X} = Xuu^\top \in \mathbb{R}^{n\times d}.$$

The one-dimensional PCA reconstruction problem is:

$$\underset{u^\top u=1}{\operatorname{argmin}}\, \lVert X - Xuu^\top\rVert_F^2.$$

Using $\lVert A\rVert_F^2=\operatorname{Tr}(A^\top A)$,

$$\lVert X-Xuu^\top\rVert_F^2 = \operatorname{Tr}\!\left((X-Xuu^\top)^\top(X-Xuu^\top)\right).$$

Expand the product:

$$\lVert X-Xuu^\top\rVert_F^2 = \operatorname{Tr}\!\left( X^\top X - X^\top Xuu^\top - uu^\top X^\top X + uu^\top X^\top Xuu^\top \right).$$

The trace trick is only bookkeeping. It lets scalar expressions move into a shape that is easy to compare:

$$\operatorname{Tr}(X^\top Xuu^\top) = \operatorname{Tr}(u^\top X^\top Xu) = u^\top X^\top Xu.$$

The first term $\operatorname{Tr}(X^\top X)$ is constant in $u$, so it does not affect the minimizer. The two middle terms are equal by cyclicity of trace. The last term simplifies because $u$ has unit norm:

$$(uu^\top)(uu^\top)=u(u^\top u)u^\top=uu^\top.$$

So the variable-dependent part is:

$$-2u^\top X^\top Xu + u^\top X^\top Xu = -u^\top X^\top Xu.$$

Minimizing reconstruction error is therefore equivalent to:

$$\underset{u^\top u=1}{\operatorname{argmax}}\, u^\top X^\top Xu.$$

By the Rayleigh-Ritz theorem, the maximizing unit vector is an eigenvector of $X^\top X$ with the largest eigenvalue. This is the first principal component. Because $\hat{\Sigma}=X^\top X/n$, the same vector is also the top eigenvector of the sample covariance matrix; scaling by $1/n$ changes eigenvalues, not eigenvectors.
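
The key identity in this derivation, $\lVert X - Xuu^\top\rVert_F^2 = \lVert X\rVert_F^2 - u^\top X^\top X u$, is easy to spot-check numerically; the sketch below (synthetic data) also confirms that the top eigenvector of $X^\top X$ attains a smaller reconstruction error than a random direction:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 6))
X -= X.mean(axis=0)                           # centered data matrix

u = rng.normal(size=6)
u /= np.linalg.norm(u)                        # arbitrary unit direction

recon_err = np.linalg.norm(X - X @ np.outer(u, u), "fro") ** 2
identity  = np.linalg.norm(X, "fro") ** 2 - u @ X.T @ X @ u
assert np.isclose(recon_err, identity)        # ||X - Xuu^T||_F^2 = ||X||_F^2 - u^T X^T X u

# the top eigenvector of X^T X minimizes the reconstruction error
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
u_star = eigvecs[:, -1]
err_star = np.linalg.norm(X - X @ np.outer(u_star, u_star), "fro") ** 2
assert err_star <= recon_err
```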

For $k>1$, write $D=[u_1,\ldots,u_k]\in\mathbb{R}^{d\times k}$ with $D^\top D=I_k$. The reconstruction is $XDD^\top$, and the same calculation gives:

$$\underset{D^\top D=I_k}{\operatorname{argmax}}\, \operatorname{Tr}(D^\top X^\top XD).$$

Let $X^\top X=V\Lambda V^\top$, with eigenvalues $\lambda_1\geq\cdots\geq\lambda_d$. This is the exercise often left after the one-vector derivation: choose the best orthonormal $k$-frame. In the eigenbasis of $X^\top X$, write $Y=V^\top D$. Then $Y^\top Y=I_k$, and

$$\operatorname{Tr}(D^\top X^\top XD) = \operatorname{Tr}(Y^\top \Lambda Y) = \sum_{r=1}^d \lambda_r \alpha_r,$$

where $\alpha_r=\sum_{j=1}^k Y_{rj}^2$, so $0\leq \alpha_r\leq 1$ and $\sum_r\alpha_r=k$. Because the eigenvalues are sorted, this weighted sum is maximized by placing weight $1$ on the first $k$ eigenvalues and $0$ on the rest. Thus $D=[v_1,\ldots,v_k]$ maximizes the objective, with value $\lambda_1+\cdots+\lambda_k$. This is the Ky Fan/Rayleigh-Ritz argument; the same idea can be read as induction on orthogonal complements.

Geometrically, one-component PCA picks the line through the mean that minimizes the total squared length of the reconstruction arrows (the residuals from each point to its projection). Top-$k$ PCA picks the $k$-dimensional subspace that minimizes the same total squared residual.

PCA via SVD

Write the SVD of the centered data matrix as $X = U \Sigma V^T$, where $U \in \mathbb{R}^{n \times n}$, $\Sigma \in \mathbb{R}^{n \times d}$, and $V \in \mathbb{R}^{d \times d}$.

The connection: $\hat{\Sigma} = \frac{1}{n}X^T X = V \frac{\Sigma^T \Sigma}{n} V^T$.

So the right singular vectors $V$ are the principal component directions, and the squared singular values divided by $n$ are the eigenvalues of $\hat{\Sigma}$: $\lambda_k = \sigma_k^2 / n$.

In practice, always compute PCA via the SVD of $X$, not by forming $X^T X$ and computing its eigendecomposition. The SVD is numerically more stable because forming $X^T X$ squares the condition number.
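
A brief sketch of the SVD route, confirming the stated correspondence $\lambda_k = \sigma_k^2 / n$ and that the right singular vectors match the covariance eigenvectors up to sign (toy data; generic, non-degenerate eigenvalues assumed):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 4))
Xc = X - X.mean(axis=0)
n = Xc.shape[0]

# SVD route (preferred in practice)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pcs_svd = Vt                                   # rows are principal directions
eigvals_svd = s ** 2 / n                       # lambda_k = sigma_k^2 / n

# Eigendecomposition route (for comparison only)
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / n)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

assert np.allclose(eigvals_svd, eigvals)
for k in range(4):                             # directions agree up to sign
    assert np.allclose(np.abs(pcs_svd[k]), np.abs(eigvecs[:, k]), atol=1e-8)
```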

Truncated PCA and Reconstruction

Definition

Truncated PCA

The rank-$k$ PCA approximation keeps only the top $k$ principal components. The projection of a data point $x$ onto the $k$-dimensional subspace is:

$$\tilde{x} = V_k V_k^T x$$

where $V_k = [v_1, \ldots, v_k] \in \mathbb{R}^{d \times k}$.

Theorem

Eckart-Young Theorem (Best Low-Rank Approximation)

Statement

The best rank-$k$ approximation to $X$ in the Frobenius norm is the truncated SVD $X_k = U_k \Sigma_k V_k^T$:

$$X_k = \arg\min_{\operatorname{rank}(M) \leq k} \|X - M\|_F^2$$

The reconstruction error is:

$$\|X - X_k\|_F^2 = \sum_{j=k+1}^{r} \sigma_j^2$$

where $r = \operatorname{rank}(X)$.

Intuition

Truncated PCA gives you the closest rank-$k$ matrix to your data. The reconstruction error is exactly the sum of the squared singular values you threw away. This is why you look at the eigenvalue spectrum to decide how many components to keep.
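
One way to internalize the error formula is to reconstruct at every rank $k$ and compare against the tail sum of squared singular values; a small NumPy check on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 8)) @ rng.normal(size=(8, 8))
X -= X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
for k in range(1, 9):
    Vk = Vt[:k].T                               # d x k matrix of top-k directions
    Xk = X @ Vk @ Vk.T                          # rank-k PCA reconstruction
    err = np.linalg.norm(X - Xk, "fro") ** 2
    tail = np.sum(s[k:] ** 2)                   # sum of discarded squared singular values
    assert np.isclose(err, tail)
```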

Proof Sketch

By the SVD, $X = \sum_j \sigma_j u_j v_j^T$. Any rank-$k$ matrix can capture at most $k$ singular directions. The squared Frobenius norm of $X$ is $\sum_j \sigma_j^2$ (Parseval). Keeping the largest $k$ singular values minimizes the discarded sum $\sum_{j=k+1}^r \sigma_j^2$.

Why It Matters

This theorem justifies truncated PCA as optimal dimensionality reduction in the least-squares sense. It also gives a precise formula for reconstruction error in terms of the discarded eigenvalues, which is what the scree plot visualizes.

Failure Mode

Optimality is in the Frobenius norm sense. If your downstream task cares about something other than squared reconstruction error (e.g., classification accuracy), PCA may not give the best low-dimensional representation.

Choosing the Number of Components

Scree plot: Plot eigenvalues $\lambda_1, \lambda_2, \ldots$ versus index. Look for an "elbow" where the eigenvalues drop sharply and then level off. Keep components before the elbow.

Proportion of variance explained: Keep $k$ components such that:

$$\frac{\sum_{j=1}^k \lambda_j}{\sum_{j=1}^d \lambda_j} \geq 0.95 \text{ (or some threshold)}$$

Kaiser's rule: Keep components with $\lambda_k > \bar{\lambda}$ (the average eigenvalue). For correlation-matrix PCA, this means $\lambda_k > 1$.
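
A sketch applying the variance-explained rule and Kaiser's rule to one hypothetical eigenvalue spectrum; the 0.95 threshold is the conventional but arbitrary choice used above:

```python
import numpy as np

eigvals = np.array([4.1, 2.3, 0.7, 0.4, 0.3, 0.2])      # hypothetical sorted eigenvalues

# proportion of variance explained
ratio = np.cumsum(eigvals) / eigvals.sum()
k_var = int(np.argmax(ratio >= 0.95)) + 1                # smallest k reaching 95%

# Kaiser's rule: keep components above the average eigenvalue
k_kaiser = int(np.sum(eigvals > eigvals.mean()))

print("variance-explained rule:", k_var)
print("Kaiser's rule:", k_kaiser)
# the scree plot is the same information read by eye: plot eigvals vs. index
```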

Connection to Matrix Concentration

When does sample PCA approximate population PCA? If $x_i \sim \mathcal{D}$ with population covariance $\Sigma$, then $\hat{\Sigma} \to \Sigma$ as $n \to \infty$. But the rate matters.

Matrix concentration inequalities (Vershynin 2018, Theorem 4.6.1) show that for sub-Gaussian data with population covariance $\Sigma$, $\|\hat{\Sigma} - \Sigma\|_{\text{op}} \leq C \|\Sigma\|_{\text{op}} \big(\sqrt{d/n} + d/n\big)$ with high probability. The $\sqrt{d/n}$ term dominates in the classical regime $d \ll n$ and matches the rate quoted in elementary references; the $d/n$ term takes over once $d$ is comparable to $n$ and is the right object to track in modern high-dimensional analyses. The Davis-Kahan theorem then bounds the angle between sample and population eigenvectors. PCA is reliable when $n \gg d$ and the eigenvalue gaps are large. In high-dimensional settings ($d \sim n$ or $d \gg n$), sample PCA can be highly misleading: the top sample eigenvector may point in a completely wrong direction. This is one instance of the broader phenomenon where classical estimators break down when $d/n$ is not small; see double descent for analogous behavior in supervised learning.
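
A rough Monte Carlo illustration of the $\sqrt{d/n}$ behavior: draw Gaussian data with identity population covariance and watch the operator-norm error of the sample covariance shrink as $n$ grows (an empirical sketch, not a proof):

```python
import numpy as np

rng = np.random.default_rng(5)
d = 50
for n in [100, 400, 1600, 6400]:
    errs = []
    for _ in range(20):
        X = rng.normal(size=(n, d))                          # population covariance = I_d
        cov_hat = X.T @ X / n
        errs.append(np.linalg.norm(cov_hat - np.eye(d), 2))  # operator norm error
    print(n, np.mean(errs), np.sqrt(d / n))                  # error tracks sqrt(d/n) up to a constant
```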

Optional Extensions: Beyond Classical PCA

Classical PCA is a deterministic eigen-decomposition of the sample covariance. Five common variants relax or replace pieces of that recipe and earn their place under different conditions. Each is shipped here as a self-contained mini-section so this page can serve as a single landing for the family.

Probabilistic PCA (PPCA)

Tipping and Bishop (1999) gave PCA an explicit generative-model interpretation:

$$x = W z + \mu + \varepsilon, \qquad z \sim \mathcal{N}(0, I_k), \quad \varepsilon \sim \mathcal{N}(0, \sigma^2 I_d),$$

with latent $z \in \mathbb{R}^k$, loading matrix $W \in \mathbb{R}^{d \times k}$, and isotropic noise $\sigma^2$. Marginalizing out $z$:

$$x \sim \mathcal{N}(\mu, W W^\top + \sigma^2 I_d).$$

The MLE for $W$ is $W_{\text{ML}} = U_k (\Lambda_k - \sigma^2 I_k)^{1/2} R$, where $U_k$ holds the top-$k$ sample-covariance eigenvectors, $\Lambda_k$ the top-$k$ eigenvalues, and $R$ is an arbitrary $k \times k$ orthogonal matrix (rotational ambiguity). The MLE for $\sigma^2$ is the average of the discarded eigenvalues. As $\sigma^2 \to 0$, the model collapses to classical PCA.

Why it matters. PPCA gives PCA a probabilistic vocabulary: a posterior over latents (useful for missing-data imputation), a marginal likelihood (for Bayesian model selection), and natural EM fitting when data is partially observed. The MLE for $W$ is also a clean application of maximum-likelihood estimation to a latent-variable model. PPCA is the bridge from classical PCA to factor analysis, Gaussian-mixture latent-variable models, and Bayesian variants. Use PPCA when you need a generative prior over the data or have missing entries to handle.
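
The closed-form MLE can be read off the sample-covariance eigendecomposition. A sketch that takes $R = I$ and uses synthetic data generated from the PPCA model (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
d, k, n = 10, 3, 2000
W_true = rng.normal(size=(d, k))
Z = rng.normal(size=(n, k))
X = Z @ W_true.T + 0.5 * rng.normal(size=(n, d))     # x = Wz + noise, mean zero

Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / n)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

sigma2_ml = eigvals[k:].mean()                        # average of the discarded eigenvalues
W_ml = eigvecs[:, :k] @ np.diag(np.sqrt(eigvals[:k] - sigma2_ml))  # take R = I

# the model covariance should approximate the sample covariance
model_cov = W_ml @ W_ml.T + sigma2_ml * np.eye(d)
print(np.linalg.norm(model_cov - Xc.T @ Xc / n, 2))
```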

Kernel PCA

Schölkopf, Smola, and Müller (1998) extended PCA to nonlinear feature spaces via the kernel trick. Replace the inner product $\langle x_i, x_j \rangle$ with a kernel $k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$ for some feature map $\phi$. PCA in feature space requires only the centered Gram matrix $K_c = (I - \mathbf{1}\mathbf{1}^\top/n) K (I - \mathbf{1}\mathbf{1}^\top/n)$, where $K_{ij} = k(x_i, x_j)$. Eigendecompose $K_c$ to get coefficients; the $k$-th nonlinear principal component evaluated at a new point $x$ is

$$\langle \phi(x), v_k \rangle = \sum_{i=1}^n \alpha_{k,i}\, k(x_i, x),$$

with $\alpha_k$ the (rescaled) eigenvector of $K_c$.

Why it matters. Kernel PCA captures nonlinear structure (curved manifolds, clusters in feature space) that classical PCA cannot resolve in the original coordinates. The cost is $O(n^2)$ kernel evaluations and an $n \times n$ eigendecomposition rather than $d \times d$. Use kernel PCA when the nonlinear structure is dominant and $n$ is moderate; for very large $n$, Nyström approximations or random-feature methods scale it.
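
A minimal kernel-PCA sketch working directly with the Gram matrix; the RBF bandwidth and the ring-shaped toy data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)
theta = rng.uniform(0, 2 * np.pi, size=200)
X = np.column_stack([np.cos(theta), np.sin(theta)]) + 0.05 * rng.normal(size=(200, 2))
# points near a circle: linear PCA cannot "unroll" this structure

gamma = 2.0                                           # RBF bandwidth (assumed)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * sq_dists)                         # Gram matrix K_ij = k(x_i, x_j)

n = K.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
Kc = J @ K @ J                                        # double centering in feature space

eigvals, eigvecs = np.linalg.eigh(Kc)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# nonlinear PC scores for the training points: sqrt(lambda_j) * eigvec[i, j]
# (one common normalization convention)
scores = eigvecs[:, :2] * np.sqrt(np.maximum(eigvals[:2], 0))
print(scores[:5])
```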

Factor Analysis vs PCA

Factor analysis (FA) shares the $x = Wz + \mu + \varepsilon$ structure but allows anisotropic noise:

$$\varepsilon \sim \mathcal{N}(0, \Psi), \qquad \Psi = \operatorname{diag}(\psi_1, \ldots, \psi_d).$$

Each variable has its own residual variance $\psi_j$. The loading matrix $W$ captures common variance shared across variables; $\Psi$ captures variable-specific noise.

Why it matters. PCA absorbs all variance into the top components, so a single high-variance noisy variable will dominate the first PC even if it shares no signal with the others. FA separates the shared (common) variance from the variable-specific (unique) variance. In psychometrics, signal-detection settings, and any case where variables have heterogeneous noise scales, FA is the right model and PCA is at best a fast approximation. Conversely, when the variables are roughly homoscedastic and you only want low-dimensional reconstruction, PCA is simpler and faster.
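
A quick contrast on synthetic data with one noisy, signal-free variable, using scikit-learn's PCA and FactorAnalysis (assumed available); in this setup PCA's first component chases the high-variance noise while the FA loading concentrates on the shared factor:

```python
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis

rng = np.random.default_rng(8)
n = 2000
z = rng.normal(size=(n, 1))                       # one shared latent factor
W = np.array([[1.0], [1.0], [1.0], [0.0]])        # last variable shares no signal
noise_scale = np.array([0.1, 0.1, 0.1, 3.0])      # but has large idiosyncratic noise
X = z @ W.T + noise_scale * rng.normal(size=(n, 4))

pca = PCA(n_components=1).fit(X)
fa = FactorAnalysis(n_components=1).fit(X)

print("PCA loading:", pca.components_[0])          # dominated by the noisy 4th variable
print("FA loading: ", fa.components_[0])           # concentrates on the shared signal
```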

Sparse PCA

Classical PCA produces dense loading vectors: each PC mixes contributions from every original variable, which destroys interpretability when $d$ is large. Sparse PCA (Zou, Hastie, Tibshirani 2006) adds an $\ell_1$ penalty to the loading vector:

$$\max_v\; v^\top \hat{\Sigma} v - \lambda \|v\|_1, \qquad \|v\|_2 \le 1.$$

The penalty zeros out small loadings; the resulting PC depends on a small subset of variables, making it interpretable. The objective is non-convex; solvers use alternating gradient or semidefinite relaxations.

Why it matters. Use sparse PCA in high-dimensional settings (genomics, text, network features) where you want a small set of identifiable variables driving each component. The cost is rotational ambiguity (each sparse PC depends on the regularization path) and a non-convex optimization. Cross-validation on $\lambda$ is standard. The penalty mechanism is the same as in lasso regression; sparse PCA is the unsupervised counterpart.
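
scikit-learn ships a SparsePCA estimator; note it optimizes a dictionary-learning formulation rather than the penalized Rayleigh quotient written above, so treat this as a practical stand-in rather than a literal implementation of that objective:

```python
import numpy as np
from sklearn.decomposition import PCA, SparsePCA

rng = np.random.default_rng(9)
n, d = 500, 30
signal = rng.normal(size=(n, 1))
X = 0.5 * rng.normal(size=(n, d))
X[:, :5] += signal                                # only the first 5 variables carry the factor

dense = PCA(n_components=1).fit(X)
sparse = SparsePCA(n_components=1, alpha=2.0, random_state=0).fit(X)

print("nonzero dense loadings: ", np.sum(np.abs(dense.components_[0]) > 1e-3))
print("nonzero sparse loadings:", np.sum(np.abs(sparse.components_[0]) > 1e-3))
```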

Robust PCA

Classical PCA's $\ell_2$ objective is sensitive to outliers: a single corrupted observation can rotate the top component substantially. Robust PCA (Candès, Li, Ma, Wright 2011) decomposes the data matrix $X$ as a sum of a low-rank component $L$ (the "true" signal) and a sparse component $S$ (the corruptions):

$$\min_{L, S}\; \|L\|_* + \lambda \|S\|_1 \quad \text{s.t.} \quad X = L + S,$$

where $\|\cdot\|_*$ is the nuclear norm (sum of singular values, a convex surrogate for rank) and $\|\cdot\|_1$ is entry-wise. The decomposition recovers $L$ exactly under low-rank-plus-sparse identifiability conditions on $X$.

Why it matters. Use robust PCA when outliers are gross (entry-level corruption, missing-or-corrupt entries, or rare adversarial perturbations) rather than Gaussian noise. The classic application is video background-foreground separation: the static background is the low-rank $L$; the moving foreground is the sparse $S$. The low-rank component is computed via SVD inside the iterative solver, so robust PCA inherits the SVD-based intuition while replacing the simple top-$k$ truncation with a corruption-aware decomposition. Modern variants extend to streaming and tensor decompositions.
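
Below is a bare-bones sketch of principal component pursuit via an ADMM-style iteration (singular-value thresholding for $L$, entrywise soft-thresholding for $S$); the parameter defaults follow common choices from the literature, and the data is synthetic:

```python
import numpy as np

def shrink(M, tau):
    """Entrywise soft-thresholding."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def svd_shrink(M, tau):
    """Singular-value soft-thresholding (proximal operator of the nuclear norm)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def robust_pca(X, lam=None, mu=None, n_iter=200):
    """A bare-bones ADMM sketch for principal component pursuit (illustrative only)."""
    n, d = X.shape
    lam = lam or 1.0 / np.sqrt(max(n, d))            # common default from the PCP literature
    mu = mu or 0.25 * n * d / (np.abs(X).sum() + 1e-12)
    L = np.zeros_like(X)
    S = np.zeros_like(X)
    Y = np.zeros_like(X)
    for _ in range(n_iter):
        L = svd_shrink(X - S + Y / mu, 1.0 / mu)     # low-rank update
        S = shrink(X - L + Y / mu, lam / mu)         # sparse update
        Y = Y + mu * (X - L - S)                     # dual update
    return L, S

rng = np.random.default_rng(10)
L_true = rng.normal(size=(80, 3)) @ rng.normal(size=(3, 60))   # rank-3 signal
S_true = np.zeros((80, 60))
idx = rng.random((80, 60)) < 0.05                              # 5% gross corruptions
S_true[idx] = rng.normal(scale=10, size=idx.sum())
X = L_true + S_true

L_hat, S_hat = robust_pca(X)
print(np.linalg.norm(L_hat - L_true, "fro") / np.linalg.norm(L_true, "fro"))
```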

Summary table

| Variant | Replaces | Use when |
| --- | --- | --- |
| PPCA | Deterministic eigendecomposition with a Gaussian latent-variable model | You need a generative prior, EM for missing data, or a marginal likelihood |
| Kernel PCA | Linear inner product with a kernel | Nonlinear structure dominates and $n$ is moderate |
| Factor Analysis | Isotropic noise $\sigma^2 I$ with anisotropic noise $\operatorname{diag}(\psi_j)$ | Heterogeneous variable-specific noise scales |
| Sparse PCA | Dense loadings with $\ell_1$-penalized loadings | High-$d$ interpretability matters |
| Robust PCA | $\ell_2$ objective with low-rank-plus-sparse decomposition | Gross outliers (not Gaussian noise) |

Each variant is a separate research thread; the references at the end of this page point to the foundational papers.

Canonical Examples

Example

PCA on 2D data

Suppose $n$ points in $\mathbb{R}^2$ form an elliptical cloud with major axis along $(1/\sqrt{2}, 1/\sqrt{2})$ and minor axis along $(-1/\sqrt{2}, 1/\sqrt{2})$. The first PC is the major axis direction (most variance), and the second PC is the minor axis. Projecting onto just the first PC collapses the data to a line along the major axis, retaining the most variance possible in one dimension.
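
The example is easy to reproduce; the covariance below is an assumption chosen so the major axis lies along $(1,1)/\sqrt{2}$ with variance 4 and the minor axis has variance 0.5:

```python
import numpy as np

rng = np.random.default_rng(11)
major = np.array([1.0, 1.0]) / np.sqrt(2)
minor = np.array([-1.0, 1.0]) / np.sqrt(2)
cov = 4.0 * np.outer(major, major) + 0.5 * np.outer(minor, minor)   # variances 4 and 0.5

X = rng.multivariate_normal(mean=[0, 0], cov=cov, size=1000)
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

print(Vt[0])           # first PC: close to +/- (0.707, 0.707)
print(Vt[1])           # second PC: close to +/- (-0.707, 0.707)
```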

Common Confusions

Watch Out

PCA and SVD are NOT the same thing

This is the most common confusion. SVD is a matrix factorization: $X = U\Sigma V^T$. PCA is a statistical procedure that involves (1) centering the data and (2) finding directions of maximum variance. PCA uses SVD (of the centered data matrix) as its computational engine, but they are different concepts. SVD does not center the data. If you run SVD on uncentered data, you do not get PCA. The eigenvalues of $\hat{\Sigma}$ are $\sigma_k^2/n$, not $\sigma_k$.
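
A small numerical demonstration: shift the data by a large constant offset and the top right singular vector of the uncentered matrix chases the mean rather than the direction of maximum variance (toy data, illustrative):

```python
import numpy as np

rng = np.random.default_rng(12)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [0.0, 0.3]])   # most variance along x-axis
X_shifted = X + np.array([0.0, 100.0])                               # large mean along y-axis

_, _, Vt_centered = np.linalg.svd(X_shifted - X_shifted.mean(axis=0), full_matrices=False)
_, _, Vt_raw = np.linalg.svd(X_shifted, full_matrices=False)

print(Vt_centered[0])   # ~ (+/-1, 0): the true high-variance direction
print(Vt_raw[0])        # ~ (0, +/-1): dominated by the uncentered mean, not the variance
```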

Watch Out

PCA components are not features

The principal components are linear combinations of the original features. The first PC score is $v_1^T x = \sum_j v_{1j} x_j$, which mixes all original features. Interpreting what a principal component "means" requires examining the loadings $v_{1j}$. High-dimensional PCA components are often uninterpretable.

Watch Out

PCA is not every dimensionality reduction method

PCA is not the same question as LDA, NMF, t-SNE, or UMAP. PCA is linear, unsupervised, and variance/reconstruction based. LDA is supervised and uses labels to separate classes. Non-negative matrix factorization constrains factors to be nonnegative, which can give parts-based decompositions. t-SNE and UMAP are nonlinear visualization methods that emphasize local neighborhoods and may distort global distances.

Summary

  • PCA finds directions of maximum variance: $\max_{\|v\|=1} v^T \hat{\Sigma} v$
  • The reconstruction derivation reduces $\min_{\|u\|=1}\lVert X-Xuu^\top\rVert_F^2$ to a Rayleigh quotient
  • Principal components = eigenvectors of $\hat{\Sigma} = \frac{1}{n}X^T X$
  • Eigenvalues = variances along principal directions
  • Compute via SVD of $X$ (not eigendecomposition of $X^T X$) for stability
  • Eckart-Young: truncated SVD is the best low-rank approximation
  • Reconstruction error = sum of discarded squared singular values ($n$ times the discarded eigenvalues)
  • PCA requires centering; SVD alone does not center

Exercises

ExerciseCore

Problem

Show that the first principal component direction $v_1$ maximizes the projected variance $\frac{1}{n}\sum_i (v^T x_i)^2$ subject to $\|v\| = 1$. Use the method of Lagrange multipliers.

ExerciseAdvanced

Problem

Prove the $k$-component version: if $D^\top D=I_k$, then minimizing $\lVert X-XDD^\top\rVert_F^2$ is solved by taking the columns of $D$ to be the top $k$ eigenvectors of $X^\top X$.

ExerciseAdvanced

Problem

Explain why computing PCA via the eigendecomposition of $X^T X$ is numerically inferior to computing PCA via the SVD of $X$. What goes wrong in floating-point arithmetic?

References

Canonical:

  • Jolliffe, Principal Component Analysis (2002), Chapters 1-3
  • Strang, Linear Algebra and Its Applications (4th ed.), Ch. 6 (Positive Definite Matrices, SVD)
  • Goodfellow, Bengio, Courville, Deep Learning (2016), Ch. 2 (Linear Algebra, PCA derivation)
  • Eckart, C. & Young, G. (1936). The approximation of one matrix by another of lower rank. Psychometrika, 1(3), 211-218.
  • Davis, C. & Kahan, W. M. (1970). The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis, 7(1), 1-46.

Current:

  • Shalev-Shwartz & Ben-David, Understanding Machine Learning (2014), Chapter 23
  • Vershynin, High-Dimensional Probability (2018), Chapter 4 (for matrix concentration and sample PCA)
  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning (2009), Ch. 14 (Unsupervised Learning)
  • Bishop, Pattern Recognition and Machine Learning (2006), Ch. 12 (Continuous Latent Variables)

High-Dimensional PCA:

  • Johnstone, I. M. (2001). On the distribution of the largest eigenvalue in principal components analysis. Annals of Statistics, 29(2), 295-327. arXiv:math/0309281.
  • Baik, J., Ben Arous, G., & Peche, S. (2005). Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. Annals of Probability, 33(5), 1643-1697. arXiv:math/0403022.

Beyond classical PCA (extension references):

  • Tipping, M. E., & Bishop, C. M. (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B, 61(3), 611-622. The PPCA paper; gives PCA the generative-model interpretation and the EM algorithm for missing data.
  • Schölkopf, B., Smola, A., & Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5), 1299-1319. Kernel PCA introduction.
  • Zou, H., Hastie, T., & Tibshirani, R. (2006). Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2), 265-286. Sparse PCA via $\ell_1$-penalized loadings.
  • Candès, E. J., Li, X., Ma, Y., & Wright, J. (2011). Robust principal component analysis? Journal of the ACM, 58(3), 1-37. Low-rank-plus-sparse decomposition; nuclear-norm relaxation.
  • Bartholomew, Knott, Moustaki, Latent Variable Models and Factor Analysis (3rd ed., 2011). The graduate textbook on FA and its relationship to PCA.


Last reviewed: May 4, 2026
