Part 3 · Core Linear Algebra · Chapter 11 · 85 min

Eigenvalues, Eigenvectors, and PCA Intuition

Invariant directions and the axes of variance

Learning objectives

  • Define eigenvectors/eigenvalues and read Av = λv geometrically
  • Explain covariance as a shape descriptor of data
  • Describe PCA as an eigen-decomposition of the covariance matrix
  • Reason about dimensionality reduction and information loss

When a matrix leaves a direction alone

A matrix is a transformation: feed it a vector and it hands you back another vector, generally rotated, stretched, and sheared into a new place. For almost every input direction the output points somewhere new. But for a handful of special directions the matrix does something remarkably tame — it leaves the direction exactly where it was and merely scales the arrow along that line.

Those special directions are the eigenvectors of the matrix, and the scale factors are its eigenvalues. They are the skeleton of the transformation: the axes along which it acts most simply. This one idea powers a surprising amount of machine learning — principal component analysis, spectral clustering, PageRank, the stability analysis of training dynamics, and the intuition behind why some directions in a model's weight space matter far more than others.

We will keep everything in two dimensions so you can compute it by hand and watch it on screen. The mechanics that scale this to hundreds of dimensions — the singular value decomposition — belong to Volume 2; here we build the intuition that makes that machinery feel inevitable.

Intuition: eigenvectors are the axes a matrix does not turn

Apply a matrix $A$ to a vector $\mathbf{v}$ and, in general, $A\mathbf{v}$ points in a new direction. An eigenvector is a nonzero $\mathbf{v}$ for which the output lands right back on the same line through the origin:

$$A\mathbf{v} = \lambda\mathbf{v}.$$

Read that geometrically: transforming $\mathbf{v}$ produces a vector pointing the same way (or exactly opposite, if $\lambda < 0$), only longer or shorter by the factor $|\lambda|$. The direction is invariant; only the length changes. If $\lambda = 2$ the arrow doubles; if $\lambda = 0.5$ it halves; if $\lambda = -1$ it flips end-to-end but stays on its own line.

Interactive Lab: Matrix Transformation Visualizer

Drag the input vector around the circle and watch the output. For most angles the blue input and the red output diverge. But rotate the input onto an eigenvector and the two arrows snap onto the same line — the matrix stops turning it and only scales it. Those alignment angles are the eigenvectors; the length ratio at the alignment is the eigenvalue.
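
The same sweep can be approximated offline. The sketch below is only an illustration: the matrix `A` and the helper name `alignment_angles` are assumptions (the lab's actual matrix is not stated in the text). It steps a unit vector around the half circle and reports the angles where input and output fall on one line.

```python
import numpy as np

# Illustrative matrix (an assumption; the lab's own matrix is not given).
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

def alignment_angles(A, step_deg=1):
    """Return (angle_deg, scale) pairs where A @ v stays on the line of v."""
    hits = []
    for deg in range(0, 180, step_deg):       # a line repeats every 180 degrees
        t = np.deg2rad(deg)
        v = np.array([np.cos(t), np.sin(t)])  # unit input vector
        out = A @ v
        # The 2-D cross product vanishes exactly when v and A @ v are parallel.
        cross = v[0] * out[1] - v[1] * out[0]
        if np.isclose(cross, 0.0, atol=1e-9):
            # Signed length ratio along v: this is the eigenvalue.
            hits.append((deg, float(v @ out)))
    return hits

hits = alignment_angles(A)
for deg, scale in hits:
    print(f"aligned at {deg} degrees, scale factor {scale:.3f}")
```

For this particular matrix the alignments land at 45 and 135 degrees with scale factors 3 and 1. A coarse integer-degree sweep only finds eigenvectors that sit on the grid, so treat it as a picture aid rather than a solver.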

Now the leap that makes this an ML tool. Suppose the matrix is not an arbitrary transformation but the covariance matrix of a cloud of data. Its eigenvectors are then the directions along which the cloud is stretched, and its eigenvalues measure how much spread there is along each. The longest axis of the cloud — the direction of maximum variance — is the top eigenvector. That is PCA in one sentence, and it is worth seeing before we formalize anything.

Interactive Lab: PCA Intuition Visualizer

Rotate and stretch the data cloud. The first principal component (the long axis of the ellipse) always locks onto the direction of greatest spread, and the reported "explained variance" is exactly the eigenvalue along that axis. Keep this picture in mind — every definition below is just this drawing made precise.

Formal definitions

A 2×2 eigen-computation by hand

The characteristic equation is worth grinding through once so it stops being magic. Take the symmetric matrix

$$A = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}.$$

Form $A - \lambda I$ and set its determinant to zero:

$$\det(A - \lambda I) = \begin{vmatrix} 2 - \lambda & 1 \\ 1 & 2 - \lambda \end{vmatrix} = (2 - \lambda)^2 - 1 = \lambda^2 - 4\lambda + 3 = 0.$$

This factors as $(\lambda - 3)(\lambda - 1) = 0$, so the eigenvalues are $\lambda_1 = 3$ and $\lambda_2 = 1$. A useful sanity check: the eigenvalues must sum to the trace ($2 + 2 = 4 = 3 + 1$) and multiply to the determinant ($2\cdot2 - 1\cdot1 = 3 = 3\cdot1$).

Now find each eigenvector by solving $(A - \lambda I)\mathbf{v} = \mathbf{0}$.
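
As a worked sketch of those two solves, assume the matrix is $A = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}$ (the one consistent with the trace and determinant check, and the one used in the active-recall question). The unit-length normalization and the choice of sign are conventions, not requirements:

```latex
% Eigenvalue 3: both rows say v_1 = v_2
(A - 3I)\mathbf{v}
  = \begin{bmatrix} -1 & 1 \\ 1 & -1 \end{bmatrix}
    \begin{bmatrix} v_1 \\ v_2 \end{bmatrix}
  = \mathbf{0}
  \;\Rightarrow\; v_1 = v_2
  \;\Rightarrow\; \mathbf{v}_1 = \tfrac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ 1 \end{bmatrix}

% Eigenvalue 1: both rows say v_1 = -v_2
(A - I)\mathbf{v}
  = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}
    \begin{bmatrix} v_1 \\ v_2 \end{bmatrix}
  = \mathbf{0}
  \;\Rightarrow\; v_1 = -v_2
  \;\Rightarrow\; \mathbf{v}_2 = \tfrac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ -1 \end{bmatrix}

% Orthogonality check
\mathbf{v}_1^\top \mathbf{v}_2 = \tfrac{1}{2}(1 \cdot 1 + 1 \cdot (-1)) = 0
```

In each case the two rows of $A - \lambda I$ are multiples of one another, which is what a zero determinant guarantees: the system has a whole line of solutions, and any nonzero point on that line is an eigenvector.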

Let NumPy confirm the hand computation. np.linalg.eig returns eigenvalues and eigenvectors (as the columns of the returned matrix):

eig_by_hand.py
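
The original listing is not reproduced here. A minimal sketch of what such a check could look like, assuming the matrix $\begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}$ from the hand computation, is:

```python
import numpy as np

# Matrix from the hand computation (assumed to be [[2, 1], [1, 2]]).
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigvals, eigvecs = np.linalg.eig(A)

# eig does not promise any ordering, so sort by descending eigenvalue.
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]          # eigenvectors are the COLUMNS

print("eigenvalues:", eigvals)
print("eigenvectors (columns):")
print(eigvecs)

# Each pair must satisfy A v = lambda v.
for lam, v in zip(eigvals, eigvecs.T):
    assert np.allclose(A @ v, lam * v)

# Trace and determinant sanity checks from the text.
assert np.isclose(eigvals.sum(), np.trace(A))
assert np.isclose(eigvals.prod(), np.linalg.det(A))
```

The returned eigenvectors are unit length but their signs are arbitrary, so a column may come back as the negative of the hand-derived vector. Both describe the same line.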

Symmetric matrices: real eigenvalues, orthogonal eigenvectors

A general matrix can have complex eigenvalues (a rotation by any angle other than 0° or 180° has no real ones, because nothing stays on its own line). But symmetric matrices, where $A = A^\top$, are guaranteed two beautiful properties:

  • Every eigenvalue is a real number.
  • Eigenvectors for distinct eigenvalues are orthogonal, and one can always choose a full orthonormal set of them.

This is the spectral theorem, and it is exactly why PCA works so cleanly: a covariance matrix is symmetric by construction ($\Sigma_{jk} = \Sigma_{kj}$), so its principal components are real directions that are mutually perpendicular, giving a clean, non-redundant set of axes for the data.
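
Both halves of that contrast can be checked numerically. The following is a small sketch with illustrative matrices (the names `S` and `R` are assumptions for the demo): a symmetric matrix passed to `np.linalg.eigh`, and a 90-degree rotation passed to `np.linalg.eigvals`.

```python
import numpy as np

S = np.array([[2.0, 1.0],
              [1.0, 2.0]])                        # symmetric: S == S.T

theta = np.pi / 2                                 # 90-degree rotation
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# eigh is specialized for symmetric (Hermitian) input. It returns real
# eigenvalues in ascending order and orthonormal eigenvectors as columns.
w, Q = np.linalg.eigh(S)
orthonormal = np.allclose(Q.T @ Q, np.eye(2))
reconstructs = np.allclose(Q @ np.diag(w) @ Q.T, S)   # S = Q diag(w) Q^T

# The rotation turns every direction, so its eigenvalues are complex.
rot_vals = np.linalg.eigvals(R)

print("symmetric eigenvalues:", w)
print("Q orthonormal:", orthonormal, "| S reconstructed:", reconstructs)
print("rotation eigenvalues:", rot_vals)
```

Preferring `eigh` over `eig` for covariance matrices is a common choice because it exploits the symmetry and never returns spurious complex parts from rounding.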

The covariance matrix is a shape descriptor

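Before eigenvectors, understand what $\Sigma$ itself says. For 2-D centered data its entries are

$$\Sigma = \begin{bmatrix} \operatorname{var}(x_1) & \operatorname{cov}(x_1, x_2) \\ \operatorname{cov}(x_1, x_2) & \operatorname{var}(x_2) \end{bmatrix}.$$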

The diagonal tells you how far the cloud spreads along each raw axis. The off-diagonal tells you whether the features move together: positive covariance tilts the cloud up-and-to-the-right, negative tilts it the other way, and zero covariance means the cloud is axis-aligned. Two datasets can have identical diagonals but wildly different shapes depending on that off-diagonal term: it is the tilt of the ellipse. PCA's job is to find the axes of that tilted ellipse, which are precisely the eigenvectors of $\Sigma$.
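
To make the "same diagonals, different tilt" point concrete, here is a small sketch on two made-up three-point datasets. The data and the helper name `covariance` are illustrative assumptions; the $1/n$ normalization is the one this chapter's summary uses.

```python
import numpy as np

def covariance(X):
    """Covariance with the 1/n convention: Sigma = Xc^T Xc / n."""
    Xc = X - X.mean(axis=0)            # center each feature (column)
    return Xc.T @ Xc / X.shape[0]

# Two tiny hypothetical datasets: same spread per axis, opposite tilt.
up = np.array([[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0]])
down = np.array([[-1.0, 1.0], [0.0, 0.0], [1.0, -1.0]])

S_up = covariance(up)
S_down = covariance(down)

print("tilted up:")
print(S_up)
print("tilted down:")
print(S_down)

# np.cov gives the same matrix when told to use 1/n (bias=True)
# and to treat columns as variables (rowvar=False).
assert np.allclose(S_up, np.cov(up, rowvar=False, bias=True))
```

The two matrices share a diagonal and differ only in the sign of the off-diagonal entry, which is exactly the tilt. Note that `np.cov` defaults to the $1/(n-1)$ normalization, so `bias=True` is needed to match the convention here.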

Why the top eigenvector maximizes variance

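Here is the derivation that links "direction of maximum spread" to "top eigenvector." Project every centered point onto a candidate unit direction $\mathbf{w}$ (so the projected coordinate of point $\mathbf{x}_i$ is $\mathbf{w}^\top\mathbf{x}_i$). The variance of those projections is

$$\frac{1}{n}\sum_{i=1}^{n}\left(\mathbf{w}^\top\mathbf{x}_i\right)^2 = \mathbf{w}^\top\left(\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i^\top\right)\mathbf{w} = \mathbf{w}^\top\Sigma\mathbf{w}.$$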

PCA asks: which unit direction $\mathbf{w}$ makes this projected variance as large as possible? We maximize $\mathbf{w}^\top\Sigma\mathbf{w}$ subject to $\lVert\mathbf{w}\rVert^2 = 1$. Introducing a Lagrange multiplier $\lambda$ for the constraint and setting the derivative to zero gives

$$\Sigma\mathbf{w} = \lambda\mathbf{w}.$$

That is the whole engine: principal components are the eigenvectors of the covariance matrix, ordered by eigenvalue, and each eigenvalue is the variance captured along its component.
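
For readers who want the intermediate steps, the Lagrange argument can be written out as follows. The symbol $\mathcal{L}$ for the Lagrangian is notation introduced here, not taken from the text, and the gradient uses the symmetry of $\Sigma$:

```latex
% Lagrangian for: maximize w^T Sigma w subject to w^T w = 1
\mathcal{L}(\mathbf{w}, \lambda)
  = \mathbf{w}^\top \Sigma \mathbf{w} - \lambda\,(\mathbf{w}^\top\mathbf{w} - 1)

% Stationarity in w (Sigma symmetric, so the gradient of w^T Sigma w is 2 Sigma w)
\nabla_{\mathbf{w}} \mathcal{L}
  = 2\Sigma\mathbf{w} - 2\lambda\mathbf{w} = \mathbf{0}
  \;\Rightarrow\; \Sigma\mathbf{w} = \lambda\mathbf{w}

% Value of the objective at any stationary point
\mathbf{w}^\top \Sigma \mathbf{w}
  = \mathbf{w}^\top (\lambda \mathbf{w})
  = \lambda\,\lVert\mathbf{w}\rVert^2
  = \lambda
```

The last line is what justifies ordering by eigenvalue: every stationary direction is an eigenvector, the variance it captures equals its eigenvalue, and so the maximizer is the eigenvector with the largest eigenvalue.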

ML use case: dimensionality reduction and beyond

Real feature vectors and embeddings live in hundreds or thousands of dimensions, but their intrinsic variation often occupies far fewer. PCA exploits this.

  • Dimensionality reduction. Keep only the top $k$ components and project onto them: $Z = X_c W_k$, where $W_k$ holds the top $k$ eigenvectors as columns. Each row of $X$ (dimension $d$) becomes a row of $Z$ (dimension $k$). You have compressed the data while keeping the directions that carry the most variance.
  • Explained variance and information loss. The fraction of variance kept is the eigenvalue mass you retained, $\sum_{i \le k}\lambda_i \big/ \sum_j \lambda_j$. The variance you threw away is the discarded eigenvalue mass, a precise, honest measure of information loss. A scree plot of eigenvalues tells you where the spectrum flattens and extra components stop paying rent.
  • Visualizing embeddings. Projecting a 768-dimensional embedding set onto its top 2 components is the standard way to see structure — clusters of similar words or images — on a flat plot.
  • Whitening. Rotating data onto its principal axes and then rescaling each by $1/\sqrt{\lambda_k}$ produces features with identity covariance (unit variance, uncorrelated), a common preprocessing step that can stabilize training.
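
The projection, explained-variance, and whitening bullets can be sketched together in a few lines. Everything about the data below is an assumption made for illustration (a synthetic 3-D cloud with a hypothetical mixing matrix); only the formulas come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical correlated 3-D data: independent noise times a mixing matrix.
mix = np.array([[2.0, 0.0, 0.0],
                [1.0, 1.0, 0.0],
                [0.5, 0.2, 0.3]])
X = rng.normal(size=(500, 3)) @ mix

Xc = X - X.mean(axis=0)                  # center first
Sigma = Xc.T @ Xc / len(Xc)              # 1/n covariance

lam, W = np.linalg.eigh(Sigma)           # ascending eigenvalues
order = np.argsort(lam)[::-1]            # re-sort to descending
lam, W = lam[order], W[:, order]

# Dimensionality reduction: Z = Xc W_k with the top k eigenvectors.
k = 2
Z = Xc @ W[:, :k]
kept = lam[:k].sum() / lam.sum()         # fraction of variance retained

# Whitening: rotate onto all principal axes, then divide each axis
# by the square root of its eigenvalue.
Z_white = (Xc @ W) / np.sqrt(lam)
cov_white = Z_white.T @ Z_white / len(Z_white)

print("reduced shape:", Z.shape)
print(f"variance kept with k={k}: {kept:.3f}")
print("whitened covariance close to identity:",
      np.allclose(cov_white, np.eye(3)))
```

Whitening divides by $\sqrt{\lambda}$, so it amplifies noise along directions with tiny eigenvalues. Practical implementations often add a small constant before the square root, which this sketch omits.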

PCA from scratch in NumPy

Now assemble the full pipeline — center, covariance, eigen-decompose, sort, project — on a small correlated 2-D dataset, and verify the two facts we derived: the projected variances equal the eigenvalues, and the ratios sum to one.

pca_from_scratch.py
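
The original listing is not reproduced here. The sketch below is one way the pipeline could look. The dataset is an assumption: 500 synthetic points spread mostly along the direction $(2, 1)$, with spreads of 2.0 and 0.5 chosen for illustration and a nonzero mean so that centering actually does something.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical data: long axis along (2, 1), short axis perpendicular to it.
u = np.array([2.0, 1.0]) / np.sqrt(5.0)
u_perp = np.array([-1.0, 2.0]) / np.sqrt(5.0)
X = (rng.normal(scale=2.0, size=(n, 1)) * u
     + rng.normal(scale=0.5, size=(n, 1)) * u_perp
     + np.array([5.0, -3.0]))            # shift away from the origin

# 1. Center: subtract the per-feature mean.
Xc = X - X.mean(axis=0)

# 2. Covariance with the 1/n convention.
Sigma = Xc.T @ Xc / n

# 3. Eigen-decompose (eigh: symmetric input, real output).
eigvals, eigvecs = np.linalg.eigh(Sigma)

# 4. Sort by descending eigenvalue.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 5. Project onto the principal axes.
Z = Xc @ eigvecs

proj_var = Z.var(axis=0)                 # ddof=0 matches the 1/n covariance
ratios = eigvals / eigvals.sum()

print("PC1 direction:", eigvecs[:, 0])
print("eigenvalues:", eigvals)
print("explained variance ratios:", ratios)

# The two facts derived in the text.
assert np.allclose(proj_var, eigvals)    # projected variance == eigenvalue
assert np.isclose(ratios.sum(), 1.0)     # ratios sum to one
```

Because eigenvectors are only defined up to sign, the printed PC1 may point along $(2, 1)$ or along $(-2, -1)$. Comparing `abs(eigvecs[:, 0] @ u)` to 1 is a sign-safe way to check the direction.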

The recovered PC1 direction is (up to sign) the unit vector along $(2, 1)$, the axis we built the data around, and over 90% of the variance lives there, so collapsing to one dimension loses very little. That is the payoff: a principled, measurable trade between compression and fidelity.

Research-paper equation practice

Decode these two equations with the full nine-step drill before revealing the solution. Together they are the entire foundation of this chapter.

Summary

  • An eigenvector $\mathbf{v}$ of $A$ satisfies $A\mathbf{v} = \lambda\mathbf{v}$: the matrix leaves its direction invariant and only scales it by the eigenvalue $\lambda$. Eigenvectors are lines (scale-invariant) and carry a sign ambiguity.
  • Eigenvalues are the roots of the characteristic equation $\det(A - \lambda I) = 0$. For a $2\times2$ matrix this is a quadratic whose roots sum to the trace and multiply to the determinant.
  • Symmetric matrices have real eigenvalues and orthogonal eigenvectors (the spectral theorem), which is the reason PCA yields clean, perpendicular axes.
  • The covariance matrix $\Sigma = \frac{1}{n}X_c^\top X_c$ describes a data cloud's shape: diagonal = variances, off-diagonal = covariances.
  • PCA eigen-decomposes $\Sigma$. Principal components are its eigenvectors (directions of maximum variance); eigenvalues are the variance explained. Reducing to top-$k$ components projects onto them; the discarded eigenvalue mass is the information lost.

Active recall

Answer from memory before checking the lesson:

  1. State the defining equation of an eigenpair and describe it geometrically in one sentence.
  2. Write the characteristic equation and use it to find the eigenvalues of $\begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}$.
  3. Why are the principal components of any dataset guaranteed to be orthogonal?
  4. In PCA, what quantity equals the variance explained by a component, and what does the discarded portion of that quantity measure?
  5. Name the single preprocessing step you must never skip before computing the covariance matrix, and say what goes wrong if you do.

Exercises

Level A: Recall & basic calculation

Level B: Conceptual understanding

Level C: Derivation & implementation

Level D: Research-thinking challenge