Eigenvalues, Eigenvectors, and PCA Intuition

When a matrix leaves a direction alone

A matrix is a transformation: feed it a vector and it hands you back another vector, generally rotated, stretched, and sheared into a new place. For almost every input direction the output points somewhere new. But for a handful of special directions the matrix does something remarkably tame — it leaves the direction exactly where it was and merely scales the arrow along that line.

Those special directions are the eigenvectors of the matrix, and the scale factors are its eigenvalues. They are the skeleton of the transformation: the axes along which it acts most simply. This one idea powers a surprising amount of machine learning — principal component analysis, spectral clustering, PageRank, the stability analysis of training dynamics, and the intuition behind why some directions in a model's weight space matter far more than others.

We will keep everything in two dimensions so you can compute it by hand and watch it on screen. The mechanics that scale this to hundreds of dimensions — the singular value decomposition — belong to Volume 2; here we build the intuition that makes that machinery feel inevitable.

Intuition: eigenvectors are the axes a matrix does not turn

Apply a matrix $A$ to a vector $\mathbf{v}$ and, in general, $A\mathbf{v}$ points in a new direction. An eigenvector is a nonzero $\mathbf{v}$ for which the output lands right back on the same line through the origin:

$A\mathbf{v} = \lambda\mathbf{v}.$

Read that geometrically: transforming $\mathbf{v}$ produces a vector pointing the same way (or exactly opposite, if $\lambda < 0$), only longer or shorter by the factor $|\lambda|$. The line is invariant; only the length (and, for negative $\lambda$, the sense) changes. If $\lambda = 2$ the arrow doubles; if $\lambda = 0.5$ it halves; if $\lambda = -1$ it flips end-to-end but stays on its own line.
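To make the geometric reading concrete, here is a minimal sketch that tests whether a matrix keeps a vector on its own line. The helper name `stays_on_line` and the test matrix are illustrative choices, not part of any library; the check uses the 2-D cross product, which is zero exactly when two vectors are collinear.

```python
import numpy as np

def stays_on_line(A, v, tol=1e-9):
    """Return True if A @ v is parallel (or anti-parallel) to v in 2-D."""
    out = A @ v
    # 2-D cross product: zero exactly when the two vectors are collinear
    cross = v[0] * out[1] - v[1] * out[0]
    return bool(abs(cross) < tol)

# Illustrative symmetric matrix (an assumption of this sketch)
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

generic = np.array([1.0, 0.0])    # an arbitrary direction
special = np.array([1.0, 1.0])    # an eigenvector direction of this A

print(stays_on_line(A, generic))  # False: (1, 0) is sent to (2, 1)
print(stays_on_line(A, special))  # True: (1, 1) is sent to (3, 3)
```

The ratio of output length to input length in the aligned case is the eigenvalue, here 3.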

Interactive Lab: Matrix Transformation Visualizer

Drag the input vector around the circle and watch the output. For most angles the blue input and the red output diverge. But rotate the input onto an eigenvector and the two arrows snap onto the same line — the matrix stops turning it and only scales it. Those alignment angles are the eigenvectors; the length ratio at the alignment is the eigenvalue.

Now the leap that makes this an ML tool. Suppose the matrix is not an arbitrary transformation but the covariance matrix of a cloud of data. Its eigenvectors are then the directions along which the cloud is stretched, and its eigenvalues measure how much spread there is along each. The longest axis of the cloud — the direction of maximum variance — is the top eigenvector. That is PCA in one sentence, and it is worth seeing before we formalize anything.

Interactive Lab: PCA Intuition Visualizer

Rotate and stretch the data cloud. The first principal component (the long axis of the ellipse) always locks onto the direction of greatest spread, and the reported "explained variance" is exactly the eigenvalue along that axis. Keep this picture in mind — every definition below is just this drawing made precise.

Formal definitions

Symbol	Meaning	Type	Shape	Role
$A$	A square matrix (a linear transformation)	matrix	n×n	fixed
$\mathbf{v}$	An eigenvector (invariant direction)	vector	n×1	variable
$\lambda$	Eigenvalue (scale factor along v)	scalar	1	variable
$I$	Identity matrix	matrix	n×n	fixed
$X$	Data matrix (one sample per row)	matrix	n×d	fixed
$X_c$	Centered data (column means removed)	matrix	n×d	derived
$\Sigma$	Covariance matrix (shape of the cloud)	matrix	d×d	derived
$\lambda_k$	Variance explained by component k	scalar	1	derived

A 2×2 eigen-computation by hand

The characteristic equation is worth grinding through once so it stops being magic. Take the symmetric matrix

$A = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}. \qquad (11.1)$

Form $A - \lambda I$ and set its determinant to zero:

$\det\!\begin{bmatrix} 2-\lambda & 1 \\ 1 & 2-\lambda \end{bmatrix} = (2-\lambda)^2 - (1)(1) = \lambda^2 - 4\lambda + 3 = 0. \qquad (11.2)$

This factors as $(\lambda - 3)(\lambda - 1) = 0$, so the eigenvalues are $\lambda_1 = 3$ and $\lambda_2 = 1$. A useful sanity check: the eigenvalues must sum to the trace ($2 + 2 = 4 = 3 + 1$) and multiply to the determinant ($2\cdot2 - 1\cdot1 = 3 = 3\cdot1$).
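The trace and determinant check can also be run numerically. The sketch below builds the characteristic polynomial of a $2\times2$ matrix from its trace and determinant and hands the coefficients to `np.roots`; the variable names are illustrative.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

trace = np.trace(A)            # 4.0
det = np.linalg.det(A)         # 3.0, up to floating-point rounding

# Characteristic polynomial of a 2x2 matrix: lambda^2 - trace*lambda + det
roots = np.roots([1.0, -trace, det])
roots = np.sort(roots.real)[::-1]          # largest first

print(np.round(roots, 6))
assert np.isclose(roots.sum(), trace)      # eigenvalues sum to the trace
assert np.isclose(roots.prod(), det)       # eigenvalues multiply to the determinant
```

This shortcut is specific to the $2\times2$ case; for larger matrices a dedicated eigenvalue routine is the usual choice rather than polynomial root finding.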

Now find each eigenvector by solving $(A - \lambda I)\mathbf{v} = \mathbf{0}$.

Worked Example — eigenvectors of A

For $\lambda_1 = 3$: $A - 3I = \begin{bmatrix} -1 & 1 \\ 1 & -1 \end{bmatrix}$. The equation $-v_1 + v_2 = 0$ forces $v_1 = v_2$, so $\mathbf{v}_1 \propto (1, 1)$. As a unit vector, $\mathbf{v}_1 = \tfrac{1}{\sqrt{2}}(1, 1)$.

For $\lambda_2 = 1$: $A - I = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}$. The equation $v_1 + v_2 = 0$ forces $v_1 = -v_2$, so $\mathbf{v}_2 \propto (1, -1)$, i.e. $\mathbf{v}_2 = \tfrac{1}{\sqrt{2}}(1, -1)$.

Check: $A\,(1,1) = (3, 3) = 3\,(1,1)$ and the two eigenvectors are orthogonal: $(1,1)\cdot(1,-1) = 1 - 1 = 0$. That orthogonality is not luck: it is guaranteed because $A$ is symmetric.

Let NumPy confirm the hand computation. np.linalg.eig returns eigenvalues and eigenvectors (as the columns of the returned matrix):

eig_by_hand.py

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eig returns (eigenvalues, matrix whose COLUMNS are eigenvectors)
evals, evecs = np.linalg.eig(A)

# Put the largest eigenvalue first for a stable ordering
order = np.argsort(evals)[::-1]
evals = evals[order]
evecs = evecs[:, order]

print("eigenvalues:", np.round(evals, 3))   # [3. 1.]

# Verify the defining relation A v = lambda v for the top eigenpair
v = evecs[:, 0]
assert np.allclose(A @ v, evals[0] * v), "A v must equal lambda v"

# Symmetric matrix -> eigenvectors are orthogonal
assert np.isclose(evecs[:, 0] @ evecs[:, 1], 0.0), "must be orthogonal"

print("ok")

Symmetric matrices: real eigenvalues, orthogonal eigenvectors

A general matrix can have complex eigenvalues: a rotation by any angle other than 0° or 180° has no real eigenvalues at all, because no direction stays on its own line. But symmetric matrices, where $A = A^\top$, are guaranteed two beautiful properties:

Every eigenvalue is a real number.
Eigenvectors for distinct eigenvalues are orthogonal, and one can always choose a full orthonormal set of them.

This is the spectral theorem, and it is exactly why PCA works so cleanly: a covariance matrix is symmetric by construction ($\Sigma_{jk} = \Sigma_{kj}$), so its principal components are real directions that can always be chosen mutually perpendicular, giving a clean, non-redundant set of axes for the data.
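Both halves of that contrast can be seen in a few lines. The sketch below compares a 90-degree rotation with a symmetric matrix; the two matrices are illustrative choices, and `np.linalg.eigh` is NumPy's routine for symmetric (Hermitian) input.

```python
import numpy as np

# A 90-degree rotation: no real direction is left on its own line
R = np.array([[0.0, -1.0],
              [1.0,  0.0]])
rot_evals = np.linalg.eigvals(R)
print(rot_evals)                      # a complex-conjugate pair, +1j and -1j

# A symmetric matrix: eigh exploits the symmetry
S = np.array([[2.0, 1.0],
              [1.0, 2.0]])
evals, Q = np.linalg.eigh(S)          # evals ascending, columns of Q orthonormal

assert np.all(np.iscomplex(rot_evals))             # rotation: nothing real
assert np.isrealobj(evals)                         # symmetric: all real
assert np.allclose(Q.T @ Q, np.eye(2))             # orthonormal eigenvectors
assert np.allclose(Q @ np.diag(evals) @ Q.T, S)    # S = Q diag(evals) Q^T
```

The last assertion is the spectral theorem in matrix form: a symmetric matrix is fully rebuilt from its eigenvalues and an orthonormal set of eigenvectors.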

The covariance matrix is a shape descriptor

Before eigenvectors, understand what $\Sigma$ itself says. For 2-D centered data its entries are

$\Sigma = \begin{bmatrix} \operatorname{Var}(x) & \operatorname{Cov}(x,y) \\ \operatorname{Cov}(x,y) & \operatorname{Var}(y) \end{bmatrix}. \qquad (11.3)$

The diagonal tells you how far the cloud spreads along each raw axis. The off-diagonal tells you whether the features move together: positive covariance tilts the cloud up-and-to-the-right, negative tilts it the other way, and zero covariance means the cloud is axis-aligned. Two datasets can have identical diagonals but wildly different shapes depending on that off-diagonal term, which sets the tilt of the ellipse. PCA's job is to find the axes of that tilted ellipse, which are precisely the eigenvectors of $\Sigma$.
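A short experiment shows how the off-diagonal term alone changes the shape. The sketch below generates three clouds with roughly the same variances but different correlation; the helper `correlated_cloud` and the parameter `rho` are hypothetical names introduced for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
z = rng.standard_normal((n, 2))          # uncorrelated, unit-variance base cloud

def correlated_cloud(rho):
    """Mix the base cloud so both features have variance ~1 and correlation ~rho."""
    x = z[:, 0]
    y = rho * z[:, 0] + np.sqrt(1.0 - rho**2) * z[:, 1]
    return np.column_stack([x, y])

for rho in (0.8, 0.0, -0.8):
    cloud = correlated_cloud(rho)
    cov = np.cov(cloud.T, bias=True)     # population covariance, divide by n
    print(f"rho={rho:+.1f}")
    print(np.round(cov, 2))
```

In each printed matrix the diagonal stays near 1 while the off-diagonal follows `rho`, so the three clouds differ only in tilt.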

Why the top eigenvector maximizes variance

Here is the derivation that links "direction of maximum spread" to "top eigenvector." Project every centered point onto a candidate unit direction $\mathbf{w}$ (so the projected coordinate of point $\mathbf{x}_i$ is $\mathbf{w}^\top\mathbf{x}_i$). The variance of those projections is

$\operatorname{Var}(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n} (\mathbf{w}^\top \mathbf{x}_i)^2 = \mathbf{w}^\top \Sigma\, \mathbf{w}. \qquad (11.4)$

PCA asks: which unit direction $\mathbf{w}$ makes this projected variance as large as possible? We maximize $\mathbf{w}^\top\Sigma\mathbf{w}$ subject to $\lVert\mathbf{w}\rVert^2 = 1$. Introducing a Lagrange multiplier $\lambda$ for the constraint, forming $\mathcal{L} = \mathbf{w}^\top\Sigma\mathbf{w} - \lambda(\mathbf{w}^\top\mathbf{w} - 1)$, and setting its gradient $2\Sigma\mathbf{w} - 2\lambda\mathbf{w}$ to zero gives

$\Sigma\mathbf{w} = \lambda\mathbf{w}.$

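Every eigenvector of $\Sigma$ is therefore a stationary point, and the variance there is $\mathbf{w}^\top\Sigma\mathbf{w} = \lambda\,\mathbf{w}^\top\mathbf{w} = \lambda$, so the maximum is attained at the eigenvector with the largest eigenvalue. That is the whole engine: principal components are the eigenvectors of the covariance matrix, ordered by eigenvalue, and each eigenvalue is the variance captured along its component.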
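The claim can be checked by brute force. As a sketch, the code below sweeps unit directions over a half turn, evaluates $\mathbf{w}^\top\Sigma\mathbf{w}$ for each, and compares the best direction with the top eigenpair. The matrix `Sigma` is an illustrative covariance, and the grid resolution is an arbitrary choice.

```python
import numpy as np

Sigma = np.array([[2.0, 1.0],
                  [1.0, 2.0]])            # illustrative covariance matrix

# Sweep unit directions w(theta) = (cos theta, sin theta) over a half turn
thetas = np.linspace(0.0, np.pi, 1801)
W = np.column_stack([np.cos(thetas), np.sin(thetas)])   # shape (1801, 2)

# Projected variance w^T Sigma w for every direction at once
proj_var = np.einsum("ij,jk,ik->i", W, Sigma, W)

best = W[np.argmax(proj_var)]
evals, evecs = np.linalg.eigh(Sigma)      # ascending order
top_val, top_vec = evals[-1], evecs[:, -1]

assert np.isclose(proj_var.max(), top_val, atol=1e-4)   # max variance = top eigenvalue
assert np.isclose(abs(best @ top_vec), 1.0, atol=1e-4)  # same direction up to sign
print("best direction:", np.round(best, 3))
print("max variance  :", round(float(proj_var.max()), 3))
```

A half turn is enough because $\mathbf{w}$ and $-\mathbf{w}$ give the same projected variance.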

ML use case: dimensionality reduction and beyond

Real feature vectors and embeddings live in hundreds or thousands of dimensions, but their intrinsic variation often occupies far fewer. PCA exploits this.

Dimensionality reduction. Keep only the top $k$ components and project onto them: $Z = X_c W_k$, where $W_k$ holds the top $k$ eigenvectors as columns. Each row of $X_c$ (dimension $d$) becomes a row of $Z$ (dimension $k$). You have compressed the data while keeping the directions that carry the most variance.
Explained variance and information loss. The fraction of variance kept is the eigenvalue mass you retained, $\sum_{j \le k}\lambda_j \big/ \sum_{j=1}^{d} \lambda_j$. The variance you threw away is the discarded eigenvalue mass, a precise measure of how much variance (and, in that sense, information) was lost. A scree plot of eigenvalues shows where the spectrum flattens and extra components stop paying rent.
Visualizing embeddings. Projecting a 768-dimensional embedding set onto its top 2 components is the standard way to see structure — clusters of similar words or images — on a flat plot.
Whitening. Rotating data onto its principal axes and then rescaling each by $1/\sqrt{\lambda_k}$ produces features with identity covariance (unit variance, uncorrelated) — a common preprocessing step that can stabilize training.
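Of the uses listed above, whitening is the easiest to verify numerically. The following is a minimal sketch on a synthetic correlated dataset; the small constant `eps` is an assumption added here to guard against division by near-zero eigenvalues and is not part of the formula in the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
a = rng.standard_normal(n)
X = np.column_stack([2.0 * a, 1.0 * a]) + 0.3 * rng.standard_normal((n, 2))

Xc = X - X.mean(axis=0)                   # center
cov = (Xc.T @ Xc) / n                     # population covariance
evals, evecs = np.linalg.eigh(cov)

eps = 1e-12                               # assumed guard against near-zero eigenvalues
Z = Xc @ evecs                            # rotate onto the principal axes
Z_white = Z / np.sqrt(evals + eps)        # rescale each axis by 1/sqrt(lambda_k)

cov_white = (Z_white.T @ Z_white) / n
assert np.allclose(cov_white, np.eye(2), atol=1e-8)   # identity covariance
print(np.round(cov_white, 6))
```

When some eigenvalues are tiny, dividing by their square roots amplifies noise along those directions, which is why a larger `eps` or dropping those components is a common practical adjustment.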

PCA from scratch in NumPy

Now assemble the full pipeline — center, covariance, eigen-decompose, sort, project — on a small correlated 2-D dataset, and verify the two facts we derived: the projected variances equal the eigenvalues, and the ratios sum to one.

pca_from_scratch.py

import numpy as np

rng = np.random.default_rng(0)

# --- A small 2-D dataset with strong linear correlation ---
n = 300
a = rng.standard_normal(n)                 # a hidden 1-D factor
X = np.column_stack([2.0 * a, 1.0 * a])    # points lie along direction (2, 1)
X = X + 0.3 * rng.standard_normal((n, 2))  # add small isotropic noise
# X has shape (n, 2): each row is one 2-D sample.

# --- Step 1: CENTER (subtract the per-column mean) ---
mu = X.mean(axis=0)                         # shape (2,)
Xc = X - mu                                 # broadcast subtract -> each feature mean 0

# --- Step 2: covariance matrix (population form, divide by n) ---
cov = (Xc.T @ Xc) / n                       # shape (2, 2), symmetric
assert np.allclose(cov, np.cov(X.T, bias=True))   # matches np.cov(bias=True)

# --- Step 3: eigen-decomposition of the SYMMETRIC covariance ---
evals, evecs = np.linalg.eigh(cov)          # eigh: real evals, orthonormal columns

# --- Step 4: sort into DESCENDING order (top variance first) ---
order = np.argsort(evals)[::-1]
evals = evals[order]
evecs = evecs[:, order]                     # column k = k-th principal component

# --- Step 5: PROJECT centered data onto the components ---
scores = Xc @ evecs                         # shape (n, 2): coordinates in PC space

# Verify: eigenvalues ARE the variances along each component
assert np.allclose(scores.var(axis=0), evals)   # ddof=0 matches the /n covariance

# Explained variance ratio = eigenvalue mass fraction
ratio = evals / evals.sum()
assert np.isclose(ratio.sum(), 1.0)
assert ratio[0] > 0.9                        # nearly all spread is along PC1

# --- Dimensionality reduction: keep PC1 only; loss = discarded eigenvalue mass ---
info_loss = evals[1] / evals.sum()
print("PC1 direction :", np.round(evecs[:, 0], 3))   # ~ (2,1) normalized, up to sign
print("variance ratio:", np.round(ratio, 3))
print("info loss     :", round(float(info_loss), 3))
print("ok")

The recovered PC1 direction is (up to sign) the unit vector along $(2, 1)$ — the axis we built the data around — and over 90% of the variance lives there, so collapsing to one dimension loses very little. That is the payoff: a principled, measurable trade between compression and fidelity.
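The trade between compression and fidelity can be measured directly. As a sketch under the same synthetic setup, the code below keeps one component, maps the compressed data back to 2-D, and checks that the mean squared reconstruction error equals the discarded eigenvalue. The names `W_k`, `X_hat`, and `mse` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
a = rng.standard_normal(n)
X = np.column_stack([2.0 * a, 1.0 * a]) + 0.3 * rng.standard_normal((n, 2))

mu = X.mean(axis=0)
Xc = X - mu
evals, evecs = np.linalg.eigh((Xc.T @ Xc) / n)
order = np.argsort(evals)[::-1]           # descending eigenvalues
evals, evecs = evals[order], evecs[:, order]

k = 1
W_k = evecs[:, :k]                 # top-k components as columns, shape (2, 1)
Z = Xc @ W_k                       # compressed coordinates, shape (n, 1)
X_hat = Z @ W_k.T + mu             # map back to the original 2-D space

# Mean squared reconstruction error per sample
mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
assert np.isclose(mse, evals[k:].sum())   # equals the discarded eigenvalue mass
print("reconstruction MSE:", round(float(mse), 4))
```

The equality holds because the residual of each point is exactly its coordinate along the dropped component, whose variance is the dropped eigenvalue.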

Three traps that quietly break PCA

Center first. PCA assumes zero-mean data; the formula $\Sigma = \frac{1}{n} X_c^\top X_c$ is only a covariance after the column means are removed. Skip the centering and the "first component" often just points at the mean, not the spread.
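A small experiment makes the centering trap visible. In this sketch the cloud is stretched along the x-axis but sits far from the origin along y; the helper `top_direction` is a hypothetical name, and the offset of 10 is an arbitrary illustrative value.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# Cloud stretched along the x-axis, then shifted far from the origin along y
X = rng.standard_normal((n, 2)) * np.array([2.0, 0.5]) + np.array([0.0, 10.0])

def top_direction(M):
    """Top eigenvector of the symmetric matrix M (largest eigenvalue)."""
    evals, evecs = np.linalg.eigh(M)
    return evecs[:, -1]

Xc = X - X.mean(axis=0)
centered = top_direction((Xc.T @ Xc) / n)     # true covariance
uncentered = top_direction((X.T @ X) / n)     # second moment, means NOT removed

print("centered  :", np.round(np.abs(centered), 2))    # close to (1, 0): the spread
print("uncentered:", np.round(np.abs(uncentered), 2))  # close to (0, 1): the mean
```

Without centering, the matrix being decomposed is the covariance plus the outer product of the mean with itself, so a large mean dominates the top eigenvector.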

Eigenvectors have a sign (and scale) ambiguity. If $\mathbf{v}$ is an eigenvector so is $-\mathbf{v}$, so a principal component may come back flipped between runs, libraries, or platforms. Never attach meaning to the raw sign; compare directions up to sign, and fix a convention if you need reproducibility.
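One way to fix a convention is sketched below: flip each eigenvector so that its largest-magnitude entry is positive. The function name `fix_signs` and this particular rule are assumptions of the sketch; any deterministic rule works as long as it is applied everywhere.

```python
import numpy as np

def fix_signs(evecs):
    """Flip each column so its largest-magnitude entry is positive.

    This is one possible convention (an assumption of this sketch);
    any deterministic rule works if it is applied consistently.
    """
    evecs = np.asarray(evecs, dtype=float)
    idx = np.argmax(np.abs(evecs), axis=0)               # pivot row per column
    signs = np.sign(evecs[idx, np.arange(evecs.shape[1])])
    signs[signs == 0] = 1.0                              # leave all-zero columns alone
    return evecs * signs

v = np.array([[0.6], [0.8]])
assert np.allclose(fix_signs(v), fix_signs(-v))         # both flips agree
print(fix_signs(-v).ravel())                            # [0.6 0.8]
```

To compare two components without fixing signs at all, the absolute value of their dot product is enough: it is 1 for unit vectors on the same line.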

Zero correlation is not independence. A zero off-diagonal in $\Sigma$ means the features are linearly uncorrelated, not statistically independent: a ring, or a parabola $y = x^2$ with $x$ symmetric about zero, has zero covariance while the features are strongly dependent. PCA sees only the second-order (linear) shape of the data.
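The parabola case takes only a few lines to confirm. This sketch uses an evenly spaced grid for `x`, an illustrative choice that makes the symmetry about zero exact.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 2001)   # symmetric about zero
y = x ** 2                         # y is a deterministic function of x

cov = np.cov(np.vstack([x, y]), bias=True)
print(np.round(cov, 6))

assert np.isclose(cov[0, 1], 0.0, atol=1e-12)   # zero covariance...
assert np.allclose(y, x ** 2)                   # ...yet y is fully determined by x
```

The covariance vanishes because positive and negative values of `x` contribute equal and opposite amounts, even though knowing `x` tells you `y` exactly.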

Research-paper equation practice

Decode these two equations with the full nine-step drill before revealing the solution. Together they are the entire foundation of this chapter.

The eigenvalue equation

The definition of an eigenpair. It appears wherever a matrix has invariant directions — PCA, spectral methods, stability analysis.

$A\mathbf{v} = \lambda\mathbf{v}, \qquad \mathbf{v} \neq \mathbf{0}$

Work through these steps:

Identify every symbol.
State the type of every object (scalar, vector, matrix, index, set, function).
State the dimensions / shapes.
Rewrite the equation in plain English.
Expand it for a tiny concrete example.
Identify the assumptions.
Convert it to pseudocode.
Implement it in NumPy.
Explain its machine-learning purpose.

The covariance matrix

The shape descriptor of a centered dataset, and the matrix PCA eigen-decomposes. $X_c$ holds one centered sample per row.

$\Sigma = \frac{1}{n} X_c^{\top} X_c$

Work through these steps:

Identify every symbol.
State the type of every object (scalar, vector, matrix, index, set, function).
State the dimensions / shapes.
Rewrite the equation in plain English.
Expand it for a tiny concrete example.
Identify the assumptions.
Convert it to pseudocode.
Implement it in NumPy.
Explain its machine-learning purpose.

Summary

An eigenvector $\mathbf{v}$ of $A$ satisfies $A\mathbf{v} = \lambda\mathbf{v}$: the matrix keeps it on its own line and only scales it by the eigenvalue $\lambda$. Any nonzero multiple of an eigenvector is also an eigenvector, so eigenvectors really define lines and carry a sign (and scale) ambiguity.
Eigenvalues are the roots of the characteristic equation $\det(A - \lambda I) = 0$. For a $2\times2$ matrix this is a quadratic whose roots sum to the trace and multiply to the determinant.
Symmetric matrices have real eigenvalues and orthogonal eigenvectors (the spectral theorem), which is why PCA yields clean, perpendicular axes.
The covariance matrix $\Sigma = \frac{1}{n}X_c^\top X_c$ describes a data cloud's shape: diagonal = variances, off-diagonal = covariances.
PCA eigen-decomposes $\Sigma$. Principal components are its eigenvectors (directions of maximum variance); eigenvalues are the variance explained. Reducing to top-$k$ components projects onto them; the discarded eigenvalue mass is the information lost.

Active recall

Answer from memory before checking the lesson:

State the defining equation of an eigenpair and describe it geometrically in one sentence.
Write the characteristic equation and use it to find the eigenvalues of $\begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}$.
Why are the principal components of any dataset guaranteed to be orthogonal?
In PCA, what quantity equals the variance explained by a component, and what does the discarded portion of that quantity measure?
Name the single preprocessing step you must never skip before computing the covariance matrix, and say what goes wrong if you do.

Exercises

Level A: Recall & basic calculation

Level A | Hand calculation | ch11-A1

Verify an eigenvector

Let $A = \begin{bmatrix} 4 & 1 \\ 2 & 3 \end{bmatrix}$ and $\mathbf{v} = (1, 1)$. Compute $A\mathbf{v}$; it should equal $\lambda\mathbf{v}$ for some scalar. Enter the eigenvalue $\lambda$.

Level A | Hand calculation | ch11-A2

Eigenvalues of a triangular matrix

Give the larger eigenvalue of the upper-triangular matrix $A = \begin{bmatrix} 5 & 2 \\ 0 & 1 \end{bmatrix}$.

Level A | Hand calculation | ch11-A3

Sum of eigenvalues = trace

Without solving for them individually, give the sum of the two eigenvalues of $A = \begin{bmatrix} 6 & 2 \\ 2 & 3 \end{bmatrix}$.

Level A | Hand calculation | ch11-A4

Explained variance ratio

A 2-D PCA yields covariance eigenvalues $\lambda_1 = 8$ and $\lambda_2 = 2$. What fraction of the total variance is explained by the first principal component?

Level A | Equation interpretation | ch11-A5

Eigenvectors of a symmetric matrix

A covariance matrix is symmetric. What is guaranteed about the eigenvectors belonging to its distinct eigenvalues?

Level A | Equation interpretation | ch11-A6

What an eigenvalue means in PCA

In PCA, the eigenvalue $\lambda_k$ of the covariance matrix (for principal component $k$) equals which quantity?

Level B: Conceptual understanding

Level B | Equation interpretation | ch11-B1

Geometric meaning of an eigenvector

Which statement best describes an eigenvector $\mathbf{v}$ of a matrix $A$ geometrically?

Level B | ML application | ch11-B2

Why center before PCA?

Explain in a sentence or two why the data must be centered (mean subtracted) before forming the covariance matrix for PCA. What can the first component pick up if you forget?

Level B | Error identification | ch11-B3

Eigenvector sign ambiguity

You run the same PCA twice. The first principal component returns as $(0.71, 0.71)$ one time and $(-0.71, -0.71)$ the next. Which explanation is correct?

Level B | Shape reasoning | ch11-B4

Shape of the covariance matrix

Your data matrix $X$ has shape $(n, d) = (1000, 50)$: 1000 samples, 50 features. What is the shape of the covariance matrix $\Sigma = \frac{1}{n} X_c^\top X_c$?

Level B | ML application | ch11-B5

Zero covariance vs independence

The off-diagonal entry of a covariance matrix is $0$. Explain what this does and does not guarantee about the two features, and give a concrete example where covariance is zero yet the features are perfectly dependent.

Level C: Derivation & implementation

Level C | Hand calculation | ch11-C1

Eigenvalues from the characteristic equation

Use the characteristic equation $\det(A - \lambda I) = 0$ to find both eigenvalues of $A = \begin{bmatrix} 4 & 2 \\ 1 & 3 \end{bmatrix}$. Enter them as larger, smaller.

Level C | NumPy implementation | ch11-C2

Implement PCA projection from scratch

Write pca_project(X, k) that centers X (shape (n, d)), forms the covariance $\frac{1}{n} X_c^\top X_c$, eigen-decomposes it with np.linalg.eigh, sorts components by descending eigenvalue, and returns the projection of the centered data onto the top k components (shape (n, k)). Test on a correlated 2-D dataset with k=1, assert the projected variance equals the top eigenvalue, and print ok.

Level C | Derivation | ch11-C3

Derive: top eigenvector maximizes variance

Show that the unit direction $\mathbf{w}$ maximizing the projected variance $\mathbf{w}^\top \Sigma \mathbf{w}$ subject to $\lVert\mathbf{w}\rVert^2 = 1$ is an eigenvector of $\Sigma$, and that the maximum value equals the largest eigenvalue.

Level C | NumPy implementation | ch11-C4

Explained-variance ratio in NumPy

Given eigenvalues of a covariance matrix, write code that computes the cumulative explained-variance ratio and returns the smallest number of components needed to retain at least 90% of the variance. Test on eigenvalues [10.0, 4.0, 1.0, 0.5, 0.5], assert the answer is 3, and print ok.

Level D: Research-thinking challenge

Level D | Paper-reading practice | ch11-D1

When does PCA fail?

PCA finds the best linear subspace. Describe a concrete dataset whose true structure is 1-dimensional but which PCA cannot compress to one component without large error, explain geometrically why PCA fails, and name one family of methods designed to handle it.

Level D | Paper-reading practice | ch11-D2

Max variance is not always max usefulness

PCA keeps the directions of largest variance. Argue, with a concrete scenario, why the highest-variance direction can be the wrong thing to keep for a downstream classification task, and contrast PCA's objective with what a supervised method (e.g. LDA) optimizes instead.