Eigenvalues, Eigenvectors, and PCA Intuition
Invariant directions and the axes of variance
Learning objectives
- Define eigenvectors/eigenvalues and read Av = λv geometrically
- Explain covariance as a shape descriptor of data
- Describe PCA as an eigen-decomposition of the covariance matrix
- Reason about dimensionality reduction and information loss
When a matrix leaves a direction alone
A matrix is a transformation: feed it a vector and it hands you back another vector, generally rotated, stretched, and sheared into a new place. For almost every input direction the output points somewhere new. But for a handful of special directions the matrix does something remarkably tame — it leaves the direction exactly where it was and merely scales the arrow along that line.
Those special directions are the eigenvectors of the matrix, and the scale factors are its eigenvalues. They are the skeleton of the transformation: the axes along which it acts most simply. This one idea powers a surprising amount of machine learning — principal component analysis, spectral clustering, PageRank, the stability analysis of training dynamics, and the intuition behind why some directions in a model's weight space matter far more than others.
We will keep everything in two dimensions so you can compute it by hand and watch it on screen. The mechanics that scale this to hundreds of dimensions — the singular value decomposition — belong to Volume 2; here we build the intuition that makes that machinery feel inevitable.
Intuition: eigenvectors are the axes a matrix does not turn
Apply a matrix $A$ to a vector $v$ and, in general, $Av$ points in a new direction. An eigenvector is a nonzero $v$ for which the output lands right back on the same line through the origin:

$$Av = \lambda v$$

Read that geometrically: transforming $v$ produces a vector pointing the same way (or exactly opposite, if $\lambda < 0$), only longer or shorter by the factor $|\lambda|$. The direction is invariant; only the length changes. If $\lambda = 2$ the arrow doubles; if $\lambda = 1/2$ it halves; if $\lambda = -1$ it flips end-to-end but stays on its own line.
Drag the input vector around the circle and watch the output. For most angles the blue input and the red output diverge. But rotate the input onto an eigenvector and the two arrows snap onto the same line — the matrix stops turning it and only scales it. Those alignment angles are the eigenvectors; the length ratio at the alignment is the eigenvalue.
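The same experiment can be approximated numerically. The sketch below sweeps a unit vector around half the circle and reports the angles at which input and output fall on one line. The matrix `A` is a hypothetical example chosen for illustration, not one taken from the lesson.

```python
import numpy as np

# Hypothetical symmetric example matrix, chosen only for illustration.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

def misalignment(theta):
    """Return |cross(v, Av)| for the unit vector v at angle theta.

    The 2-D cross product is zero exactly when Av lies on the line through v.
    """
    v = np.array([np.cos(theta), np.sin(theta)])
    Av = A @ v
    return abs(v[0] * Av[1] - v[1] * Av[0])

# Sweep half the circle only: v and -v span the same line.
angles = np.linspace(0.0, np.pi, 181)           # 1-degree steps
scores = np.array([misalignment(t) for t in angles])

aligned = angles[scores < 1e-9]
aligned_eigenvalues = []
for theta in aligned:
    v = np.array([np.cos(theta), np.sin(theta)])
    lam = v @ (A @ v)                            # length ratio along v (v has unit length)
    aligned_eigenvalues.append(lam)
    print(f"angle = {np.degrees(theta):6.1f} deg, eigenvalue = {lam:.3f}")
```

For this particular matrix the alignment angles are 45 and 135 degrees; a different matrix would generally need a finer grid or a root finder, since the eigenvector angles need not land on whole degrees.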
Now the leap that makes this an ML tool. Suppose the matrix is not an arbitrary transformation but the covariance matrix of a cloud of data. Its eigenvectors are then the directions along which the cloud is stretched, and its eigenvalues measure how much spread there is along each. The longest axis of the cloud — the direction of maximum variance — is the top eigenvector. That is PCA in one sentence, and it is worth seeing before we formalize anything.
Rotate and stretch the data cloud. The first principal component (the long axis of the ellipse) always locks onto the direction of greatest spread, and the reported "explained variance" is exactly the eigenvalue along that axis. Keep this picture in mind — every definition below is just this drawing made precise.
Formal definitions
| Symbol | Meaning | Type | Shape | Role |
|---|---|---|---|---|
| $A$ | A square matrix (a linear transformation) | matrix | n×n | fixed |
| $v$ | An eigenvector (invariant direction) | vector | n×1 | variable |
| $\lambda$ | Eigenvalue (scale factor along v) | scalar | 1 | variable |
| $I$ | Identity matrix | matrix | n×n | fixed |
| $X$ | Data matrix (one sample per row) | matrix | n×d | fixed |
| $X_c$ | Centered data (column means removed) | matrix | n×d | derived |
| $C$ | Covariance matrix (shape of the cloud) | matrix | d×d | derived |
| $\lambda_k$ | Variance explained by component k | scalar | 1 | derived |
A 2×2 eigen-computation by hand
The characteristic equation is worth grinding through once so it stops being magic. Take the symmetric matrix
Form $A - \lambda I$ and set its determinant to zero, $\det(A - \lambda I) = 0$:
This factors as , so the eigenvalues are and . A useful sanity check: the eigenvalues must sum to the trace () and multiply to the determinant ().
Now find each eigenvector by solving $(A - \lambda I)v = 0$.
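As an illustration of the full procedure, here is the same sequence of steps on a stand-in matrix. The choice $A = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}$ is an assumption made for easy arithmetic and is not necessarily the matrix used in the computation above.

```latex
\begin{aligned}
A - \lambda I &= \begin{bmatrix} 2-\lambda & 1 \\ 1 & 2-\lambda \end{bmatrix} \\
\det(A - \lambda I) &= (2-\lambda)^2 - 1 = \lambda^2 - 4\lambda + 3 = (\lambda - 3)(\lambda - 1) = 0 \\
\lambda_1 &= 3, \qquad \lambda_2 = 1 \\
\lambda_1 + \lambda_2 &= 4 = \operatorname{tr}(A), \qquad \lambda_1 \lambda_2 = 3 = \det(A) \\
(A - 3I)\,v = 0 &\;\Rightarrow\; -v_1 + v_2 = 0 \;\Rightarrow\; v \propto (1,\, 1) \\
(A - 1I)\,v = 0 &\;\Rightarrow\; v_1 + v_2 = 0 \;\Rightarrow\; v \propto (1,\, -1)
\end{aligned}
```

The two eigenvectors are perpendicular, as expected for a symmetric matrix, and any nonzero multiple of each is an equally valid answer.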
Let NumPy confirm the hand computation. np.linalg.eig returns eigenvalues and eigenvectors (as the columns of the returned matrix):
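A minimal sketch of that check follows. The matrix `A` here is an assumed stand-in (symmetric, with small integer entries); swap in whichever matrix you worked by hand.

```python
import numpy as np

# Stand-in symmetric matrix (an assumption for illustration).
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)    # eigenvectors[:, i] pairs with eigenvalues[i]

# np.linalg.eig does not promise any ordering, so sort by descending eigenvalue.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

print("eigenvalues:", eigenvalues)
print("eigenvectors (columns):\n", eigenvectors)

# Sanity checks: sum equals the trace, product equals the determinant.
assert np.isclose(eigenvalues.sum(), np.trace(A))
assert np.isclose(eigenvalues.prod(), np.linalg.det(A))

# Each column satisfies A v = lambda v.
for lam, v in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(A @ v, lam * v)
```

The returned eigenvectors are normalized to unit length, and their signs may differ from a hand computation; both versions describe the same lines.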
Symmetric matrices: real eigenvalues, orthogonal eigenvectors
A general matrix can have complex eigenvalues (a 2-D rotation by any angle other than 0° or 180° has no real eigenvalues, because nothing stays on its own line). But symmetric matrices, where $A = A^\top$, are guaranteed two beautiful properties:
- Every eigenvalue is a real number.
- Eigenvectors for distinct eigenvalues are orthogonal, and one can always choose a full orthonormal set of them.
This is the spectral theorem, and it is exactly why PCA works so cleanly: a covariance matrix is symmetric by construction ($C^\top = C$), so its principal components are real directions that are mutually perpendicular, giving a clean, non-redundant set of axes for the data.
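The contrast between a rotation and a symmetric matrix is easy to see numerically. In this sketch the symmetric matrix `S` uses made-up example values; `np.linalg.eigh` is NumPy's routine for symmetric (Hermitian) input.

```python
import numpy as np

# A 90-degree rotation: no real direction stays on its own line.
R = np.array([[0.0, -1.0],
              [1.0,  0.0]])
rot_vals, _ = np.linalg.eig(R)
print("rotation eigenvalues:", rot_vals)         # a complex-conjugate pair

# A symmetric matrix (hypothetical example values).
S = np.array([[4.0, 1.5],
              [1.5, 1.0]])
sym_vals, sym_vecs = np.linalg.eigh(S)           # eigh assumes symmetric/Hermitian input

print("symmetric eigenvalues:", sym_vals)        # real, in ascending order
# The columns are orthonormal, so V^T V is the identity.
print("V^T V =\n", np.round(sym_vecs.T @ sym_vecs, 12))
```

Using `eigh` rather than `eig` for symmetric input is a design choice: it returns real values directly and orders them, which makes the later sorting step in PCA straightforward.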
The covariance matrix is a shape descriptor
Before eigenvectors, understand what the covariance matrix $C$ itself says. For 2-D centered data with features $x$ and $y$, its entries are

$$C = \begin{bmatrix} \operatorname{var}(x) & \operatorname{cov}(x, y) \\ \operatorname{cov}(x, y) & \operatorname{var}(y) \end{bmatrix}$$

The diagonal tells you how far the cloud spreads along each raw axis. The off-diagonal tells you whether the features move together: positive covariance tilts the cloud up-and-to-the-right, negative tilts it the other way, and zero covariance means the cloud is axis-aligned. Two datasets can have identical diagonals but wildly different shapes depending on that off-diagonal term, which sets the tilt of the ellipse. PCA's job is to find the axes of that tilted ellipse, which are precisely the eigenvectors of $C$.
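To make the "same diagonal, different shape" point concrete, the sketch below builds two synthetic clouds whose per-feature variances are both close to 1 but whose off-diagonal terms differ. The data, the mixing matrix, and the helper name `covariance` are illustrative assumptions, and the $n-1$ normalization is one common convention.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Two synthetic clouds (assumed toy data): each feature has variance near 1
# in both, but only the second cloud has correlated features.
independent = rng.normal(size=(n, 2))
mixing = np.array([[1.0, 0.0],
                   [0.8, 0.6]])                  # unit-length rows keep the variances near 1
tilted = rng.normal(size=(n, 2)) @ mixing.T

def covariance(X):
    """Covariance of the columns of X (one sample per row), n-1 normalization."""
    X_c = X - X.mean(axis=0)                     # center each feature
    return X_c.T @ X_c / (X.shape[0] - 1)

C_round = covariance(independent)
C_tilted = covariance(tilted)
print("axis-aligned cloud:\n", np.round(C_round, 2))
print("tilted cloud:\n", np.round(C_tilted, 2))

# The hand-rolled formula agrees with NumPy's built-in estimator.
assert np.allclose(C_tilted, np.cov(tilted, rowvar=False))
```

Both matrices have diagonals near 1, yet the second has an off-diagonal near 0.8, which is the tilt that PCA's eigenvectors pick up.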
Why the top eigenvector maximizes variance
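Here is the derivation that links "direction of maximum spread" to "top eigenvector." Project every centered point $x_i$ onto a candidate unit direction $w$ (so the projected coordinate of point $i$ is $w^\top x_i$). The variance of those projections is

$$\frac{1}{n-1}\sum_{i=1}^{n} (w^\top x_i)^2 = w^\top \left(\frac{1}{n-1} X_c^\top X_c\right) w = w^\top C w$$

PCA asks: which unit direction makes this projected variance as large as possible? We maximize $w^\top C w$ subject to $w^\top w = 1$. Introducing a Lagrange multiplier $\lambda$ for the constraint and setting the derivative of $w^\top C w - \lambda (w^\top w - 1)$ with respect to $w$ to zero gives

$$2Cw - 2\lambda w = 0 \quad\Longrightarrow\quad Cw = \lambda w$$

So every stationary direction is an eigenvector of $C$, and its projected variance is $w^\top C w = \lambda\, w^\top w = \lambda$. The maximum is therefore attained at the eigenvector with the largest eigenvalue.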
That is the whole engine: principal components are the eigenvectors of the covariance matrix, ordered by eigenvalue, and each eigenvalue is the variance captured along its component.
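A brute-force check of this claim: sweep unit directions, compute the projected variance for each, and compare the best one against the top eigenpair. The covariance matrix `C` below is a hypothetical example, and the direction is written as `w` purely for this sketch.

```python
import numpy as np

# Hypothetical 2-D covariance matrix (symmetric, positive definite).
C = np.array([[3.0, 1.2],
              [1.2, 1.0]])

# Projected variance w^T C w for unit directions w = (cos t, sin t).
angles = np.linspace(0.0, np.pi, 1801)
W = np.stack([np.cos(angles), np.sin(angles)], axis=1)    # shape (1801, 2)
projected_var = np.einsum("ij,jk,ik->i", W, C, W)         # w_i^T C w_i for every row

best = np.argmax(projected_var)
w_best = W[best]

eigvals, eigvecs = np.linalg.eigh(C)                      # ascending eigenvalues
top_val, top_vec = eigvals[-1], eigvecs[:, -1]

print("max projected variance:   ", projected_var[best])
print("largest eigenvalue:       ", top_val)
print("|cos(angle between them)|:", abs(w_best @ top_vec))  # near 1; the sign is arbitrary
```

No direction on the grid beats the largest eigenvalue, and the best grid direction agrees with the top eigenvector up to the grid resolution.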
ML use case: dimensionality reduction and beyond
Real feature vectors and embeddings live in hundreds or thousands of dimensions, but their intrinsic variation often occupies far fewer. PCA exploits this.
- Dimensionality reduction. Keep only the top $k$ components and project onto them: $Z = X_c W_k$, where $W_k$ holds the top $k$ eigenvectors as columns. Each row of $X_c$ (dimension $d$) becomes a row of $Z$ (dimension $k$). You have compressed the data while keeping the directions that carry the most variance.
- Explained variance and information loss. The fraction of variance kept is the eigenvalue mass you retained, $\sum_{i=1}^{k} \lambda_i \,/\, \sum_{i=1}^{d} \lambda_i$. The variance you threw away is the discarded eigenvalue mass, a precise, honest measure of information loss. A scree plot of eigenvalues tells you where the spectrum flattens and extra components stop paying rent.
- Visualizing embeddings. Projecting a 768-dimensional embedding set onto its top 2 components is the standard way to see structure — clusters of similar words or images — on a flat plot.
- Whitening. Rotating data onto its principal axes and then rescaling each coordinate by $1/\sqrt{\lambda_k}$ produces features with identity covariance (unit variance, uncorrelated), a common preprocessing step that can stabilize training.
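The first, second, and fourth uses above can be sketched together on synthetic data. Everything about the dataset (5 features, 2 dominant latent directions, the noise level) is an assumption made for the example, as are the variable names `W_k`, `Z`, and `Z_white`.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic 5-D data with most variance in two latent directions (toy assumption).
n, d, k = 1000, 5, 2
latent = rng.normal(size=(n, k)) * np.array([5.0, 2.0])
X = latent @ rng.normal(size=(k, d)) + 0.1 * rng.normal(size=(n, d))

X_c = X - X.mean(axis=0)
C = X_c.T @ X_c / (n - 1)

eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]                # descending eigenvalues
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

W_k = eigvecs[:, :k]                             # top-k eigenvectors as columns, shape (d, k)
Z = X_c @ W_k                                    # reduced data, shape (n, k)

kept = eigvals[:k].sum() / eigvals.sum()         # explained-variance fraction
print("Z shape:", Z.shape)
print(f"variance kept by top {k}: {kept:.4f}, discarded: {1 - kept:.4f}")

# Whitening: rescale each retained component by 1/sqrt(lambda).
Z_white = Z / np.sqrt(eigvals[:k])
print("covariance of whitened components:\n",
      np.round(np.cov(Z_white, rowvar=False), 6))
```

In practice a small constant is often added under the square root when whitening, because dividing by a near-zero eigenvalue amplifies noise; that safeguard is omitted here to keep the sketch short.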
PCA from scratch in NumPy
Now assemble the full pipeline — center, covariance, eigen-decompose, sort, project — on a small correlated 2-D dataset, and verify the two facts we derived: the projected variances equal the eigenvalues, and the ratios sum to one.
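One possible version of that pipeline is sketched below. The dataset is an assumption: points scattered along the direction (1, 1) with a little isotropic noise, generated with a fixed seed. The steps follow the order listed above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset (an assumption): points spread along the direction (1, 1)
# with a little isotropic noise on top.
n = 500
direction = np.array([1.0, 1.0]) / np.sqrt(2.0)
t = rng.normal(scale=3.0, size=n)                # position along the axis
X = np.outer(t, direction) + rng.normal(scale=0.5, size=(n, 2))

# 1. Center.
X_c = X - X.mean(axis=0)
# 2. Covariance (n-1 normalization).
C = X_c.T @ X_c / (n - 1)
# 3. Eigen-decompose; eigh suits symmetric matrices and returns ascending eigenvalues.
eigvals, eigvecs = np.linalg.eigh(C)
# 4. Sort by descending eigenvalue.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
# 5. Project onto the principal axes.
Z = X_c @ eigvecs

ratios = eigvals / eigvals.sum()
print("PC1 direction:", eigvecs[:, 0])
print("eigenvalues:  ", eigvals)
print("explained variance ratios:", ratios)

# Fact 1: the variance of each projected coordinate equals its eigenvalue.
assert np.allclose(Z.var(axis=0, ddof=1), eigvals)
# Fact 2: the explained-variance ratios sum to one.
assert np.isclose(ratios.sum(), 1.0)
```

The `ddof=1` in the variance check matters: it matches the $n-1$ normalization used for `C`, and mixing the two conventions would make the first assertion fail by a factor of $n/(n-1)$.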
The recovered PC1 direction is (up to sign) the unit vector along — the axis we built the data around — and over 90% of the variance lives there, so collapsing to one dimension loses very little. That is the payoff: a principled, measurable trade between compression and fidelity.
Research-paper equation practice
Decode these two equations with the full nine-step drill before revealing the solution. Together they are the entire foundation of this chapter.
The eigenvalue equation
$$Av = \lambda v$$

The definition of an eigenpair. It appears wherever a matrix has invariant directions: PCA, spectral methods, stability analysis.
Work through these steps:
- Identify every symbol.
- State the type of every object (scalar, vector, matrix, index, set, function).
- State the dimensions / shapes.
- Rewrite the equation in plain English.
- Expand it for a tiny concrete example.
- Identify the assumptions.
- Convert it to pseudocode.
- Implement it in NumPy.
- Explain its machine-learning purpose.
The covariance matrix
$$C = \frac{1}{n-1} X_c^\top X_c$$

The shape descriptor of a centered dataset, and the matrix PCA eigen-decomposes. $X_c$ holds one centered sample per row, and $n$ is the number of samples.
Work through these steps:
- Identify every symbol.
- State the type of every object (scalar, vector, matrix, index, set, function).
- State the dimensions / shapes.
- Rewrite the equation in plain English.
- Expand it for a tiny concrete example.
- Identify the assumptions.
- Convert it to pseudocode.
- Implement it in NumPy.
- Explain its machine-learning purpose.
Summary
- An eigenvector $v$ of $A$ satisfies $Av = \lambda v$: the matrix leaves its direction invariant and only scales it by the eigenvalue $\lambda$. Eigenvectors define lines (any nonzero multiple is also an eigenvector), so they carry a sign ambiguity.
- Eigenvalues are the roots of the characteristic equation $\det(A - \lambda I) = 0$. For a $2 \times 2$ matrix this is a quadratic whose roots sum to the trace and multiply to the determinant.
- Symmetric matrices have real eigenvalues and orthogonal eigenvectors (the spectral theorem) — the reason PCA yields clean, perpendicular axes.
- The covariance matrix $C$ describes a data cloud's shape: diagonal = variances, off-diagonal = covariances (how pairs of features move together).
- PCA eigen-decomposes $C$. Principal components are its eigenvectors (directions of maximum variance); eigenvalues are the variance explained. Reducing to the top $k$ components projects onto them; the discarded eigenvalue mass is the information lost.
Active recall
Answer from memory before checking the lesson:
- State the defining equation of an eigenpair and describe it geometrically in one sentence.
- Write the characteristic equation and use it to find the eigenvalues of .
- Why are the principal components of any dataset guaranteed to be orthogonal?
- In PCA, what quantity equals the variance explained by a component, and what does the discarded portion of that quantity measure?
- Name the single preprocessing step you must never skip before computing the covariance matrix, and say what goes wrong if you do.
Exercises
Level A: Recall & basic calculation
Verify an eigenvector
Let and . Compute ; it should equal for some scalar. Enter the eigenvalue .
Eigenvalues of a triangular matrix
Give the larger eigenvalue of the upper-triangular matrix .
Sum of eigenvalues = trace
Without solving for them individually, give the sum of the two eigenvalues of .
Explained variance ratio
A 2-D PCA yields covariance eigenvalues and . What fraction of the total variance is explained by the first principal component?
Eigenvectors of a symmetric matrix
A covariance matrix is symmetric. What is guaranteed about the eigenvectors belonging to its distinct eigenvalues?
What an eigenvalue means in PCA
In PCA, the eigenvalue $\lambda_k$ of the covariance matrix (for principal component $k$) equals which quantity?
Level B: Conceptual understanding
Geometric meaning of an eigenvector
Which statement best describes an eigenvector of a matrix geometrically?
Why center before PCA?
Explain in a sentence or two why the data must be centered (mean subtracted) before forming the covariance matrix for PCA. What can the first component pick up if you forget?
Eigenvector sign ambiguity
You run the same PCA twice. The first principal component comes back as a vector $v$ one time and as $-v$ the next. Which explanation is correct?
Shape of the covariance matrix
Your data matrix $X$ has shape $1000 \times 50$ (1000 samples, 50 features). What is the shape of the covariance matrix $C$?
Zero covariance vs independence
The off-diagonal entry of a covariance matrix is $0$. Explain what this does and does not guarantee about the two features, and give a concrete example where covariance is zero yet the features are perfectly dependent.
Level C: Derivation & implementation
Eigenvalues from the characteristic equation
Use the characteristic equation to find both eigenvalues of . Enter them as larger, smaller.
Implement PCA projection from scratch
Write pca_project(X, k) that centers X (shape (n, d)), forms the covariance matrix $C$ of the centered data, eigen-decomposes it with np.linalg.eigh, sorts components by descending eigenvalue, and returns the projection of the centered data onto the top k components (shape (n, k)). Test on a correlated 2-D dataset with k=1, assert the projected variance equals the top eigenvalue, and print ok.
Derive: top eigenvector maximizes variance
Show that the unit direction $w$ maximizing the projected variance $w^\top C w$ subject to $w^\top w = 1$ is an eigenvector of the covariance matrix $C$, and that the maximum value equals the largest eigenvalue.
Explained-variance ratio in NumPy
Given eigenvalues of a covariance matrix, write code that computes the cumulative explained-variance ratio and returns the smallest number of components needed to retain at least 90% of the variance. Test on eigenvalues [10.0, 4.0, 1.0, 0.5, 0.5], assert the answer is 3, and print ok.
Level D: Research-thinking challenge
When does PCA fail?
PCA finds the best linear subspace. Describe a concrete dataset whose true structure is 1-dimensional but which PCA cannot compress to one component without large error, explain geometrically why PCA fails, and name one family of methods designed to handle it.
Max variance is not always max usefulness
PCA keeps the directions of largest variance. Argue, with a concrete scenario, why the highest-variance direction can be the wrong thing to keep for a downstream classification task, and contrast PCA's objective with what a supervised method (e.g. LDA) optimizes instead.