Part 3 · Core Linear AlgebraChapter 1080 min

Vector Spaces and Linear Transformations

Span, independence, basis, rank, and the four subspaces

Learning objectives

  • Define span, linear independence, basis, and dimension
  • Read a matrix as a linear transformation of space
  • Reason about column space, null space, and rank
  • Connect rank to information preserved by a layer

Why "where the basis lands" is the whole game

By now a matrix is, for you, a grid of numbers you can multiply against a vector. That view is correct but inert. The view that makes matrices click — the one that turns linear algebra from bookkeeping into a spatial intuition you can reason with — is this: a matrix is a linear transformation of space. It picks up every point, stretches and rotates and shears the whole grid, and sets it back down. And a transformation that acts on infinitely many points is fully described by a tiny amount of data: where the basis vectors land.

That single reframing answers questions that matter in ML. When a weight matrix maps a 4096-dimensional hidden state into a 4096-dimensional output, how much information can it actually carry? When we replace a big matrix with the product of two skinny ones — the trick behind LoRA adapters and embedding tables — what exactly are we giving up? The vocabulary for all of this is span, basis, rank, and null space, and every one of them is a statement about which directions a transformation keeps and which it destroys.

Intuition: a transformation is pinned down by where the basis lands

Start in the plane with the two standard basis vectors e1=(1,0)\mathbf{e}_1 = (1, 0) and e2=(0,1)\mathbf{e}_2 = (0, 1). Every other vector is a linear combination of them: (3,1)=3e1+1e2(3, 1) = 3\mathbf{e}_1 + 1\mathbf{e}_2. Now here is the defining property of a linear transformation TT — it commutes with addition and scaling:

T(cu+dv)=cT(u)+dT(v).T(c\,\mathbf{u} + d\,\mathbf{v}) = c\,T(\mathbf{u}) + d\,T(\mathbf{v}).

Feed (3,1)(3,1) through it and the property lets you pull the transformation apart:

T(3e1+e2)=3T(e1)+T(e2).T(3\mathbf{e}_1 + \mathbf{e}_2) = 3\,T(\mathbf{e}_1) + T(\mathbf{e}_2).

Read that again. To know what TT does to any vector, you only need to know T(e1)T(\mathbf{e}_1) and T(e2)T(\mathbf{e}_2) — the coordinates of the same vector follow along for the ride. So if I tell you e1\mathbf{e}_1 lands on (2,1)(2, 1) and e2\mathbf{e}_2 lands on (1,1)(-1, 1), you can transform anything.

And where do we store those two landing spots? As the columns of a matrix:

A=[2111],Ae1=[21],Ae2=[11].A = \begin{bmatrix} 2 & -1 \\ 1 & 1 \end{bmatrix}, \qquad A\mathbf{e}_1 = \begin{bmatrix} 2 \\ 1 \end{bmatrix}, \qquad A\mathbf{e}_2 = \begin{bmatrix} -1 \\ 1 \end{bmatrix}.

That is the entire secret of matrix–vector multiplication: the columns of AA are the images of the basis vectors, and AxA\mathbf{x} just recombines those columns using x\mathbf{x}'s coordinates as the weights. A matrix is a table of "where the basis lands."

Interactive LabMatrix Transformation Visualizer
Loading interactive lab…

Drag the columns above and watch the whole grid deform with them. Try to make the two columns point along the same line — the grid collapses from a 2-D sheet onto a 1-D line, and a whole dimension of space is crushed to nothing. Hold that picture; it is exactly what rank deficiency means, and we are about to name it.

Formal definitions

These five subspace facts are tied together by one clean accounting identity, the rank–nullity theorem, for an m×nm \times n matrix:

Read it as conservation of dimensions. You start with nn input dimensions. The transformation routes some of them into genuinely distinct outputs (that count is the rank) and collapses the rest to zero (that count is the nullity). Nothing appears or disappears; the nn input dimensions are split between "kept" and "crushed."

Numerical example: are these vectors independent, and what do they span?

Take v1=(1,2)\mathbf{v}_1 = (1, 2) and v2=(2,4)\mathbf{v}_2 = (2, 4). Ask the definition's question: is there a nontrivial c1,c2c_1, c_2 with c1v1+c2v2=0c_1\mathbf{v}_1 + c_2\mathbf{v}_2 = \mathbf{0}? Notice v2=2v1\mathbf{v}_2 = 2\mathbf{v}_1, so 2v1v2=02\mathbf{v}_1 - \mathbf{v}_2 = \mathbf{0} with c1=2, c2=1c_1 = 2,\ c_2 = -1 — nontrivial. The pair is linearly dependent. Their span is not the plane but a single line, the direction (1,2)(1, 2): every combination c1(1,2)+c2(2,4)=(c1+2c2)(1,2)c_1(1,2) + c_2(2,4) = (c_1 + 2c_2)(1, 2) is just a rescaling of one direction.

Now nudge the second vector: v2=(2,5)\mathbf{v}_2 = (2, 5). Suppose c1(1,2)+c2(2,5)=(0,0)c_1(1,2) + c_2(2,5) = (0,0). The two component equations are c1+2c2=0c_1 + 2c_2 = 0 and 2c1+5c2=02c_1 + 5c_2 = 0. From the first, c1=2c2c_1 = -2c_2; substitute into the second: 2(2c2)+5c2=c2=02(-2c_2) + 5c_2 = c_2 = 0, hence c1=0c_1 = 0 too. Only the trivial solution — the vectors are independent, they span all of R2\mathbb{R}^2, and together they form a basis of the plane. Stacked as columns, the first matrix has rank 1 and the second has rank 2. Independence and rank are the same question asked two ways.

Why the columns are the images of the basis vectors

This also explains the collapse you saw in the lab. If two columns are dependent (one is a multiple of the other), they span only a line, so rank=1\operatorname{rank} = 1: no matter what input you feed, the output is stuck on that line. The other input direction — the combination that the dependent columns cancel — is sent to 0\mathbf{0} and lives in the null space. Rank-deficient means some direction of input never comes out the far side.

ML use case: rank bounds capacity, and low-rank is a feature

A dense layer computes y=Wx\mathbf{y} = W\mathbf{x} (plus a bias and a nonlinearity). Everything the linear part can express is confined to Col(W)\operatorname{Col}(W), whose dimension is rank(W)\operatorname{rank}(W). So the rank is a hard ceiling on representational capacity: a layer mapping R1000R1000\mathbb{R}^{1000} \to \mathbb{R}^{1000} but with rank(W)=10\operatorname{rank}(W) = 10 can only ever produce outputs in a 10-dimensional sheet, no matter how the inputs vary. It has 990 directions of null space — 990 input combinations it silently erases. Low rank means lost information.

That loss is sometimes exactly what you want. Two ideas across modern ML lean on it deliberately:

  • Embeddings as a low-dimensional basis. A vocabulary of 50,000 tokens does not need 50,000 independent directions; meaning lives on a much lower-dimensional manifold. An embedding table maps each token to, say, a 256-dimensional vector — choosing a small basis in which "similar" tokens sit close together. The whole premise is that the useful information is low-rank.
  • Low-rank factorization (the LoRA intuition). A full weight update ΔW\Delta W of shape d×dd \times d has d2d^2 parameters. But if the useful update is low-rank, we can write ΔW=BA\Delta W = BA where BB is d×rd \times r and AA is r×dr \times d with rdr \ll d. Since rank(BA)r\operatorname{rank}(BA) \le r, this factorization cannot exceed rank rr — and that is the point: it captures the low-rank part of the adaptation with 2dr2dr parameters instead of d2d^2. When d=4096d = 4096 and r=8r = 8, that is a 256× reduction. The bet, borne out in practice, is that fine-tuning lives in a low-rank subspace, so the discarded directions were not carrying much.

The through-line: rank is the currency of information a linear map moves. Bound it low to save memory and it costs you expressiveness; that trade is favorable exactly when the signal was low-rank to begin with.

NumPy: measuring rank and testing independence

NumPy computes rank directly with np.linalg.matrix_rank, which counts singular values above a numerical tolerance — the robust way to ask "how many independent directions?" without hand-solving systems. Below we build a rank-deficient matrix on purpose (a third column that is a combination of the first two), confirm its rank, and use rank as an independence test. Run it:

rank_and_independence.py

The pattern matrix_rank(M) == M.shape[1] is the practical linear-independence test for a set of column vectors: full column rank means no column is redundant. Prefer matrix_rank over checking the determinant — the determinant only works for square matrices and is numerically fragile, whereas rank is defined for any shape and thresholds singular values sensibly.

Summary

  • A matrix is a linear transformation: it moves all of space, and it is fully determined by where the basis vectors land — those landing spots are its columns. AejA\mathbf{e}_j is the jj-th column.
  • The span of a set is all its linear combinations. A set is linearly independent when no vector is redundant (the only combination giving 0\mathbf{0} is trivial). A basis is an independent spanning set; its size is the dimension.
  • The column space is the span of the columns — the reachable outputs. The null space is the inputs crushed to 0\mathbf{0}. Rank is the dimension of the column space = number of independent columns.
  • Rank–nullity: rank(A)+dimNull(A)=n\operatorname{rank}(A) + \dim\operatorname{Null}(A) = n. Input dimensions are conserved, split between "kept" and "crushed."
  • In ML, rank bounds capacity: a low-rank layer collapses dimensions and loses information. Low-rank factorization (LoRA, embeddings) turns that collapse into a deliberate compression when the signal is genuinely low-rank.
  • In NumPy, np.linalg.matrix_rank(A) measures rank; rank == n_cols tests independence. Independence \ne orthogonality, and least squares needs full column rank for a unique fit.

Active recall

Answer from memory before checking the lesson:

  1. Where do the standard basis vectors "land" under the matrix AA, and how does that relate to the columns of AA?
  2. State the difference between the column space and the null space of AA, including which ambient space each lives in.
  3. A 6×46 \times 4 matrix has rank 3. What is the dimension of its null space, and why? Which theorem did you use?
  4. Give two vectors that are linearly independent but not orthogonal.
  5. In one sentence, why does replacing a d×dd \times d weight matrix with a rank-rr factorization BABA (rdr \ll d) save parameters, and what is the risk?

Exercises

Level ARecall & basic calculation

Level BConceptual understanding

Level CDerivation & implementation

Level DResearch-thinking challenge