Vector Spaces and Linear Transformations

Why "where the basis lands" is the whole game

By now a matrix is, for you, a grid of numbers you can multiply against a vector. That view is correct but inert. The view that makes matrices click — the one that turns linear algebra from bookkeeping into a spatial intuition you can reason with — is this: a matrix is a linear transformation of space. It picks up every point, stretches and rotates and shears the whole grid, and sets it back down. And a transformation that acts on infinitely many points is fully described by a tiny amount of data: where the basis vectors land.

That single reframing answers questions that matter in ML. When a weight matrix maps a 4096-dimensional hidden state into a 4096-dimensional output, how much information can it actually carry? When we replace a big matrix with the product of two skinny ones — the trick behind LoRA adapters and embedding tables — what exactly are we giving up? The vocabulary for all of this is span, basis, rank, and null space, and every one of them is a statement about which directions a transformation keeps and which it destroys.

Intuition: a transformation is pinned down by where the basis lands

Start in the plane with the two standard basis vectors $\mathbf{e}_1 = (1, 0)$ and $\mathbf{e}_2 = (0, 1)$ . Every other vector is a linear combination of them: $(3, 1) = 3\mathbf{e}_1 + 1\mathbf{e}_2$ . Now here is the defining property of a linear transformation $T$ — it commutes with addition and scaling:

$T(c\,\mathbf{u} + d\,\mathbf{v}) = c\,T(\mathbf{u}) + d\,T(\mathbf{v}).$

Feed $(3,1)$ through it and the property lets you pull the transformation apart:

$T(3\mathbf{e}_1 + \mathbf{e}_2) = 3\,T(\mathbf{e}_1) + T(\mathbf{e}_2).$

Read that again. To know what $T$ does to any vector, you only need to know $T(\mathbf{e}_1)$ and $T(\mathbf{e}_2)$ ; the vector's coordinates carry over unchanged as the weights on those two images. So if I tell you $\mathbf{e}_1$ lands on $(2, 1)$ and $\mathbf{e}_2$ lands on $(-1, 1)$ , you can transform anything.

And where do we store those two landing spots? As the columns of a matrix:

$A = \begin{bmatrix} 2 & -1 \\ 1 & 1 \end{bmatrix}, \qquad A\mathbf{e}_1 = \begin{bmatrix} 2 \\ 1 \end{bmatrix}, \qquad A\mathbf{e}_2 = \begin{bmatrix} -1 \\ 1 \end{bmatrix}.$

That is the entire secret of matrix–vector multiplication: the columns of $A$ are the images of the basis vectors, and $A\mathbf{x}$ just recombines those columns using $\mathbf{x}$ 's coordinates as the weights. A matrix is a table of "where the basis lands."
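This reading is easy to check numerically. The following is a minimal sketch in NumPy using the matrix from the text; the variable names are illustrative only. Multiplying by a standard basis vector returns a column, and multiplying by $(3, 1)$ returns the same weighted sum of columns.

```python
import numpy as np

# Columns are the landing spots of e1 and e2 given in the text.
A = np.array([[2.0, -1.0],
              [1.0,  1.0]])
e1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])

print(A @ e1)   # first column of A:  [2. 1.]
print(A @ e2)   # second column of A: [-1. 1.]

# A @ x recombines the columns using the coordinates of x as weights.
x = np.array([3.0, 1.0])
by_matmul = A @ x
by_columns = x[0] * A[:, 0] + x[1] * A[:, 1]
print(by_matmul, by_columns)   # both are [5. 4.]
```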

Interactive Lab: Matrix Transformation Visualizer

Drag the columns above and watch the whole grid deform with them. Try to make the two columns point along the same line — the grid collapses from a 2-D sheet onto a 1-D line, and a whole dimension of space is crushed to nothing. Hold that picture; it is exactly what rank deficiency means, and we are about to name it.

Formal definitions

Symbol	Meaning	Type	Shape	Role
$\operatorname{span}\{\mathbf{v}_i\}$	All linear combinations of the vectors	set	subspace	derived
$A$	A matrix / linear transformation	matrix	m×n	variable
$\operatorname{Col}(A)$	Column space (span of columns; reachable outputs)	set	⊆ ℝ^m	derived
$\operatorname{Null}(A)$	Null space (inputs mapped to 0)	set	⊆ ℝ^n	derived
$\operatorname{rank}(A)$	Dimension of the column space	integer	1	derived
$\dim V$	Dimension (size of any basis of V)	integer	1	derived

These definitions are tied together by one clean accounting identity, the rank–nullity theorem, for an $m \times n$ matrix:

$\operatorname{rank}(A) \;+\; \dim \operatorname{Null}(A) \;=\; n \qquad (10.1)$

Read it as conservation of dimensions. You start with $n$ input dimensions. The transformation routes some of them into genuinely distinct outputs (that count is the rank) and collapses the rest to zero (that count is the nullity). Nothing appears or disappears; the $n$ input dimensions are split between "kept" and "crushed."
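The bookkeeping can be checked on a small example. The sketch below is one possible way to do it, assuming we read both the rank and a null-space basis off a full singular value decomposition; the $3 \times 4$ matrix and the tolerance formula are illustrative choices, not the only valid ones.

```python
import numpy as np

# A 3x4 example: the last two columns are combinations of the first two
# (col2 = col0 + col1, col3 = 2*col0 + col1).
A = np.array([[1.0, 0.0, 1.0, 2.0],
              [0.0, 1.0, 1.0, 1.0],
              [1.0, 1.0, 2.0, 3.0]])
m, n = A.shape

# Full SVD: the rows of Vt beyond the rank span the null space.
U, s, Vt = np.linalg.svd(A)
tol = s.max() * max(m, n) * np.finfo(A.dtype).eps
rank = int(np.sum(s > tol))
null_basis = Vt[rank:].T           # shape (n, n - rank)
nullity = null_basis.shape[1]

print("rank:", rank, "nullity:", nullity, "n:", n)   # rank: 2 nullity: 2 n: 4
print(np.allclose(A @ null_basis, 0.0))              # True: those inputs are crushed
```

Four input dimensions go in; two survive as distinct outputs and two are sent to zero.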

Numerical example: are these vectors independent, and what do they span?

Take $\mathbf{v}_1 = (1, 2)$ and $\mathbf{v}_2 = (2, 4)$ . Ask the question that defines linear independence: is there a choice of $c_1, c_2$ , not both zero, with $c_1\mathbf{v}_1 + c_2\mathbf{v}_2 = \mathbf{0}$ ? Notice $\mathbf{v}_2 = 2\mathbf{v}_1$ , so $2\mathbf{v}_1 - \mathbf{v}_2 = \mathbf{0}$ with $c_1 = 2,\ c_2 = -1$ , a nontrivial solution. The pair is linearly dependent. Their span is not the plane but a single line, the direction $(1, 2)$ : every combination $c_1(1,2) + c_2(2,4) = (c_1 + 2c_2)(1, 2)$ is just a rescaling of one direction.

Now nudge the second vector: $\mathbf{v}_2 = (2, 5)$ . Suppose $c_1(1,2) + c_2(2,5) = (0,0)$ . The two component equations are $c_1 + 2c_2 = 0$ and $2c_1 + 5c_2 = 0$ . From the first, $c_1 = -2c_2$ ; substitute into the second: $2(-2c_2) + 5c_2 = c_2 = 0$ , hence $c_1 = 0$ too. Only the trivial solution — the vectors are independent, they span all of $\mathbb{R}^2$ , and together they form a basis of the plane. Stacked as columns, the first matrix has rank 1 and the second has rank 2. Independence and rank are the same question asked two ways.
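The two hand calculations can be confirmed in a few lines. This is a small sketch, assuming NumPy's np.linalg.matrix_rank as the rank routine; the names dependent and independent are just labels for the two column pairs.

```python
import numpy as np

v1 = np.array([1.0, 2.0])
dependent = np.column_stack([v1, np.array([2.0, 4.0])])
independent = np.column_stack([v1, np.array([2.0, 5.0])])

# The nontrivial combination 2*v1 - v2 = 0 from the text.
print(dependent @ np.array([2.0, -1.0]))      # [0. 0.]

print(np.linalg.matrix_rank(dependent))       # 1: the columns span a line
print(np.linalg.matrix_rank(independent))     # 2: the columns span the plane
```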

Why the columns are the images of the basis vectors

columns of A = where the basis lands

Write $\mathbf{x} = (x_1, \ldots, x_n)$ in the standard basis, so $\mathbf{x} = x_1\mathbf{e}_1 + \cdots + x_n\mathbf{e}_n$ . Apply $A$ and use linearity:

$A\mathbf{x} = A\!\left(\sum_{j=1}^{n} x_j \mathbf{e}_j\right) = \sum_{j=1}^{n} x_j\,(A\mathbf{e}_j).$

Now compute $A\mathbf{e}_j$ . Because $\mathbf{e}_j$ has a 1 in slot $j$ and zeros elsewhere, the matrix–vector product picks out exactly the $j$ -th column of $A$ ; call it $\mathbf{a}_j$ . Therefore

$A\mathbf{x} = x_1\mathbf{a}_1 + x_2\mathbf{a}_2 + \cdots + x_n\mathbf{a}_n.$

Two consequences fall out immediately. First, $A\mathbf{e}_j = \mathbf{a}_j$ : the $j$ -th column is the image of the $j$ -th basis vector, that is, where $\mathbf{e}_j$ lands. Second, every output $A\mathbf{x}$ is a linear combination of the columns, so the set of all outputs is precisely $\operatorname{span}$ of the columns, which is why we named it the column space. The rank, the number of independent columns, is thus the dimension of the output sheet the transformation can reach.

This also explains the collapse you saw in the lab. If two columns are dependent (one is a multiple of the other), they span only a line, so $\operatorname{rank} = 1$ : no matter what input you feed, the output is stuck on that line. The other input direction — the combination that the dependent columns cancel — is sent to $\mathbf{0}$ and lives in the null space. Rank-deficient means some direction of input never comes out the far side.
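To see the collapse in numbers rather than in a picture, here is a short sketch. The $2 \times 2$ matrix and the random inputs are hypothetical choices made only for illustration.

```python
import numpy as np

# Dependent columns: the second is twice the first.
A = np.array([[1.0, 2.0],
              [2.0, 4.0]])

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 5))     # five arbitrary inputs, one per column
Y = A @ X                       # their images

# Every output is a multiple of (1, 2): second coordinate = 2 * first.
print(np.allclose(Y[1], 2 * Y[0]))       # True

# The crushed direction: 2*col0 - col1 = 0, so (2, -1) is in the null space.
print(A @ np.array([2.0, -1.0]))         # [0. 0.]
```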

ML use case: rank bounds capacity, and low-rank is a feature

A dense layer computes $\mathbf{y} = W\mathbf{x}$ (plus a bias and a nonlinearity). Everything the linear part can express is confined to $\operatorname{Col}(W)$ , whose dimension is $\operatorname{rank}(W)$ . So the rank is a hard ceiling on representational capacity: a layer mapping $\mathbb{R}^{1000} \to \mathbb{R}^{1000}$ but with $\operatorname{rank}(W) = 10$ can only ever produce outputs in a 10-dimensional sheet, no matter how the inputs vary. It has 990 directions of null space — 990 input combinations it silently erases. Low rank means lost information.
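A scaled-down version of that claim can be tested directly. The sketch below assumes a $100 \times 100$ matrix of rank 5 (instead of 1000 and 10, to keep the computation fast) built from random thin factors; the sizes, the seed, and the construction are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 100, 5                      # smaller than the 1000 / 10 in the text

# Build a d x d matrix of rank k as a product of two thin random factors.
W = rng.normal(size=(d, k)) @ rng.normal(size=(k, d))

X = rng.normal(size=(d, 200))      # 200 random inputs, one per column
Y = W @ X                          # the corresponding outputs

rank_W = np.linalg.matrix_rank(W)
rank_Y = np.linalg.matrix_rank(Y)
print(rank_W)        # 5
print(rank_Y)        # 5: 200 outputs, but they fill only a 5-D subspace
print(d - rank_W)    # 95 null-space directions, by rank-nullity
```

However many inputs are fed in, the stacked outputs never exceed the rank of the weight matrix.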

That loss is sometimes exactly what you want. Two ideas across modern ML lean on it deliberately:

Embeddings as a low-dimensional basis. A vocabulary of 50,000 tokens does not need 50,000 independent directions; meaning lives on a much lower-dimensional manifold. An embedding table maps each token to, say, a 256-dimensional vector — choosing a small basis in which "similar" tokens sit close together. The whole premise is that the useful information is low-rank.
Low-rank factorization (the LoRA intuition). A full weight update $\Delta W$ of shape $d \times d$ has $d^2$ parameters. But if the useful update is low-rank, we can write $\Delta W = BA$ where $B$ is $d \times r$ and $A$ is $r \times d$ with $r \ll d$ . Since $\operatorname{rank}(BA) \le r$ , this factorization cannot exceed rank $r$ — and that is the point: it captures the low-rank part of the adaptation with $2dr$ parameters instead of $d^2$ . When $d = 4096$ and $r = 8$ , that is a 256× reduction. The bet, borne out in practice, is that fine-tuning lives in a low-rank subspace, so the discarded directions were not carrying much.
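The parameter arithmetic and the rank bound can both be checked in a few lines. This is a sketch under stated assumptions: the factors are filled with random numbers purely to illustrate the rank bound (it is not how any particular LoRA implementation initializes them), and the rank check uses a smaller hypothetical size so the decomposition stays cheap.

```python
import numpy as np

# Parameter counts for the sizes quoted in the text.
d, r = 4096, 8
full_params = d * d                  # 16_777_216
lora_params = 2 * d * r              # 65_536
print(full_params // lora_params)    # 256

# Rank check on a smaller, hypothetical size.
rng = np.random.default_rng(0)
d_small, r_small = 64, 4
B = rng.normal(size=(d_small, r_small))
A = rng.normal(size=(r_small, d_small))
delta_W = B @ A
rank_delta = np.linalg.matrix_rank(delta_W)
print(delta_W.shape, rank_delta)     # (64, 64) 4
```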

The through-line: rank is the currency of information a linear map moves. Bound it low to save memory and it costs you expressiveness; that trade is favorable exactly when the signal was low-rank to begin with.

NumPy: measuring rank and testing independence

NumPy computes rank directly with np.linalg.matrix_rank, which counts singular values above a numerical tolerance — the robust way to ask "how many independent directions?" without hand-solving systems. Below we build a rank-deficient matrix on purpose (a third column that is a combination of the first two), confirm its rank, and use rank as an independence test. Run it:

rank_and_independence.py

import numpy as np

# Two independent columns, then a THIRD that is 2*col0 - col1 (dependent).
c0 = np.array([1.0, 0.0, 2.0])
c1 = np.array([0.0, 1.0, 1.0])
c2 = 2 * c0 - c1                       # a linear combination of c0, c1
A = np.column_stack([c0, c1, c2])     # shape (3, 3)

print("A =\n", A)
print("shape:", A.shape)              # (3, 3)

# Rank counts INDEPENDENT columns. Three columns, but one is redundant.
r = np.linalg.matrix_rank(A)
print("rank:", r)                     # 2, not 3 -> rank deficient

# Independence test: a set of columns is independent iff
# rank == number of columns.
def is_independent(M):
    return np.linalg.matrix_rank(M) == M.shape[1]

print("A columns independent? ", is_independent(A))            # False
print("first two independent? ", is_independent(A[:, :2]))     # True

# Rank deficiency means a nonzero null-space vector exists: A @ x = 0.
# Here x = (2, -1, -1) since 2*c0 - 1*c1 - 1*c2 = 0.
x = np.array([2.0, -1.0, -1.0])
print("A @ x =", A @ x)               # [0. 0. 0.]
assert np.linalg.matrix_rank(A) == 2
assert np.allclose(A @ x, 0.0)
print("ok")

The pattern matrix_rank(M) == M.shape[1] is the practical linear-independence test for a set of column vectors: full column rank means no column is redundant. Prefer matrix_rank over checking the determinant — the determinant only works for square matrices and is numerically fragile, whereas rank is defined for any shape and thresholds singular values sensibly.
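One way to see the fragility of the determinant is with a matrix that is obviously invertible but has a vanishing determinant. The example below is a hypothetical illustration, not a general benchmark.

```python
import numpy as np

# 0.1 * I is perfectly well conditioned and has full rank,
# yet its determinant is 0.1**50, which looks like zero.
M = 0.1 * np.eye(50)
det_M = np.linalg.det(M)
rank_M = np.linalg.matrix_rank(M)

print(det_M)     # about 1e-50
print(rank_M)    # 50
```

A naive test such as "is the determinant close to zero?" would wrongly flag this matrix as singular, while the rank reports all 50 directions intact.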

Independence is not orthogonality; full rank is not everything

Two traps worth separating. First, linear independence is weaker than orthogonality. The columns $(1,0)$ and $(1,1)$ are independent (neither is a multiple of the other) and form a valid basis, but they are not perpendicular: their dot product is 1. A set of nonzero, mutually orthogonal vectors is always independent, but independent vectors need not be orthogonal. A basis need only be independent, not orthogonal, to span a space.
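The first trap takes two lines to verify, using the same pair of vectors. This is a minimal sketch; the names u and v are illustrative.

```python
import numpy as np

u = np.array([1.0, 0.0])
v = np.array([1.0, 1.0])

dot_uv = u @ v
rank_uv = np.linalg.matrix_rank(np.column_stack([u, v]))
print(dot_uv)     # 1.0, so the vectors are not orthogonal
print(rank_uv)    # 2, so they are independent
```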

Second, a unique least-squares solution requires full column rank. The normal-equation solution $\hat{\mathbf{x}} = (A^\top A)^{-1} A^\top \mathbf{b}$ requires $A^\top A$ to be invertible, which happens exactly when $A$ has full column rank (independent columns). If two feature columns are linearly dependent (collinear), the fit is not unique: infinitely many weight vectors give the same predictions, and $A^\top A$ has no inverse at all. When the columns are only nearly dependent, the inverse exists but is numerically unstable. Rank deficiency in your design matrix is the mathematical face of multicollinearity.
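The non-uniqueness is easy to exhibit on a toy design matrix. The sketch below assumes a hypothetical three-row dataset whose second feature is exactly twice the first, and it passes an explicit rcond cutoff to np.linalg.lstsq (an illustrative choice) so that the tiny singular value is treated as zero.

```python
import numpy as np

# Second feature is exactly twice the first: perfectly collinear columns.
A = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])
b = np.array([1.0, 2.0, 3.0])

gram_rank = np.linalg.matrix_rank(A.T @ A)
print(gram_rank)                     # 1, so A^T A is not invertible

# Two different weight vectors, identical predictions.
w1 = np.array([1.0, 0.0])
w2 = np.array([0.0, 0.5])
print(A @ w1, A @ w2)                # both [1. 2. 3.]

# lstsq still returns an answer: the minimum-norm one among all exact fits.
w_min, residuals, rank, sv = np.linalg.lstsq(A, b, rcond=1e-10)
print(rank, w_min)                   # 1 and approximately [0.2 0.4]
```

The solver reports rank 1 and picks one representative from an entire line of equally good weight vectors, which is why coefficients from a collinear design should not be interpreted individually.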

Summary

A matrix is a linear transformation: it moves all of space, and it is fully determined by where the basis vectors land — those landing spots are its columns. $A\mathbf{e}_j$ is the $j$ -th column.
The span of a set is all its linear combinations. A set is linearly independent when no vector is redundant (the only combination giving $\mathbf{0}$ is trivial). A basis is an independent spanning set; its size is the dimension.
The column space is the span of the columns — the reachable outputs. The null space is the inputs crushed to $\mathbf{0}$ . Rank is the dimension of the column space = number of independent columns.
Rank–nullity: $\operatorname{rank}(A) + \dim\operatorname{Null}(A) = n$ . Input dimensions are conserved, split between "kept" and "crushed."
In ML, rank bounds capacity: a low-rank layer collapses dimensions and loses information. Low-rank factorization (LoRA, embeddings) turns that collapse into a deliberate compression when the signal is genuinely low-rank.
In NumPy, np.linalg.matrix_rank(A) measures rank; rank == n_cols tests independence. Independence $\ne$ orthogonality, and least squares needs full column rank for a unique fit.

Active recall

Answer from memory before checking the lesson:

Where do the standard basis vectors "land" under the matrix $A$ , and how does that relate to the columns of $A$ ?
State the difference between the column space and the null space of $A$ , including which ambient space each lives in.
A $6 \times 4$ matrix has rank 3. What is the dimension of its null space, and why? Which theorem did you use?
Give two vectors that are linearly independent but not orthogonal.
In one sentence, why does replacing a $d \times d$ weight matrix with a rank- $r$ factorization $BA$ ( $r \ll d$ ) save parameters, and what is the risk?

Exercises

Level A: Recall & basic calculation

Level A: Hand calculation (ch10-A1)

Are these two vectors independent?

Are $\mathbf{v}_1 = (2, 3)$ and $\mathbf{v}_2 = (4, 6)$ linearly independent? Enter 1 for independent or 0 for dependent.

Level A: Hand calculation (ch10-A2)

Read the rank off the columns

The matrix $A = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix}$ has how many linearly independent columns, i.e. what is $\operatorname{rank}(A)$ ?

Level A: Equation interpretation (ch10-A3)

Where does e_2 land?

For $A = \begin{bmatrix} 2 & -1 \\ 1 & 3 \end{bmatrix}$ , compute $A\mathbf{e}_2$ where $\mathbf{e}_2 = (0,1)$ . Enter as x, y.

Level A: Hand calculation (ch10-A4)

Rank–nullity arithmetic

A matrix $A$ has 5 columns and $\operatorname{rank}(A) = 3$ . What is $\dim \operatorname{Null}(A)$ ?

Level A: Equation interpretation (ch10-A5)

Dimension of a span

What is the dimension of $\operatorname{span}\{(1,0,0),\ (0,1,0),\ (1,1,0)\}$ in $\mathbb{R}^3$ ?

Level A: Equation interpretation (ch10-A6)

Definition of the null space

The null space of $A$ is the set of vectors $\mathbf{x}$ satisfying which equation?

Level B: Conceptual understanding

Level B: Shape reasoning (ch10-B1)

Column space vs. null space — which space?

Let $A$ be a $3 \times 5$ matrix. The column space $\operatorname{Col}(A)$ is a subspace of which space, and the null space $\operatorname{Null}(A)$ is a subspace of which space?

Level B: Shape reasoning (ch10-B2)

Maximum possible rank

What is the largest value $\operatorname{rank}(A)$ can take for a $4 \times 7$ matrix $A$ ?

Level B: Equation interpretation (ch10-B3)

Independent but not orthogonal

True or false: 'If two vectors are linearly independent, they must be orthogonal.' Enter 1 for true or 0 for false, and be ready to justify.

Level B: ML application (ch10-B4)

Why low rank loses information

A linear layer $\mathbf{y} = W\mathbf{x}$ has $W \in \mathbb{R}^{1000 \times 1000}$ but $\operatorname{rank}(W) = 10$ . In one or two sentences, explain what this implies about the outputs the layer can produce and about information lost from the input.

Level B: Equation interpretation (ch10-B5)

When does Ax = b have a solution?

For a fixed matrix $A$ , the system $A\mathbf{x} = \mathbf{b}$ has at least one solution exactly when:

Level C: Derivation & implementation

Level C: NumPy implementation (ch10-C1)

Compute rank and test independence in NumPy

Build the matrix with columns $(1,2,3)$ , $(2,4,6)$ , and $(0,1,0)$ using np.column_stack. Print its rank with np.linalg.matrix_rank, decide whether the three columns are independent (via rank == n_cols), and print ok.

Level C: Derivation (ch10-C2)

Find a basis for a column space

The matrix $A$ has columns $\mathbf{a}_1 = (1,1,0)$ , $\mathbf{a}_2 = (2,2,0)$ , $\mathbf{a}_3 = (0,0,1)$ . Identify a basis for $\operatorname{Col}(A)$ and state $\operatorname{rank}(A)$ .

Level C: Derivation (ch10-C3)

Find a null-space vector by reasoning

For $A = \begin{bmatrix} 1 & 2 & 3 \\ 0 & 1 & 1 \end{bmatrix}$ , find a nonzero vector $\mathbf{x} = (x_1, x_2, x_3)$ with $A\mathbf{x} = \mathbf{0}$ , and explain why the null space must be nonzero here.

Level D: Research-thinking challenge

Level D: Paper-reading practice (ch10-D1)

Why low-rank adapters work (LoRA)

A LoRA adapter replaces a weight update $\Delta W \in \mathbb{R}^{d\times d}$ with a product $BA$ where $B \in \mathbb{R}^{d\times r}$ , $A \in \mathbb{R}^{r\times d}$ , and $r \ll d$ . (a) Prove $\operatorname{rank}(BA) \le r$ . (b) Count the parameters saved for $d = 4096, r = 8$ . (c) State the empirical hypothesis that makes this a good trade, and one situation where it would fail.

Level D: ML application (ch10-D2)

Rank collapse through stacked layers

Consider a purely linear network $\mathbf{y} = W_3 W_2 W_1 \mathbf{x}$ with each $W_i \in \mathbb{R}^{d\times d}$ . (a) If $\operatorname{rank}(W_2) = k < d$ , what can you say about $\operatorname{rank}(W_3 W_2 W_1)$ ? (b) What does this imply about information flowing through the network, and (c) why do real networks insert nonlinearities between layers?