Part 3 · Core Linear Algebra · Chapter 8 · 85 min

Matrices

Grids of numbers that move space

Prerequisites

Learning objectives

  • Read matrix shapes and index entries with (row, column)
  • Compute matrix–vector and matrix–matrix products by hand
  • Track dimensions through a chain of products
  • Recognize transpose, identity, diagonal, and symmetric matrices

Why matrices run machine learning

Every forward pass of every neural network is, underneath, a sequence of matrix multiplications. A dense layer is $\mathbf{W}\mathbf{x} + \mathbf{b}$. A batch of inputs through that layer is one big matrix product $\mathbf{X}\mathbf{W}^\top$. An attention block computes scores as $\mathbf{Q}\mathbf{K}^\top$. Stack more layers and you are simply chaining more matmuls. If the vector was the noun of linear algebra, the matrix is the verb: it is the object that acts on vectors.

Because of this, the single most valuable skill in this chapter is not arithmetic — NumPy does the arithmetic — it is shape reasoning. Experienced ML engineers debug models by tracking shapes through a computation the way you track types through a function signature. A shape mismatch is the linear-algebra equivalent of a type error, and learning to see it before you run the code is a genuine superpower. We will emphasize it relentlessly.

We meet the matrix two ways at once:

  • As a grid: a rectangular table of numbers with $m$ rows and $n$ columns.
  • As an operator: a machine that eats an $n$-vector and produces an $m$-vector.

Fluency means switching between "table of numbers" and "thing that transforms vectors" without friction.

Intuition: a grid that transforms space

Picture a $2 \times 2$ matrix as a rule for moving every point in the plane. The matrix $\mathbf{A} = \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix}$ stretches everything horizontally by a factor of 2 and leaves the vertical direction alone. Feed it the point $(1, 1)$ and it returns $(2, 1)$. The columns of $\mathbf{A}$ tell you exactly where the two basis arrows land: the first column $(2, 0)$ is where $(1, 0)$ goes, the second column $(0, 1)$ is where $(0, 1)$ goes. That is the whole secret of the operator view: the columns of a matrix are the images of the basis vectors, and everything else follows by linear combination.
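The stretch described above can be checked numerically. This is a minimal sketch assuming NumPy is installed; the names e1 and e2 for the two basis vectors are introduced here for illustration only.

```python
import numpy as np

# The horizontal-stretch matrix from the text.
A = np.array([[2, 0],
              [0, 1]])

point = np.array([1, 1])
image = A @ point            # where the point (1, 1) lands

# Standard basis vectors (names chosen for this sketch).
e1 = np.array([1, 0])
e2 = np.array([0, 1])

print(image)                 # [2 1]
print(A @ e1, A[:, 0])       # image of e1 next to the first column of A
print(A @ e2, A[:, 1])       # image of e2 next to the second column of A
```

Each of the last two lines prints the same pair twice, which is the "columns are the images of the basis vectors" statement in executable form.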

The first lab lets you build a product entry by entry and watch which row and which column each output cell comes from. Play with it before we write the formal rule.

Interactive Lab: Matrix Multiplication Explorer

Notice the pattern: the entry in output row $i$, column $j$ is formed from row $i$ of the left matrix and column $j$ of the right matrix. Hold onto that; it is the definition.

Formal definitions

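A matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ is a rectangular array of real numbers with $m$ rows and $n$ columns; $a_{ij}$ denotes the entry in row $i$, column $j$. The row-first convention is worth burning in: A[i][j] is row $i$, column $j$, and NumPy's A.shape reports (rows, cols) in that order. Swapping the two is one of the most common indexing bugs in ML code.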
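A short sketch of the row-first convention, assuming NumPy; the matrix values are chosen here so that each entry encodes its own position. Note that NumPy counts indices from 0, whereas the mathematical subscripts usually start at 1.

```python
import numpy as np

# A 2 x 3 matrix: 2 rows, 3 columns (values chosen for illustration).
A = np.array([[10, 11, 12],
              [20, 21, 22]])

rows, cols = A.shape         # shape is reported row count first
entry = A[1][2]              # row index 1, column index 2 (counting from 0)
same_entry = A[1, 2]         # equivalent tuple-style indexing

print(rows, cols)            # 2 3
print(entry, same_entry)     # 22 22
```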

Addition and scalar multiplication

Two matrices of the same shape add entry-by-entry, and any matrix scales by a number entry-by-entry: $(\mathbf{A} + \mathbf{B})_{ij} = a_{ij} + b_{ij}$ and $(c\mathbf{A})_{ij} = c\,a_{ij}$.

These are exactly the vector rules applied to a grid. Like vector addition, matrix addition demands identical shapes: a $2 \times 3$ and a $3 \times 2$ cannot be added.
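The entry-by-entry rules, and the failure on mismatched shapes, can be sketched in NumPy as follows. The matrices are made up for the example.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])          # shape (2, 3)
B = np.array([[10, 20, 30],
              [40, 50, 60]])       # shape (2, 3)
C = np.ones((3, 2))                # shape (3, 2)

total = A + B                      # entry-by-entry sum
scaled = 3 * A                     # every entry multiplied by 3

try:
    A + C                          # (2, 3) + (3, 2): shapes are incompatible
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(total)
print(scaled)
print(mismatch_raised)             # True
```

One caveat worth knowing: NumPy's broadcasting does accept some unequal shapes, such as (2, 3) plus (3,), so the absence of an error is not proof that two shapes were identical.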

The matrix–vector product

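For $\mathbf{A} \in \mathbb{R}^{m \times n}$ and $\mathbf{x} \in \mathbb{R}^{n}$, the product $\mathbf{A}\mathbf{x} \in \mathbb{R}^{m}$ has entries $(\mathbf{A}\mathbf{x})_i = \sum_{j=1}^{n} a_{ij} x_j$. There are two equally important ways to read this formula, and switching between them is the heart of the chapter.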

  • Row view (dot products). Entry $i$ of $\mathbf{A}\mathbf{x}$ is the dot product of row $i$ of $\mathbf{A}$ with $\mathbf{x}$. So $\mathbf{A}\mathbf{x}$ stacks $m$ dot products.
  • Column view (linear combination). $\mathbf{A}\mathbf{x}$ is a weighted sum of the columns of $\mathbf{A}$, with the entries of $\mathbf{x}$ as weights. We derive this below.
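Both readings can be computed separately and compared with the built-in product. A minimal sketch, assuming NumPy and a small matrix and vector chosen for the example:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])          # shape (2, 3), so m = 2 and n = 3
x = np.array([1, 0, 2])            # an n-vector

# Row view: one dot product per row of A.
row_view = np.array([A[i, :] @ x for i in range(A.shape[0])])

# Column view: columns of A weighted by the entries of x.
col_view = sum(x[j] * A[:, j] for j in range(A.shape[1]))

print(row_view)                    # [ 7 16]
print(col_view)                    # [ 7 16]
print(A @ x)                       # [ 7 16]
```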

The matrix–matrix product

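For $\mathbf{A} \in \mathbb{R}^{m \times n}$ and $\mathbf{B} \in \mathbb{R}^{n \times p}$, the product $\mathbf{A}\mathbf{B} \in \mathbb{R}^{m \times p}$ has entries $(\mathbf{A}\mathbf{B})_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$: the inner dimensions (both $n$) must match, and the outer dimensions ($m$ and $p$) give the shape of the result. This inner/outer rule is the one you will use hundreds of times. Read a chain left to right and cross out each matching inner pair: $(m \times n)(n \times p)(p \times q) \to (m \times q)$. If at any junction the inner numbers disagree, the product does not exist, and no arithmetic is required to know it fails.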
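Shape tracking through a chain can be rehearsed without caring about the values at all. The sketch below assumes NumPy and uses zero-filled arrays with sizes invented for the example.

```python
import numpy as np

A = np.zeros((2, 5))
B = np.zeros((5, 3))
C = np.zeros((3, 4))

chain = A @ B @ C                  # (2 x 5)(5 x 3)(3 x 4) -> (2 x 4)
print(chain.shape)                 # (2, 4)

try:
    A @ C                          # (2 x 5)(3 x 4): inner 5 and 3 disagree
    failed = False
except ValueError:
    failed = True
print(failed)                      # True
```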

Transpose, identity, diagonal, symmetric, trace

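The transpose $\mathbf{A}^\top$ swaps rows and columns, $(\mathbf{A}^\top)_{ij} = a_{ji}$, so an $m \times n$ matrix becomes $n \times m$; transposing a product reverses its order, $(\mathbf{A}\mathbf{B})^\top = \mathbf{B}^\top\mathbf{A}^\top$.

A few named square matrices, plus one number computed from a square matrix, earn their own vocabulary: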

  • The identity $\mathbf{I}_n$ has $1$s on the diagonal and $0$s elsewhere; it is the "do nothing" operator: $\mathbf{I}\mathbf{x} = \mathbf{x}$ and $\mathbf{A}\mathbf{I} = \mathbf{A}$.
  • A diagonal matrix is zero off the diagonal; multiplying by it just scales each coordinate independently.
  • A symmetric matrix satisfies $\mathbf{A} = \mathbf{A}^\top$ (so $a_{ij} = a_{ji}$). Covariance matrices and Gram matrices $\mathbf{X}^\top\mathbf{X}$ are always symmetric.
  • The trace $\operatorname{tr}(\mathbf{A}) = \sum_{i} a_{ii}$ sums the diagonal. It is only defined for square matrices.
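The vocabulary above maps onto a handful of NumPy calls. A minimal sketch, with all matrices chosen here purely for illustration:

```python
import numpy as np

A = np.array([[1., 2.],
              [3., 4.]])
I = np.eye(2)                      # 2 x 2 identity
D = np.diag([2., 5.])              # diagonal matrix with 2 and 5 on the diagonal
X = np.array([[1., 2.],
              [0., 1.],
              [3., 1.]])           # shape (3, 2)

print(np.array_equal(A @ I, A))    # True: the identity does nothing
print(D @ np.array([1., 1.]))      # [2. 5.]: each coordinate scaled on its own
G = X.T @ X                        # Gram matrix, shape (2, 2)
print(np.array_equal(G, G.T))      # True: G is symmetric
print(X.T.shape)                   # (2, 3): transpose swaps the two dimensions
print(np.trace(A))                 # 5.0: sum of the diagonal entries 1 and 4
```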

A 2×2 product, worked by hand
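As a hedged illustration (the two matrices below are chosen here as an example, not taken from the lesson), each output entry is a row of the left factor dotted with a column of the right factor:

```latex
\begin{aligned}
\mathbf{A}\mathbf{B}
 &= \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}
    \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix}
  = \begin{bmatrix}
      1\cdot 5 + 2\cdot 7 & 1\cdot 6 + 2\cdot 8 \\
      3\cdot 5 + 4\cdot 7 & 3\cdot 6 + 4\cdot 8
    \end{bmatrix}
  = \begin{bmatrix} 19 & 22 \\ 43 & 50 \end{bmatrix}, \\[4pt]
\mathbf{B}\mathbf{A}
 &= \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix}
    \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}
  = \begin{bmatrix}
      5\cdot 1 + 6\cdot 3 & 5\cdot 2 + 6\cdot 4 \\
      7\cdot 1 + 8\cdot 3 & 7\cdot 2 + 8\cdot 4
    \end{bmatrix}
  = \begin{bmatrix} 23 & 34 \\ 31 & 46 \end{bmatrix}.
\end{aligned}
```

The two results differ, which is a concrete instance of matrix multiplication not being commutative.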

Derivation: Ax is a linear combination of the columns

The column view is the one that makes neural networks click, so let us derive it. Write $\mathbf{A}$ by its columns $\mathbf{a}_1, \ldots, \mathbf{a}_n$ (each an $m$-vector), and $\mathbf{x} = (x_1, \ldots, x_n)$.
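One way to carry out the derivation, sketched from the standard entry formula $(\mathbf{A}\mathbf{x})_i = \sum_{j=1}^{n} a_{ij} x_j$ and the observation that entry $i$ of column $j$ is $(\mathbf{a}_j)_i = a_{ij}$:

```latex
\begin{aligned}
(\mathbf{A}\mathbf{x})_i
  &= \sum_{j=1}^{n} a_{ij}\, x_j
   = \sum_{j=1}^{n} x_j\, (\mathbf{a}_j)_i
   = \Bigl( \sum_{j=1}^{n} x_j\, \mathbf{a}_j \Bigr)_i
   \quad \text{for every } i = 1, \ldots, m, \\[4pt]
\mathbf{A}\mathbf{x}
  &= x_1 \mathbf{a}_1 + x_2 \mathbf{a}_2 + \cdots + x_n \mathbf{a}_n .
\end{aligned}
```

Because the two sides agree in every entry, they are the same vector: the entries of $\mathbf{x}$ are the weights, and the columns of $\mathbf{A}$ are what gets weighted.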

Two readings of the same product, then: rows dotted with $\mathbf{x}$ (compute each output number) versus columns combined by $\mathbf{x}$ (understand what the matrix does to space). Keep both.

ML use case: layers, batches, and attention are all shapes
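The three cases named in the heading can be sketched with random arrays. Everything below is an assumption made for illustration: the sizes are invented, the weight matrix is stored with shape (outputs, inputs), each row of X is one input, and the scaling and softmax steps of real attention are left out.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, batch = 4, 3, 8       # hypothetical layer and batch sizes
seq, d_k = 5, 16                   # hypothetical sequence length and key width

W = rng.normal(size=(d_out, d_in)) # weights: maps d_in-vectors to d_out-vectors
b = rng.normal(size=(d_out,))      # bias, one entry per output
x = rng.normal(size=(d_in,))       # a single input
X = rng.normal(size=(batch, d_in)) # a batch: one input per row

single = W @ x + b                 # (d_out, d_in)(d_in,) -> (d_out,)
batched = X @ W.T + b              # (batch, d_in)(d_in, d_out) -> (batch, d_out)

Q = rng.normal(size=(seq, d_k))    # queries, one per row
K = rng.normal(size=(seq, d_k))    # keys, one per row
scores = Q @ K.T                   # (seq, d_k)(d_k, seq) -> (seq, seq)

print(single.shape, batched.shape, scores.shape)   # (3,) (8, 3) (5, 5)
```

With inputs stored as rows, $\mathbf{X}\mathbf{W}$ would pair inner dimensions 4 and 3 and fail, which is why the transpose appears in the batched form.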

Every one of these is the same operation. Master the shape rule once and you read model code like prose.

The transformation view

The second lab shows the operator picture directly: a $2 \times 2$ matrix bends the plane, and you watch a shape and the basis arrows move as you edit the four entries. Try making a pure rotation, a shear, and a reflection, and watch where the columns (the basis-vector images) land.

Interactive Lab: Matrix Transformation Visualizer

Set the matrix to the identity $\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$ and nothing moves: the "do nothing" operator. Make it diagonal and each axis scales on its own. This is the same $\mathbf{A}$ from the grid view, seen as an action.
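The same experiments can be run away from the lab. The sketch below assumes NumPy; the particular rotation angle, shear, and reflection are examples picked here, and the unit square stands in for the shape drawn in the visualizer.

```python
import numpy as np

theta = np.pi / 2                                  # 90-degree rotation angle
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
shear = np.array([[1.0, 1.0],
                  [0.0, 1.0]])                     # slides points sideways in proportion to height
reflection = np.array([[1.0,  0.0],
                       [0.0, -1.0]])               # flips across the horizontal axis

# Corners of the unit square, one point per column.
square = np.array([[0.0, 1.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0, 1.0]])

for name, M in [("rotation", rotation), ("shear", shear), ("reflection", reflection)]:
    moved = M @ square                             # transform all four corners at once
    print(name, np.round(moved, 3).tolist())
```

Stacking the points as columns lets one matrix product move the whole shape, and in each case the first and second columns of the matrix are where $(1, 0)$ and $(0, 1)$ end up.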

NumPy: three ways to the same product

In NumPy a matrix is a 2-D array; A.shape is (rows, cols). Let us compute a matrix product with an explicit triple loop (what the entry formula says literally), then with the @ operator, and confirm they agree. Run it:

matmul.py
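A minimal sketch of what such a script might look like, assuming NumPy. The helper name matmul_loops and the example matrices are hypothetical, and the third route shown (np.einsum) is an assumption about what a third way could be, since only the loop and the @ operator are named in the text.

```python
import numpy as np


def matmul_loops(A, B):
    """Multiply A (m x n) by B (n x p) with the entry formula, one scalar at a time."""
    m, n = A.shape
    n_b, p = B.shape
    if n != n_b:
        raise ValueError(f"inner dimensions disagree: {n} vs {n_b}")
    C = np.zeros((m, p))
    for i in range(m):              # output row
        for j in range(p):          # output column
            for k in range(n):      # shared inner index
                C[i, j] += A[i, k] * B[k, j]
    return C


# Small example matrices, chosen for this sketch.
A = np.array([[1., 2., 3.],
              [4., 5., 6.]])        # shape (2, 3)
B = np.array([[1., 0.],
              [0., 1.],
              [2., 2.]])            # shape (3, 2)

via_loops = matmul_loops(A, B)
via_operator = A @ B
via_einsum = np.einsum("ik,kj->ij", A, B)     # third route, assumed for this sketch

print(via_loops)
print(np.allclose(via_loops, via_operator))   # True
print(np.allclose(via_loops, via_einsum))     # True
```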

The triple loop costs $O(mnp)$ scalar multiply-adds, and it is only for seeing the formula. In real code you write A @ B: for floating-point arrays NumPy dispatches to a tuned BLAS routine that runs the same arithmetic in compiled, cache-aware code, typically orders of magnitude faster than a Python loop. Prefer @; reach for loops only to explain.

Summary

  • A matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ is both a grid of numbers (entry $a_{ij}$ = A[i][j], row first) and an operator taking $n$-vectors to $m$-vectors.
  • Same-shape matrices add and scale entry-by-entry.
  • $\mathbf{A}\mathbf{x}$ reads two ways: as $m$ dot products of rows with $\mathbf{x}$, and as a linear combination of the columns of $\mathbf{A}$ weighted by $\mathbf{x}$.
  • $\mathbf{A}\mathbf{B}$ needs matching inner dimensions, $(m \times n)\cdot(n \times p) \to (m \times p)$; track shapes through a chain by cancelling adjacent inner pairs. It is not commutative.
  • Transpose flips shape ($m \times n \to n \times m$) and reverses products; the identity does nothing, diagonal matrices scale axes, symmetric means $\mathbf{A} = \mathbf{A}^\top$, and the trace sums the diagonal.
  • In ML a layer is $\mathbf{W}\mathbf{x} + \mathbf{b}$, a batch is $\mathbf{X}\mathbf{W}^\top + \mathbf{b}$, and attention scores are $\mathbf{Q}\mathbf{K}^\top$, all read off by the shape rule. Use A @ B and keep (n,) distinct from (n, 1).

Active recall

Answer from memory before checking the lesson:

  1. A has shape (4, 7) and B has shape (7, 3). What is the shape of A @ B? What about B @ A?
  2. Give the two readings of $\mathbf{A}\mathbf{x}$: one in terms of rows, one in terms of columns.
  3. Why does a batched dense layer use $\mathbf{X}\mathbf{W}^\top$ rather than $\mathbf{X}\mathbf{W}$? Reason purely from shapes.
  4. What is $\operatorname{tr}(\mathbf{A})$, and for which matrices is it defined?
  5. In NumPy, what does A[i][j] mean: row $i$ column $j$, or column $i$ row $j$?

Exercises

Level A: Recall & basic calculation

Level B: Conceptual understanding

Level C: Derivation & implementation

Level D: Research-thinking challenge