Matrices

Why matrices run machine learning

The forward pass of almost every neural network is, underneath, a sequence of matrix multiplications with simple elementwise functions in between. A dense layer is $\mathbf{W}\mathbf{x} + \mathbf{b}$. A batch of inputs through that layer is one big matrix product $\mathbf{X}\mathbf{W}^\top$. An attention block computes scores as $\mathbf{Q}\mathbf{K}^\top$. Stack more layers and you are chaining more matmuls. If the vector was the noun of linear algebra, the matrix is the verb: it is the object that acts on vectors.

Because of this, the single most valuable skill in this chapter is not arithmetic — NumPy does the arithmetic — it is shape reasoning. Experienced ML engineers debug models by tracking shapes through a computation the way you track types through a function signature. A shape mismatch is the linear-algebra equivalent of a type error, and learning to see it before you run the code is a genuine superpower. We will emphasize it relentlessly.

We meet the matrix two ways at once:

As a grid: a rectangular table of numbers with $m$ rows and $n$ columns.
As an operator: a machine that eats an $n$-vector and produces an $m$-vector.

Fluency means switching between "table of numbers" and "thing that transforms vectors" without friction.

Intuition: a grid that transforms space

Picture a $2 \times 2$ matrix as a rule for moving every point in the plane. The matrix $\mathbf{A} = \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix}$ stretches everything horizontally by a factor of 2 and leaves the vertical direction alone. Feed it the point $(1, 1)$ and it returns $(2, 1)$ . The columns of $\mathbf{A}$ tell you exactly where the two basis arrows land: the first column $(2, 0)$ is where $(1, 0)$ goes, the second column $(0, 1)$ is where $(0, 1)$ goes. That is the whole secret of the operator view — the columns of a matrix are the images of the basis vectors, and everything else follows by linear combination.
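The claim that the columns are the images of the basis vectors can be checked numerically. The snippet below is a minimal sketch using the stretch matrix from the paragraph above; the variable names (`e1`, `e2`, `point`, `image`) are illustrative choices, not part of the lesson.

```python
import numpy as np

# The horizontal-stretch matrix from the text.
A = np.array([[2.0, 0.0],
              [0.0, 1.0]])

e1 = np.array([1.0, 0.0])     # first basis vector
e2 = np.array([0.0, 1.0])     # second basis vector
point = np.array([1.0, 1.0])

# Each basis vector lands on the matching column of A.
assert np.array_equal(A @ e1, A[:, 0])
assert np.array_equal(A @ e2, A[:, 1])

# (1, 1) = 1*e1 + 1*e2, so its image is 1*(column 0) + 1*(column 1).
image = A @ point
assert np.array_equal(image, A[:, 0] + A[:, 1])
print(image)  # [2. 1.]
```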

The first lab lets you build a product entry by entry and watch which row and which column each output cell comes from. Play with it before we write the formal rule.

Interactive Lab: Matrix Multiplication Explorer

Notice the pattern: the entry in output row $i$ , column $j$ is formed from row $i$ of the left matrix and column $j$ of the right matrix. Hold onto that; it is the definition.

Formal definitions

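A matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ is a rectangular array of real numbers with $m$ rows and $n$ columns; $a_{ij}$ denotes the entry in row $i$, column $j$. The row-first convention is worth burning in: A[i][j] is row $i$, column $j$, and NumPy's A.shape reports (rows, cols) in that order. Swapping the two is a very common indexing bug in ML code. Note that the math notation counts from 1 while NumPy counts from 0, so $a_{11}$ is A[0][0].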

Symbol	Meaning	Type	Shape	Role
$\mathbf{A}$	A matrix	matrix	m×n	variable
$a_{ij}$	Entry in row i, column j (a number)	scalar	1	variable
$\mathbf{A}^\top$	Transpose (rows become columns)	matrix	n×m	variable
$\mathbf{I}_n$	n×n identity matrix	matrix	n×n	fixed
$\operatorname{tr}(\mathbf{A})$	Trace: sum of diagonal entries	scalar	1	operation
$\mathbf{A}\mathbf{x}$	Matrix–vector product	vector	m×1	operation
$\mathbf{A}\mathbf{B}$	Matrix–matrix product	matrix	m×p	operation
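To make the row-first convention concrete, here is a short sketch with a made-up 2×3 array. It also uses the `A[i, j]` spelling, which selects the same entry as `A[i][j]`.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])      # 2 rows, 3 columns

rows, cols = A.shape           # (2, 3): rows first, then columns
entry = A[0, 2]                # row 0, column 2 (0-based), i.e. a_13 in math notation
print(rows, cols, entry)       # 2 3 3

# The transpose swaps the two shape entries and the two indices.
assert A.T.shape == (3, 2)
assert A.T[2, 0] == A[0, 2]

# A[i][j] and A[i, j] select the same entry.
assert A[1][0] == A[1, 0] == 4
```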

Addition and scalar multiplication

Two matrices of the same shape add entry-by-entry, and any matrix scales by a number entry-by-entry:

$(\mathbf{A} + \mathbf{B})_{ij} = a_{ij} + b_{ij}, \qquad (c\,\mathbf{A})_{ij} = c\,a_{ij}$ (8.1)

These are exactly the vector rules applied to a grid. Like vector addition, matrix addition demands identical shapes — a $2 \times 3$ and a $3 \times 2$ cannot be added.
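A quick sketch of these rules in NumPy, with arbitrary example values. The flag `mismatch_detected` is a name introduced here for illustration only.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])   # shape (2, 3)
B = np.ones((2, 3))                # same shape as A

S = A + B        # entry-by-entry sum
D = 2.5 * A      # every entry scaled by 2.5
assert S[1, 2] == A[1, 2] + B[1, 2]
assert D[0, 1] == 2.5 * A[0, 1]

# A (2, 3) and a (3, 2) matrix cannot be added: NumPy raises ValueError.
C = np.ones((3, 2))
try:
    A + C
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
print(mismatch_detected)  # True
```

One caveat worth knowing: NumPy's broadcasting will accept some unequal shapes (for example a `(2, 3)` array plus a `(3,)` array), so the strict same-shape rule of the math is not always enforced by the library.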

The matrix–vector product

For $\mathbf{A} \in \mathbb{R}^{m \times n}$ and $\mathbf{x} \in \mathbb{R}^{n}$, the product $\mathbf{A}\mathbf{x}$ is the $m$-vector with entries $(\mathbf{A}\mathbf{x})_i = \sum_{j=1}^{n} a_{ij} x_j$. There are two equally important ways to read this formula, and switching between them is the heart of the chapter.

Row view (dot products). Entry $i$ of $\mathbf{A}\mathbf{x}$ is the dot product of row $i$ of $\mathbf{A}$ with $\mathbf{x}$. So $\mathbf{A}\mathbf{x}$ stacks $m$ dot products.
Column view (linear combination). $\mathbf{A}\mathbf{x}$ is a weighted sum of the columns of $\mathbf{A}$, with the entries of $\mathbf{x}$ as weights. We derive this below.

The matrix–matrix product

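For $\mathbf{A}$ of shape $m \times n$ and $\mathbf{B}$ of shape $n \times p$, the product $\mathbf{C} = \mathbf{A}\mathbf{B}$ has shape $m \times p$, with entries $c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$: row $i$ of $\mathbf{A}$ dotted with column $j$ of $\mathbf{B}$. The inner dimensions (both $n$) must match, and the outer dimensions ($m$ and $p$) give the shape of the result. This inner/outer rule is the one you will use hundreds of times. Read a chain left to right and cross out each matching inner pair: $(m \times n)(n \times p)(p \times q) \to (m \times q).$ If at any junction the inner numbers disagree, the product does not exist, and no arithmetic is required to know it fails.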
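The inner/outer rule is mechanical enough to write as a tiny shape checker that never touches the numbers. The helper `chain_shape` below is hypothetical, written only for this illustration; it is not a NumPy function.

```python
def chain_shape(*shapes):
    """Return the shape of a matrix product chain, or raise ValueError.

    Each shape is a (rows, cols) pair, listed left to right.
    """
    rows, inner = shapes[0]
    for position, (next_rows, next_cols) in enumerate(shapes[1:], start=1):
        if inner != next_rows:
            raise ValueError(
                f"junction {position}: inner dimensions {inner} and {next_rows} disagree"
            )
        inner = next_cols  # the surviving right-hand dimension
    return (rows, inner)


print(chain_shape((4, 6), (6, 2), (2, 9)))   # (4, 9)

try:
    chain_shape((4, 6), (5, 2))
except ValueError as err:
    print("no product:", err)
```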

Transpose, identity, diagonal, symmetric, trace

The transpose $\mathbf{A}^\top$ swaps rows and columns, $(\mathbf{A}^\top)_{ij} = a_{ji}$, so an $m \times n$ matrix becomes $n \times m$. It reverses products: $(\mathbf{A}\mathbf{B})^\top = \mathbf{B}^\top\mathbf{A}^\top$. Beyond the transpose, a few objects tied to square matrices earn their own vocabulary:

The identity $\mathbf{I}_n$ has $1$s on the diagonal and $0$s elsewhere; it is the "do nothing" operator: $\mathbf{I}\mathbf{x} = \mathbf{x}$ and $\mathbf{A}\mathbf{I} = \mathbf{A}$.
A diagonal matrix is zero off the diagonal; multiplying a vector by it just scales each coordinate independently.
A symmetric matrix satisfies $\mathbf{A} = \mathbf{A}^\top$ (so $a_{ij} = a_{ji}$). Covariance matrices and Gram matrices $\mathbf{X}^\top\mathbf{X}$ are always symmetric.
The trace $\operatorname{tr}(\mathbf{A}) = \sum_{i} a_{ii}$ sums the diagonal. It is only defined for square matrices.
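Each of these has a direct NumPy counterpart. The sketch below uses `np.eye`, `np.diag`, and `np.trace` on small made-up matrices; the specific values are arbitrary.

```python
import numpy as np

I = np.eye(3)                          # 3x3 identity
D = np.diag([2.0, 0.5, -1.0])          # diagonal matrix with these diagonal entries
x = np.array([4.0, 6.0, 8.0])

assert np.array_equal(I @ x, x)        # identity: "do nothing"

# Diagonal matrix: each coordinate is scaled on its own (by 2, 0.5, -1).
assert np.array_equal(D @ x, np.array([8.0, 3.0, -8.0]))

S = np.array([[1.0, 7.0],
              [7.0, 3.0]])
assert np.array_equal(S, S.T)          # symmetric: equal to its own transpose

M = np.array([[1.0, 2.0],
              [3.0, 4.0]])
print(np.trace(M))                     # 5.0 (that is, 1 + 4)
```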

A 2×2 product, worked by hand

Worked Example — multiplying two 2×2 matrices

Let $\mathbf{A} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}, \qquad \mathbf{B} = \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix}.$ Both are $2 \times 2$ , so the inner dimensions match ( $2 = 2$ ) and the product is $2 \times 2$ . Each output entry is row $i$ of $\mathbf{A}$ dotted with column $j$ of $\mathbf{B}$ :

$c_{11} = (1)(5) + (2)(7) = 5 + 14 = 19, \qquad c_{12} = (1)(6) + (2)(8) = 6 + 16 = 22,$ $c_{21} = (3)(5) + (4)(7) = 15 + 28 = 43, \qquad c_{22} = (3)(6) + (4)(8) = 18 + 32 = 50.$

So $\mathbf{A}\mathbf{B} = \begin{bmatrix} 19 & 22 \\ 43 & 50 \end{bmatrix}.$

Check that order matters: $\mathbf{B}\mathbf{A} = \begin{bmatrix} 23 & 34 \\ 31 & 46 \end{bmatrix}$ , which is different. Matrix multiplication is not commutative.
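The hand calculation can be confirmed in a few lines. This is only a check of the worked example above, using the same $\mathbf{A}$ and $\mathbf{B}$.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

AB = A @ B
BA = B @ A
assert np.array_equal(AB, [[19, 22], [43, 50]])
assert np.array_equal(BA, [[23, 34], [31, 46]])
assert not np.array_equal(AB, BA)      # order matters
print("AB != BA")
```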

Derivation: Ax is a linear combination of the columns

The column view is the one that makes neural networks click, so let us derive it. Write $\mathbf{A}$ by its columns $\mathbf{a}_1, \ldots, \mathbf{a}_n$ (each an $m$-vector), and $\mathbf{x} = (x_1, \ldots, x_n)$.

Derivation: from the entry formula to the column combination

Start from the definition, $(\mathbf{A}\mathbf{x})_i = \sum_{j=1}^{n} a_{ij} x_j$. Hold $j$ fixed and look at the whole output vector as $i$ ranges over $1, \ldots, m$. The terms with weight $x_j$ are exactly $x_j$ times the vector $(a_{1j}, a_{2j}, \ldots, a_{mj})$, which is the $j$-th column $\mathbf{a}_j$. Summing over $j$, $\mathbf{A}\mathbf{x} = \sum_{j=1}^{n} x_j\, \mathbf{a}_j = x_1 \mathbf{a}_1 + x_2 \mathbf{a}_2 + \cdots + x_n \mathbf{a}_n.$ So $\mathbf{A}\mathbf{x}$ is a linear combination of the columns of $\mathbf{A}$, weighted by the entries of $\mathbf{x}$. This is why we said the columns are where the basis vectors land: taking $\mathbf{x} = (1, 0, \ldots, 0)$ gives $\mathbf{A}\mathbf{x} = \mathbf{a}_1$ exactly.

Two readings of the same product, then: rows dotted with $\mathbf{x}$ (compute each output number) versus columns combined by $\mathbf{x}$ (understand what the matrix does to space). Keep both.
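A small numeric check of the two readings may help. The matrix and vector below are made up for illustration and do not come from the lesson; both views arrive at the same vector.

```latex
\mathbf{A} = \begin{bmatrix} 2 & 1 \\ 0 & 3 \end{bmatrix}, \qquad \mathbf{x} = (4, 5)

% Row view: one dot product per output entry
(\mathbf{A}\mathbf{x})_1 = (2)(4) + (1)(5) = 13, \qquad
(\mathbf{A}\mathbf{x})_2 = (0)(4) + (3)(5) = 15

% Column view: columns of A weighted by the entries of x
\mathbf{A}\mathbf{x}
  = 4 \begin{bmatrix} 2 \\ 0 \end{bmatrix} + 5 \begin{bmatrix} 1 \\ 3 \end{bmatrix}
  = \begin{bmatrix} 8 \\ 0 \end{bmatrix} + \begin{bmatrix} 5 \\ 15 \end{bmatrix}
  = \begin{bmatrix} 13 \\ 15 \end{bmatrix}
```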

ML use case: layers, batches, and attention are all shapes

ML Connection

A dense layer. One input $\mathbf{x} \in \mathbb{R}^{n}$ through a layer with weight matrix $\mathbf{W} \in \mathbb{R}^{m \times n}$ and bias $\mathbf{b} \in \mathbb{R}^{m}$ produces $\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b} \in \mathbb{R}^{m}.$ Shape check: $(m \times n)\cdot(n \times 1) \to (m \times 1)$ , then add the $(m \times 1)$ bias. The layer maps $n$ features to $m$ outputs — the operator view.

A batch. We rarely push one example at a time. Stack $N$ inputs as rows of a data matrix $\mathbf{X} \in \mathbb{R}^{N \times n}$. To apply the same layer to every row at once we compute $\mathbf{Z} = \mathbf{X}\mathbf{W}^\top + \mathbf{b} \in \mathbb{R}^{N \times m},$ where the bias $\mathbf{b}$ is added to every row of $\mathbf{X}\mathbf{W}^\top$ (in NumPy this happens by broadcasting). Why the transpose? $\mathbf{X}$ is $N \times n$ and $\mathbf{W}$ is $m \times n$; to make the inner dimensions meet we need the $n$ on the inside, so we use $\mathbf{W}^\top$ ($n \times m$): $(N \times n)\cdot(n \times m) \to (N \times m)$. Row $r$ of $\mathbf{Z}$ is exactly the layer applied to example $r$. This one transpose trick (data in rows, weights transposed) is the shape pattern at the core of most training loops.
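The claim that row $r$ of $\mathbf{Z}$ equals the layer applied to example $r$ can be tested directly. The sketch below uses random weights and small, arbitrary sizes (`N`, `n`, `m` are chosen only for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, m = 5, 4, 3                    # batch size, input features, output features

W = rng.normal(size=(m, n))          # weights, shape (m, n)
b = rng.normal(size=m)               # bias, shape (m,)
X = rng.normal(size=(N, n))          # one example per row, shape (N, n)

Z = X @ W.T + b                      # (N, n) @ (n, m) -> (N, m); b is broadcast over rows
assert Z.shape == (N, m)

# Row r of the batched result matches the single-example layer W x + b.
for r in range(N):
    assert np.allclose(Z[r], W @ X[r] + b)
print("ok")
```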

Attention. Queries $\mathbf{Q} \in \mathbb{R}^{N \times d}$ and keys $\mathbf{K} \in \mathbb{R}^{N \times d}$ produce a score matrix $\mathbf{S} = \mathbf{Q}\mathbf{K}^\top \in \mathbb{R}^{N \times N},$ one similarity per query–key pair. Shape-wise: $(N \times d)\cdot(d \times N) \to (N \times N)$ . You do not need to know what attention means yet to predict its output shape — that is the power of the inner/outer rule.
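As a shape-only sanity check, here is a sketch with random queries and keys. The sizes are arbitrary, and nothing beyond the raw score matrix is computed.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 6, 4                          # sequence length, head dimension (arbitrary)

Q = rng.normal(size=(N, d))
K = rng.normal(size=(N, d))

S = Q @ K.T                          # (N, d) @ (d, N) -> (N, N)
assert S.shape == (N, N)

# Entry (i, j) is the dot product of query i with key j.
assert np.isclose(S[2, 5], Q[2] @ K[5])
print(S.shape)  # (6, 6)
```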

Every one of these is the same operation. Master the shape rule once and you read model code like prose.

The transformation view

The second lab shows the operator picture directly: a $2 \times 2$ matrix bends the plane, and you watch a shape and the basis arrows move as you edit the four entries. Try making a pure rotation, a shear, and a reflection, and watch where the columns (the basis-vector images) land.

Interactive Lab: Matrix Transformation Visualizer

Set the matrix to the identity $\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$ and nothing moves — the "do nothing" operator. Make it diagonal and each axis scales on its own. This is the same $\mathbf{A}$ from the grid view, seen as an action.
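Without the interactive lab, the same experiment can be approximated in NumPy. The three matrices below are standard textbook examples of a rotation, a shear, and a reflection; the specific angle and the unit-square test shape are choices made for this sketch.

```python
import numpy as np

theta = np.pi / 2                                # 90 degrees counterclockwise
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
shear = np.array([[1.0, 1.0],
                  [0.0, 1.0]])                   # pushes points right in proportion to y
reflection = np.array([[1.0,  0.0],
                       [0.0, -1.0]])             # flips across the horizontal axis

# Corners of the unit square, one point per column.
square = np.array([[0.0, 1.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0, 1.0]])

for name, M in [("rotation", rotation), ("shear", shear), ("reflection", reflection)]:
    moved = M @ square                           # (2, 2) @ (2, 4) -> (2, 4)
    # The columns of M are where (1, 0) and (0, 1) land.
    print(name, "columns:", M[:, 0].round(3), M[:, 1].round(3))
    print(moved.round(3))
```

Applying the matrix to all four corners at once is itself a matrix product: the points sit in the columns of `square`, so each column is transformed independently.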

NumPy: two ways to the same product

In NumPy a matrix is a 2-D array; A.shape is (rows, cols). Let us compute a matrix product with an explicit triple loop (what the entry formula says literally), then with the @ operator, and confirm they agree. Run it:

matmul.py

import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])   # shape (2, 3)
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])        # shape (3, 2)

print("A shape:", A.shape)        # (2, 3)
print("B shape:", B.shape)        # (3, 2)

# Inner dims 3 and 3 match -> result is (2, 2) = outer dims.
m, n = A.shape
n2, p = B.shape
assert n == n2, "inner dimensions must match"

# 1) Triple loop: c[i, j] = sum_k A[i, k] * B[k, j]
C_loop = np.zeros((m, p))
for i in range(m):
    for j in range(p):
        for k in range(n):
            C_loop[i, j] += A[i, k] * B[k, j]

# 2) The @ operator (preferred)
C_at = A @ B

print("C shape:", C_at.shape)     # (2, 2)
assert np.allclose(C_loop, C_at), "loop and @ must agree"

# Transpose flips the shape; identity is a no-op.
print("A.T shape:", A.T.shape)    # (3, 2)
I = np.eye(3)                     # 3x3 identity
assert np.allclose(A @ I, A), "multiplying by I changes nothing"

print("ok")

The triple loop is $O(mnp)$ scalar multiply-adds, and it is only for seeing the formula. In real code you always write A @ B: NumPy dispatches to a tuned BLAS routine that runs the same arithmetic in compiled, cache-aware code, orders of magnitude faster. Prefer @; reach for loops only to explain.
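The summary's advice to keep (n,) distinct from (n, 1) is easiest to see by printing shapes. In the sketch below, `x_flat` and `x_col` are illustrative names for the same three numbers stored as a 1-D array and as a 2-D column.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])        # shape (2, 3)

x_flat = np.array([1.0, 0.0, -1.0])    # shape (3,): a 1-D array
x_col = x_flat.reshape(3, 1)           # shape (3, 1): a 2-D column matrix

print((A @ x_flat).shape)              # (2,)
print((A @ x_col).shape)               # (2, 1)

# Same numbers, different shapes.
assert np.allclose(A @ x_flat, (A @ x_col).ravel())

# Mixing the two in elementwise arithmetic broadcasts silently:
# (3,) + (3, 1) gives a (3, 3) grid, which is rarely what was intended.
print((x_flat + x_col).shape)          # (3, 3)
```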

Summary

A matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ is both a grid of numbers (entry $a_{ij}$ = A[i][j], row first) and an operator taking $n$-vectors to $m$-vectors.
Same-shape matrices add and scale entry-by-entry.
$\mathbf{A}\mathbf{x}$ reads two ways: as $m$ dot products of rows with $\mathbf{x}$, and as a linear combination of the columns of $\mathbf{A}$ weighted by $\mathbf{x}$.
$\mathbf{A}\mathbf{B}$ needs matching inner dimensions $(m \times n)\cdot(n \times p) \to (m \times p)$; track shapes through a chain by cancelling adjacent inner pairs. It is not commutative.
Transpose flips shape ($m \times n \to n \times m$) and reverses products; the identity does nothing, diagonal matrices scale axes, symmetric means $\mathbf{A} = \mathbf{A}^\top$, and the trace sums the diagonal.
In ML a layer is $\mathbf{W}\mathbf{x} + \mathbf{b}$, a batch is $\mathbf{X}\mathbf{W}^\top + \mathbf{b}$, and attention scores are $\mathbf{Q}\mathbf{K}^\top$, all read off by the shape rule. Use A @ B and keep (n,) distinct from (n, 1).

Active recall

Answer from memory before checking the lesson:

A has shape (4, 7) and B has shape (7, 3). What is the shape of A @ B? What about B @ A?
Give the two readings of $\mathbf{A}\mathbf{x}$ — one in terms of rows, one in terms of columns.
Why does a batched dense layer use $\mathbf{X}\mathbf{W}^\top$ rather than $\mathbf{X}\mathbf{W}$ ? Reason purely from shapes.
What is $\operatorname{tr}(\mathbf{A})$ , and for which matrices is it defined?
In NumPy, what does A[i][j] mean — row $i$ column $j$ , or column $i$ row $j$ ?

Exercises

Level A: Recall & basic calculation

Level A | Shape reasoning | ch08-A1

Shape of a matrix product

$\mathbf{A}$ has shape $(2, 3)$ and $\mathbf{B}$ has shape $(3, 4)$ . What is the shape of $\mathbf{A}\mathbf{B}$ ? Enter as (rows, cols).

Level A | Hand calculation | ch08-A2

Matrix–vector product by hand

Let $\mathbf{A} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$ and $\mathbf{x} = (1, 1)$ . Compute $\mathbf{A}\mathbf{x}$ . Enter as y1, y2.

Level A | Hand calculation | ch08-A3

Indexing an entry

For $\mathbf{A} = \begin{bmatrix} 2 & 4 & 6 \\ 8 & 10 & 12 \\ 14 & 16 & 18 \end{bmatrix}$ , what is A[1][2] using 0-based indexing (as in NumPy: row first, column second)?

Level A | Shape reasoning | ch08-A4

Shape after transpose

$\mathbf{A}$ has shape $(5, 2)$ . What is the shape of $\mathbf{A}^\top$ ? Enter as (rows, cols).

Level A | Hand calculation | ch08-A5

Matrix addition

Compute $\mathbf{A} + \mathbf{B}$ for $\mathbf{A} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$ and $\mathbf{B} = \begin{bmatrix} 10 & 20 \\ 30 & 40 \end{bmatrix}$ . Enter the entries row by row: a11, a12, a21, a22.

Level A | Hand calculation | ch08-A6

Trace of a matrix

Compute $\operatorname{tr}(\mathbf{A})$ for $\mathbf{A} = \begin{bmatrix} 3 & 1 & 0 \\ 2 & 5 & 7 \\ 1 & 0 & 4 \end{bmatrix}$ .

Level B: Conceptual understanding

Level B | Shape reasoning | ch08-B1

Tracking shapes through a chain

With $\mathbf{A}$ of shape $(3, 5)$ , $\mathbf{B}$ of shape $(5, 2)$ , and $\mathbf{C}$ of shape $(2, 7)$ , what is the shape of $\mathbf{A}\mathbf{B}\mathbf{C}$ ? Enter as (rows, cols).

Level B | Equation interpretation | ch08-B2

Why AB is not BA

Let $\mathbf{A}$ have shape $(2, 3)$ and $\mathbf{B}$ have shape $(3, 2)$ . Which statement is correct?

Level B | Hand calculation | ch08-B3

Ax as a combination of columns

Using the column view, compute $\mathbf{A}\mathbf{x}$ for $\mathbf{A} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$ and $\mathbf{x} = (2, 1)$ : form $x_1\mathbf{a}_1 + x_2\mathbf{a}_2$ . Enter as y1, y2.

Level B | Equation interpretation | ch08-B4

The identity does nothing

For $\mathbf{A}$ of shape $(4, 3)$ and $\mathbf{I}_3$ the $3 \times 3$ identity, what is $\mathbf{A}\mathbf{I}_3$ ?

Level B | ML application | ch08-B5

The batch layer shape

A dense layer has weights $\mathbf{W}$ of shape $(20, 10)$ (20 outputs, 10 inputs). A batch of data $\mathbf{X}$ has shape $(64, 10)$ (64 examples, each with 10 features). What is the shape of $\mathbf{X}\mathbf{W}^\top$ ? Enter as (rows, cols).

Level C: Derivation & implementation

Level C | NumPy implementation | ch08-C1

Implement matmul: loop vs @

Write matmul_loop(A, B) using an explicit triple loop over $i, j, k$ (the entry formula $c_{ij} = \sum_k a_{ik} b_{kj}$ ), and confirm it agrees with A @ B on random matrices with a fixed seed. Assert the output shape is (A.shape[0], B.shape[1]), then print ok.

Level C | NumPy implementation | ch08-C2

Matrix–vector product, both views

Implement matvec_rows(A, x) (stack of row dot products) and matvec_cols(A, x) (weighted sum of columns), and confirm both equal A @ x. Use a fixed seed with $\mathbf{A}$ of shape $(3, 4)$ and $\mathbf{x}$ of length $4$ . Print ok.

Level C | NumPy implementation | ch08-C3

Transpose reverses a product

Verify numerically that $(\mathbf{A}\mathbf{B})^\top = \mathbf{B}^\top \mathbf{A}^\top$ (and that the naive $\mathbf{A}^\top \mathbf{B}^\top$ generally is not equal, and may even be a shape error). Use a fixed seed with $\mathbf{A}$ of shape $(2, 3)$ and $\mathbf{B}$ of shape $(3, 4)$ . Print ok.

Level C | NumPy implementation | ch08-C4

A Gram matrix is symmetric

For any data matrix $\mathbf{X}$ of shape $(N, d)$ , show numerically that the Gram matrix $\mathbf{G} = \mathbf{X}^\top \mathbf{X}$ is square, has shape $(d, d)$ , and is symmetric ( $\mathbf{G} = \mathbf{G}^\top$ ). Use a fixed seed with $N = 5$ , $d = 3$ , and print ok.

Level D: Research-thinking challenge

Level D | Shape reasoning | ch08-D1

Attention, purely by shape

Self-attention computes $\operatorname{softmax}(\mathbf{Q}\mathbf{K}^\top)\mathbf{V}$ , where $\mathbf{Q}, \mathbf{K}, \mathbf{V}$ each have shape $(N, d)$ (here $N$ = sequence length, $d$ = head dimension). Softmax is applied row-wise and does not change shape. Working purely from the inner/outer rule, (a) give the shape of the score matrix $\mathbf{Q}\mathbf{K}^\top$ , (b) give the shape of the final output, and (c) explain in one sentence why the score matrix is $N \times N$ regardless of $d$ .

Level D | Error identification | ch08-D2

Debug a stacked-layer shape chain

An engineer stacks two dense layers on a batch. The data is $\mathbf{X}$ of shape $(64, 784)$ ; layer 1 has weights $\mathbf{W}_1$ of shape $(128, 784)$ ; layer 2 has weights $\mathbf{W}_2$ of shape $(10, 128)$ . They write Z = X @ W1 @ W2 and get a shape error on the first product. Explain why it fails, write the corrected expression using transposes, and give the shape after each matmul.