Part 5 · NumPy LaboratoryChapter 1980 min

Implementing Mathematics with NumPy

From loops to vectorized operations, correctly

Learning objectives

  • Implement dot products, norms, and matmul with and without loops
  • Vectorize numerical differentiation and feature normalization
  • Write shape assertions that catch bugs early
  • Compare loop vs vectorized versions for speed and readability

Why loops teach and vectorization runs

Every equation in this course has, up to now, lived on the page. This chapter is the bridge to code: we take the operations you already know by hand — dot product, matrix multiply, norm, distance, similarity, a layer's forward pass — and write each one twice. First as an explicit Python loop that reads like the summation in the definition, then as a single vectorized NumPy expression. We run both, assert they agree to floating-point tolerance, and note what we bought.

The two versions play different roles, and keeping them straight is the whole point:

  • The loop is for understanding. It is the definition, transcribed. If you can write the loop, you know what the operation means — which indices pair up, which axis gets summed away, what shape falls out.
  • The vectorized form is for running. It hands the inner loop to NumPy's compiled C over contiguous memory, so it is typically one to two orders of magnitude faster and, once you can read it, far shorter. Real ML code — every forward pass in PyTorch, JAX, or TensorFlow — is vectorized array ops all the way down. Nobody ships the loop.

There is a third character in this story who never leaves: the shape. The single most common class of bug in numerical code is not a wrong formula, it is a wrong shape — a (n,) where you meant (n, 1), a mean taken over the wrong axis, an accidental outer product where you wanted an element-wise result. So every cell below asserts the shapes it expects. Treat those assertions as executable documentation.

Intuition: the loop is the spec, the array op is the machine

Picture the dot product ab=iaibi\mathbf{a}\cdot\mathbf{b} = \sum_i a_i b_i. The loop version says, out loud, "walk index ii from 00 to n1n-1, multiply the matching entries, accumulate." That is the specification. The vectorized version, a @ b, says nothing about how — it just names the operation and lets NumPy choose the fastest march through memory. Same answer, different altitude.

The discipline that makes this safe is to always know two things before you run a line: what shape goes in, and what shape comes out. Vectorization does not free you from thinking about indices; it moves that thinking from inside the loop to the boundaries of the array — its shape. Master the boundary and the interior takes care of itself.

Formal note: vectorization and broadcasting

Throughout, we use the standard ML layout: a batch is a 2-D array XX of shape (N,D)(N, D)NN examples stacked along axis 00, each a DD-dimensional feature vector along axis 11. We seed every random array with np.random.default_rng(0) so the numbers, and therefore the assertions, are reproducible.

Dot product

Recap. For a,bRn\mathbf{a}, \mathbf{b} \in \mathbb{R}^n, the dot product is ab=i=1naibi\mathbf{a}\cdot\mathbf{b} = \sum_{i=1}^{n} a_i b_i — multiply matching entries, add them up, get one scalar. The loop is that sum verbatim; the vectorized form is a @ b.

dot_product.py

The loop is the definition; a @ b is what you write in real code. For length 1000 the difference is milliseconds, but the vectorized form scales to millions of elements without leaving C — the loop does not.

Matrix multiplication

Recap. For ARm×kA \in \mathbb{R}^{m\times k} and BRk×nB \in \mathbb{R}^{k\times n}, the product C=ABC = AB has shape (m,n)(m, n) with Cij=p=1kAipBpjC_{ij} = \sum_{p=1}^{k} A_{ip} B_{pj} — every output entry is a dot product of a row of AA with a column of BB. The naive version is the triple loop that name says; the vectorized version is A @ B. The inner dimensions must match (k=kk = k), and that is the first thing to assert.

matmul.py

Three nested Python loops versus one operator. On real matrices the vectorized call dispatches to a tuned BLAS routine (blocked, cache-aware, multi-threaded) that the triple loop cannot begin to match — this is the single biggest speed win in the chapter.

L2 norm

Recap. The Euclidean length of xRn\mathbf{x} \in \mathbb{R}^n is x2=ixi2=xx\lVert\mathbf{x}\rVert_2 = \sqrt{\sum_i x_i^2} = \sqrt{\mathbf{x}\cdot\mathbf{x}}. The loop accumulates the sum of squares; the vectorized form is np.linalg.norm(x) (or np.sqrt(x @ x)).

l2_norm.py

np.linalg.norm is not only faster, it is also more careful about overflow than a naive square-and-add — another reason to prefer the library call over the loop in production.

Euclidean distance matrix

Recap. Given NN points as rows of XRN×DX \in \mathbb{R}^{N\times D}, the pairwise distance matrix DmatD_{\text{mat}} has shape (N,N)(N, N) with Dmat[i,j]=xixj2D_{\text{mat}}[i, j] = \lVert \mathbf{x}_i - \mathbf{x}_j \rVert_2. The loop fills each entry with a norm; the vectorized form builds the differences with broadcasting — X[:, None, :] - X[None, :, :] has shape (N,N,D)(N, N, D) — then sums the squares over the last axis and takes the root.

distance_matrix.py

The [:, None, :] / [None, :, :] trick is the workhorse pattern for turning a per-pair loop into one broadcast. It trades O(N2D)O(N^2 D) Python iterations for a single C sweep — at the cost of materializing the (N,N,D)(N, N, D) block, so watch memory for large NN.

Cosine similarity

Recap. Cosine similarity measures orientation, ignoring magnitude: cosθ=abab\cos\theta = \dfrac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert}. It reuses the dot product and the norm, so the vectorized form is a one-liner — and a positively scaled copy of a vector must return exactly 11.

cosine.py

Cosine is the default similarity for embeddings precisely because of that scale invariance — it compares direction, not length. Note the vectorized version is built entirely from the two operations we just implemented; complex ML formulas are almost always compositions of a handful of primitives.

A linear layer: y = Wx + b for a batch

Recap. A dense layer maps each input xRD\mathbf{x} \in \mathbb{R}^D to y=Wx+bRH\mathbf{y} = W\mathbf{x} + \mathbf{b} \in \mathbb{R}^H, with WW of shape (H,D)(H, D) and b\mathbf{b} of shape (H,)(H,). For a whole batch XX of shape (N,D)(N, D), the clean vectorized form is Y=XW+bY = X W^\top + \mathbf{b}, giving shape (N,H)(N, H) — the bias broadcasts across all NN rows. This is the arithmetic core of every neural network.

linear_layer.py

Three nested loops collapse to X @ W.T + b. Note the bias is (H,), not (H, 1): broadcasting left-pads it to (1, H) and stretches it down all NN rows automatically. Give it shape (H, 1) by mistake and you get a (N, H) against (H, 1) broadcast that either errors or, worse, silently produces the wrong shape.

Numerical differentiation (central difference)

Recap. When you cannot (or will not) differentiate by hand, approximate the derivative with a central difference: f(x)f(x+h)f(xh)2hf'(x) \approx \dfrac{f(x+h) - f(x-h)}{2h} for a small step hh. Its error shrinks like O(h2)O(h^2), so it is far more accurate than the one-sided version. The loop evaluates it point by point; the vectorized form applies ff to the entire grid at once.

central_difference.py

Because f is written in terms of array operations (x ** 3), the vectorized central difference is the same formula with the loop deleted — NumPy evaluates f on all nine points in one shot. Matching the analytic 3x23x^2 to atol=1e-4 confirms the approximation is doing its job.

Feature normalization (standardization)

Recap. Before training, features are usually standardized per column to zero mean and unit variance: Z=XμσZ = \dfrac{X - \boldsymbol{\mu}}{\boldsymbol{\sigma}}, where μ\boldsymbol{\mu} and σ\boldsymbol{\sigma} are the per-column mean and standard deviation. The axis matters enormously: statistics are taken over axis=0 (down the examples), giving vectors of shape (D,)(D,) that then broadcast across every row. Guard against a zero standard deviation (a constant column) so you never divide by zero.

standardize.py

axis=0 is load-bearing: X.mean(axis=0) gives one number per feature (shape (D,)), which is what standardization wants. Use axis=1 and you would center each example against its own features — a completely different, and almost always wrong, operation. This single character is one of the most common bugs in preprocessing code.

Mean squared error

Recap. The mean squared error between targets y\mathbf{y} and predictions y^\hat{\mathbf{y}} (both length NN) is MSE=1Ni=1N(yiy^i)2\mathrm{MSE} = \dfrac{1}{N}\sum_{i=1}^{N}(y_i - \hat y_i)^2. The loop sums the squared residuals; the vectorized form is np.mean((y - yhat) ** 2).

mse.py

Keep both operands 1-D of shape (N,). The classic trap here is writing y[:, None] - yhat[None, :], which broadcasts to an (N,N)(N, N) outer difference and averages over N2N^2 cross-pairs instead of the NN matched ones — it runs without error and reports nonsense. assert (y - yhat).shape == y.shape catches it instantly.

ML use case: normalize, then run the batched layer

The two most ML-relevant cells above are not incidental — they are the first two lines of a real training step, in order:

  1. Standardize the batch. Raw features arrive on wildly different scales (an age in years next to an income in dollars). Z = (X - mu) / sd with axis=0 statistics puts every feature on a comparable footing, which stabilizes gradient descent and stops large-scale features from dominating the loss. In practice you compute mu and sd on the training set and reuse them on validation and test data — never refit the statistics per split.
  2. Push it through the layer. Y = Z @ W.T + b maps the normalized batch of shape (N,D)(N, D) to activations of shape (N,H)(N, H) in a single matmul-plus-bias. Stack a nonlinearity and another such layer and you have a neural network; the arithmetic never gets more exotic than what is in this chapter.

Everything else — the loss (an MSE or cross-entropy over the batch), the gradients, the update — is more of the same: array operations chosen so the shape of the output is exactly what the next step expects. Fluency in these primitives is fluency in ML code.

Consolidated warnings

Summary

  • Write each operation twice: a Python loop that reads like the definition (for understanding) and a vectorized NumPy expression (for running). Confirm they agree with np.allclose, then keep the vectorized form.
  • Vectorized code is typically one to two orders of magnitude faster because the inner loop runs in compiled C (matmul dispatches to tuned BLAS) instead of the Python interpreter — and it is shorter once you can read shapes fluently.
  • Every operation has a predictable output shape: dot product, norm, and MSE give a scalar; matmul (m,k)×(k,n)(m,k)\times(k,n) gives (m,n)(m,n); the distance matrix is (N,N)(N, N); a batched layer XW+bXW^\top + \mathbf{b} is (N,H)(N, H). Assert it.
  • Broadcasting turns per-pair and per-feature loops into single expressions: X[:, None, :] - X[None, :, :] for distances, X - mu for centering. It fails silently when shapes are accidentally compatible, so assert output shapes.
  • The recurring bug classes are (n,) vs (n, 1), a missing or wrong axis= in mean/std, in-place mutation through a view, and integer-dtype truncation or overflow. Seed with np.random.default_rng(0) for reproducibility and cast to float64 before computing statistics.

Active recall

Answer from memory before checking the lesson:

  1. Write the dot product of a,bRn\mathbf{a}, \mathbf{b} \in \mathbb{R}^n as an explicit loop and as one NumPy expression. What shape is the result?
  2. For AA of shape (3, 4) and BB of shape (4, 2), what shape is A @ B? Which dimension must match, and which one disappears?
  3. In a batched layer X @ W.T + b, what are the shapes of X, W, b, and the output? Why must the bias be (H,) and not (H, 1)?
  4. To standardize a feature matrix X of shape (N, D), over which axis do you take the mean and std, and what shape are they? What goes wrong with the other axis?
  5. An engineer computes MSE as np.mean((y[:, None] - yhat[None, :]) ** 2) and gets a suspiciously small number. What shape is the difference, and what single assertion would have caught the bug?

Exercises

Level ARecall & basic calculation

Level BConceptual understanding

Level CDerivation & implementation

Level DResearch-thinking challenge