Implementing Mathematics with NumPy
From loops to vectorized operations, correctly
Prerequisites
Learning objectives
- Implement dot products, norms, and matmul with and without loops
- Vectorize numerical differentiation and feature normalization
- Write shape assertions that catch bugs early
- Compare loop vs vectorized versions for speed and readability
Why loops teach and vectorization runs
Every equation in this course has, up to now, lived on the page. This chapter is the bridge to code: we take the operations you already know by hand — dot product, matrix multiply, norm, distance, similarity, a layer's forward pass — and write each one twice. First as an explicit Python loop that reads like the summation in the definition, then as a single vectorized NumPy expression. We run both, assert they agree to floating-point tolerance, and note what we bought.
The two versions play different roles, and keeping them straight is the whole point:
- The loop is for understanding. It is the definition, transcribed. If you can write the loop, you know what the operation means — which indices pair up, which axis gets summed away, what shape falls out.
- The vectorized form is for running. It hands the inner loop to NumPy's
compiled C over contiguous memory, so it is typically one to two orders of
magnitude faster and, once you can read it, far shorter. Real ML code — every
forwardpass in PyTorch, JAX, or TensorFlow — is vectorized array ops all the way down. Nobody ships the loop.
There is a third character in this story who never leaves: the shape. The
single most common class of bug in numerical code is not a wrong formula, it is a
wrong shape — a (n,) where you meant (n, 1), a mean taken over the wrong
axis, an accidental outer product where you wanted an element-wise result. So
every cell below asserts the shapes it expects. Treat those assertions as
executable documentation.
Intuition: the loop is the spec, the array op is the machine
Picture the dot product . The loop
version says, out loud, "walk index from to , multiply the matching
entries, accumulate." That is the specification. The vectorized version, a @ b,
says nothing about how — it just names the operation and lets NumPy choose the
fastest march through memory. Same answer, different altitude.
The discipline that makes this safe is to always know two things before you run a line: what shape goes in, and what shape comes out. Vectorization does not free you from thinking about indices; it moves that thinking from inside the loop to the boundaries of the array — its shape. Master the boundary and the interior takes care of itself.
Formal note: vectorization and broadcasting
Throughout, we use the standard ML layout: a batch is a 2-D array of
shape — examples stacked along axis , each a -dimensional
feature vector along axis . We seed every random array with
np.random.default_rng(0) so the numbers, and therefore the assertions, are
reproducible.
| Symbol | Meaning | Type | Shape | Role |
|---|---|---|---|---|
| Number of examples in a batch | integer | 1 | fixed | |
| Number of input features per example | integer | 1 | fixed | |
| Number of output units (layer width) | integer | 1 | fixed | |
| A batch of inputs | matrix | (N, D) | input | |
| Layer weight matrix | matrix | (H, D) | parameter | |
| Bias vector (one per output unit) | vector | (H,) | parameter | |
| Batch of outputs, Y = XW^\top + b | matrix | (N, H) | output |
Dot product
Recap. For , the dot product is
— multiply matching entries,
add them up, get one scalar. The loop is that sum verbatim; the vectorized form
is a @ b.
The loop is the definition; a @ b is what you write in real code. For length
1000 the difference is milliseconds, but the vectorized form scales to millions
of elements without leaving C — the loop does not.
Matrix multiplication
Recap. For and ,
the product has shape with
— every output entry is a dot product of a
row of with a column of . The naive version is the triple loop that name
says; the vectorized version is A @ B. The inner dimensions must match
(), and that is the first thing to assert.
Three nested Python loops versus one operator. On real matrices the vectorized call dispatches to a tuned BLAS routine (blocked, cache-aware, multi-threaded) that the triple loop cannot begin to match — this is the single biggest speed win in the chapter.
L2 norm
Recap. The Euclidean length of is
.
The loop accumulates the sum of squares; the vectorized form is
np.linalg.norm(x) (or np.sqrt(x @ x)).
np.linalg.norm is not only faster, it is also more careful about overflow than
a naive square-and-add — another reason to prefer the library call over the loop
in production.
Euclidean distance matrix
Recap. Given points as rows of , the
pairwise distance matrix has shape with
. The loop
fills each entry with a norm; the vectorized form builds the differences with
broadcasting — X[:, None, :] - X[None, :, :] has shape — then sums
the squares over the last axis and takes the root.
The [:, None, :] / [None, :, :] trick is the workhorse pattern for turning a
per-pair loop into one broadcast. It trades Python iterations for a
single C sweep — at the cost of materializing the block, so watch
memory for large .
Cosine similarity
Recap. Cosine similarity measures orientation, ignoring magnitude: . It reuses the dot product and the norm, so the vectorized form is a one-liner — and a positively scaled copy of a vector must return exactly .
Cosine is the default similarity for embeddings precisely because of that scale invariance — it compares direction, not length. Note the vectorized version is built entirely from the two operations we just implemented; complex ML formulas are almost always compositions of a handful of primitives.
A linear layer: y = Wx + b for a batch
Recap. A dense layer maps each input to , with of shape and of shape . For a whole batch of shape , the clean vectorized form is , giving shape — the bias broadcasts across all rows. This is the arithmetic core of every neural network.
Three nested loops collapse to X @ W.T + b. Note the bias is (H,), not
(H, 1): broadcasting left-pads it to (1, H) and stretches it down all
rows automatically. Give it shape (H, 1) by mistake and you get a (N, H)
against (H, 1) broadcast that either errors or, worse, silently produces the
wrong shape.
Numerical differentiation (central difference)
Recap. When you cannot (or will not) differentiate by hand, approximate the derivative with a central difference: for a small step . Its error shrinks like , so it is far more accurate than the one-sided version. The loop evaluates it point by point; the vectorized form applies to the entire grid at once.
Because f is written in terms of array operations (x ** 3), the vectorized
central difference is the same formula with the loop deleted — NumPy evaluates
f on all nine points in one shot. Matching the analytic to atol=1e-4
confirms the approximation is doing its job.
Feature normalization (standardization)
Recap. Before training, features are usually standardized per column to
zero mean and unit variance: ,
where and are the per-column mean and
standard deviation. The axis matters enormously: statistics are taken over
axis=0 (down the examples), giving vectors of shape that then broadcast
across every row. Guard against a zero standard deviation (a constant column) so
you never divide by zero.
axis=0 is load-bearing: X.mean(axis=0) gives one number per feature (shape
(D,)), which is what standardization wants. Use axis=1 and you would center
each example against its own features — a completely different, and almost
always wrong, operation. This single character is one of the most common bugs in
preprocessing code.
Mean squared error
Recap. The mean squared error between targets and predictions
(both length ) is
. The loop sums the
squared residuals; the vectorized form is np.mean((y - yhat) ** 2).
Keep both operands 1-D of shape (N,). The classic trap here is writing
y[:, None] - yhat[None, :], which broadcasts to an outer difference
and averages over cross-pairs instead of the matched ones — it runs
without error and reports nonsense. assert (y - yhat).shape == y.shape catches
it instantly.
ML use case: normalize, then run the batched layer
The two most ML-relevant cells above are not incidental — they are the first two lines of a real training step, in order:
- Standardize the batch. Raw features arrive on wildly different scales (an
age in years next to an income in dollars).
Z = (X - mu) / sdwithaxis=0statistics puts every feature on a comparable footing, which stabilizes gradient descent and stops large-scale features from dominating the loss. In practice you computemuandsdon the training set and reuse them on validation and test data — never refit the statistics per split. - Push it through the layer.
Y = Z @ W.T + bmaps the normalized batch of shape to activations of shape in a single matmul-plus-bias. Stack a nonlinearity and another such layer and you have a neural network; the arithmetic never gets more exotic than what is in this chapter.
Everything else — the loss (an MSE or cross-entropy over the batch), the gradients, the update — is more of the same: array operations chosen so the shape of the output is exactly what the next step expects. Fluency in these primitives is fluency in ML code.
Consolidated warnings
Summary
- Write each operation twice: a Python loop that reads like the definition
(for understanding) and a vectorized NumPy expression (for running). Confirm
they agree with
np.allclose, then keep the vectorized form. - Vectorized code is typically one to two orders of magnitude faster because the inner loop runs in compiled C (matmul dispatches to tuned BLAS) instead of the Python interpreter — and it is shorter once you can read shapes fluently.
- Every operation has a predictable output shape: dot product, norm, and MSE give a scalar; matmul gives ; the distance matrix is ; a batched layer is . Assert it.
- Broadcasting turns per-pair and per-feature loops into single expressions:
X[:, None, :] - X[None, :, :]for distances,X - mufor centering. It fails silently when shapes are accidentally compatible, so assert output shapes. - The recurring bug classes are
(n,)vs(n, 1), a missing or wrongaxis=inmean/std, in-place mutation through a view, and integer-dtype truncation or overflow. Seed withnp.random.default_rng(0)for reproducibility and cast tofloat64before computing statistics.
Active recall
Answer from memory before checking the lesson:
- Write the dot product of as an explicit loop and as one NumPy expression. What shape is the result?
- For of shape
(3, 4)and of shape(4, 2), what shape isA @ B? Which dimension must match, and which one disappears? - In a batched layer
X @ W.T + b, what are the shapes ofX,W,b, and the output? Why must the bias be(H,)and not(H, 1)? - To standardize a feature matrix
Xof shape(N, D), over which axis do you take the mean and std, and what shape are they? What goes wrong with the other axis? - An engineer computes MSE as
np.mean((y[:, None] - yhat[None, :]) ** 2)and gets a suspiciously small number. What shape is the difference, and what single assertion would have caught the bug?
Exercises
Level ARecall & basic calculation
Shape of a matrix product
A has shape (3, 4) and B has shape (4, 2). What is the shape of A @ B? Enter it as a tuple, e.g. (r, c).
Dot product by hand
Compute the dot product a @ b for a = np.array([1.0, 2.0, 3.0]) and b = np.array([4.0, 5.0, 6.0]).
Mean squared error by hand
For targets y = np.array([3.0, 5.0]) and predictions yhat = np.array([1.0, 4.0]), compute the mean squared error .
Shape of a per-feature mean
X has shape (100, 4) — 100 examples, 4 features. What is the shape of X.mean(axis=0)? Enter as a tuple, e.g. (k,).
The other axis
Same X of shape (100, 4). What is the shape of X.mean(axis=1)? Enter as a tuple.
Which expression is the dot product?
For two 1-D arrays a and b of equal length, which expression computes their dot product as a single scalar?
Level BConceptual understanding
Shape of a distance matrix
X holds points as rows, shape (N, D). What is the shape of the pairwise Euclidean distance matrix ?
Debug a bias-shaped layer
In a batched layer, X is (N, D), W is (H, D), and an engineer writes b = np.zeros((H, 1)), then Y = X @ W.T + b. The output shape is not (N, H) as expected. What shape does Y actually have, why, and what is the one-character-idea fix?
Shape of the difference block
X has shape (6, 3). What is the shape of X[:, None, :] - X[None, :, :] (the broadcast used to build a distance matrix)? Enter as a tuple, e.g. (a, b, c).
Why guard the standard deviation?
Standardization computes Z = (X - mu) / sd with sd = X.std(axis=0). Explain in one or two sentences what happens if a feature column is constant, and how the guard sd = np.where(sd > 0, sd, 1.0) prevents it without distorting the data.
Loop vs vectorized: what actually gets faster?
An engineer replaces a Python for-loop dot product with a @ b and sees a large speedup. Which statement best explains why?
Level CDerivation & implementation
Implement per-column standardization
Write standardize(X) for a batch X of shape (N, D) that returns (X - mu) / sd using per-column statistics (axis=0), guarding against zero standard deviation. Verify the output has zero mean and unit std per column, assert the output shape equals the input shape, then print ok.
Batched linear layer: loop vs vectorized
Implement layer_loop(X, W, b) with explicit loops and layer_vec(X, W, b) as X @ W.T + b, for X of shape (N, D), W of shape (H, D), and b of shape (H,). Use a fixed seed, assert both outputs are (N, H) and agree with np.allclose, then print ok.
Debug a matmul shape error
An engineer has inputs X of shape (N, D) and weights W of shape (H, D) and writes Y = X @ W. It raises ValueError: matmul: ... (N,D) and (H,D). Explain precisely why matmul rejects these shapes, give the corrected expression, and state the resulting shape.
Vectorized Euclidean distance matrix
Implement dist_matrix(X) for X of shape (N, D) returning the (N, N) matrix of pairwise Euclidean distances, using broadcasting (no Python loop over pairs). Verify the diagonal is zero and the matrix is symmetric, assert the shape, then print ok.
Level DResearch-thinking challenge
When vectorization costs too much memory
The broadcast distance matrix materializes an (N, N, D) difference block. For and in float64, estimate that block's memory, explain why the fully vectorized one-liner becomes impractical, and describe a strategy that keeps most of the vectorization speed without allocating the full block.
Choosing the step size for a central difference
A central difference has truncation error , which suggests taking as small as possible. Yet in float64 a tiny makes the estimate worse. Explain the two competing error sources, why the total error is minimized at an intermediate (roughly near machine epsilon ), and what this implies for verifying analytic gradients in ML.