Implementing Mathematics with NumPy

Why loops teach and vectorization runs

Every equation in this course has, up to now, lived on the page. This chapter is the bridge to code: we take the operations you already know by hand — dot product, matrix multiply, norm, distance, similarity, a layer's forward pass — and write each one twice. First as an explicit Python loop that reads like the summation in the definition, then as a single vectorized NumPy expression. We run both, assert they agree to floating-point tolerance, and note what we bought.

The two versions play different roles, and keeping them straight is the whole point:

The loop is for understanding. It is the definition, transcribed. If you can write the loop, you know what the operation means — which indices pair up, which axis gets summed away, what shape falls out.
The vectorized form is for running. It hands the inner loop to NumPy's compiled C over contiguous memory, so it is typically one to two orders of magnitude faster and, once you can read it, far shorter. Real ML code — every forward pass in PyTorch, JAX, or TensorFlow — is vectorized array ops all the way down. Nobody ships the loop.

There is a third character in this story who never leaves: the shape. The single most common class of bug in numerical code is not a wrong formula, it is a wrong shape — a (n,) where you meant (n, 1), a mean taken over the wrong axis, an accidental outer product where you wanted an element-wise result. So every cell below asserts the shapes it expects. Treat those assertions as executable documentation.

Intuition: the loop is the spec, the array op is the machine

Picture the dot product $\mathbf{a}\cdot\mathbf{b} = \sum_i a_i b_i$. The loop version says, out loud, "walk index $i$ from $0$ to $n-1$, multiply the matching entries, accumulate." That is the specification. The vectorized version, a @ b, says nothing about how: it just names the operation and lets NumPy choose the fastest march through memory. Same answer, different altitude.

The discipline that makes this safe is to always know two things before you run a line: what shape goes in, and what shape comes out. Vectorization does not free you from thinking about indices; it moves that thinking from inside the loop to the boundaries of the array — its shape. Master the boundary and the interior takes care of itself.

Formal note: vectorization and broadcasting

Throughout, we use the standard ML layout: a batch is a 2-D array $X$ of shape $(N, D)$, with $N$ examples stacked along axis $0$, each a $D$-dimensional feature vector along axis $1$. We seed every random array with np.random.default_rng(0) so the numbers, and therefore the assertions, are reproducible.

Symbol	Meaning	Type	Shape	Role
$N$	Number of examples in a batch	integer	scalar	fixed
$D$	Number of input features per example	integer	scalar	fixed
$H$	Number of output units (layer width)	integer	scalar	fixed
$X$	A batch of inputs	matrix	(N, D)	input
$W$	Layer weight matrix	matrix	(H, D)	parameter
$\mathbf{b}$	Bias vector (one per output unit)	vector	(H,)	parameter
$Y$	Batch of outputs, $Y = XW^\top + \mathbf{b}$	matrix	(N, H)	output
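
The shapes in the table combine through broadcasting, and the rule is short: align the shapes from the right, and each pair of sizes must either be equal or contain a 1. The sketch below checks that rule with np.broadcast_shapes (available in NumPy 1.20 and later) without allocating any data. The sizes chosen for N, D, and H are arbitrary illustrative values.

```python
import numpy as np

N, D, H = 5, 4, 3  # illustrative sizes, not taken from any dataset

# Shapes are aligned from the right; each pair of sizes must be equal or contain a 1.
row_stats = np.broadcast_shapes((N, D), (D,))          # (D,) is treated as (1, D)
pairwise = np.broadcast_shapes((N, 1, D), (1, N, D))   # both 1s stretch to N
bias_add = np.broadcast_shapes((N, H), (H,))           # bias added to every row

print(row_stats, pairwise, bias_add)

# Incompatible trailing sizes raise instead of guessing.
try:
    np.broadcast_shapes((N, D), (N,))
    compatible = True
except ValueError:
    compatible = False
print("(N, D) with (N,) compatible:", compatible)
```

The last case is the one to remember: a length-N vector does not line up with the rows of an (N, D) batch unless it is first reshaped to (N, 1).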

Dot product

Recap. For $\mathbf{a}, \mathbf{b} \in \mathbb{R}^n$, the dot product is $\mathbf{a}\cdot\mathbf{b} = \sum_{i=1}^{n} a_i b_i$: multiply matching entries, add them up, get one scalar. The loop is that sum verbatim; the vectorized form is a @ b.

dot_product.py

import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(1000)        # shape (1000,)
b = rng.standard_normal(1000)        # shape (1000,)
assert a.shape == b.shape == (1000,)

# 1) Naive loop -- the summation in the definition, transcribed.
def dot_loop(x, y):
    assert x.shape == y.shape, "dot needs equal-length 1-D vectors"
    s = 0.0
    for i in range(x.shape[0]):
        s += x[i] * y[i]
    return s

loop = dot_loop(a, b)

# 2) Vectorized: the @ operator hands the multiply-add to compiled C.
vec = a @ b

# The result is a scalar (0-D), no matter how large n is.
print("vec is scalar:", np.ndim(vec) == 0)   # True
assert np.isclose(loop, vec), "loop and vectorized must agree"
print("token:", "dot ok")

The loop is the definition; a @ b is what you write in real code. At length 1000 the absolute gap is well under a millisecond and easy to ignore, but the vectorized form scales to millions of elements without leaving C, while the loop pays Python interpreter overhead on every element.
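
To see the gap rather than take it on faith, a rough timing sketch follows. It uses the standard-library timeit module on a longer vector (100,000 entries, an arbitrary choice) and redefines dot_loop so the cell is self-contained. The absolute numbers depend on your machine and NumPy build, so none are quoted here.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(100_000)
b = rng.standard_normal(100_000)

def dot_loop(x, y):
    s = 0.0
    for i in range(x.shape[0]):
        s += x[i] * y[i]
    return s

# Average wall-clock time over three runs of each version.
t_loop = timeit.timeit(lambda: dot_loop(a, b), number=3) / 3
t_vec = timeit.timeit(lambda: a @ b, number=3) / 3

assert np.isclose(dot_loop(a, b), a @ b)
print(f"loop: {t_loop:.6f} s   vectorized: {t_vec:.6f} s")
```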

Matrix multiplication

Recap. For $A \in \mathbb{R}^{m\times k}$ and $B \in \mathbb{R}^{k\times n}$, the product $C = AB$ has shape $(m, n)$ with $C_{ij} = \sum_{p=1}^{k} A_{ip} B_{pj}$: every output entry is a dot product of a row of $A$ with a column of $B$. The naive version is the triple loop the formula spells out (one loop each for $i$, $j$, and $p$); the vectorized version is A @ B. The inner dimensions must match (the number of columns of $A$ must equal the number of rows of $B$), and that is the first thing to assert.

matmul.py

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))      # (m, k) = (3, 4)
B = rng.standard_normal((4, 2))      # (k, n) = (4, 2)

def matmul_loop(A, B):
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, f"inner dims must match: {k} vs {k2}"
    C = np.zeros((m, n))
    for i in range(m):               # each output row
        for j in range(n):           # each output col
            for p in range(k):       # the shared inner index (a dot product)
                C[i, j] += A[i, p] * B[p, j]
    return C

C_loop = matmul_loop(A, B)
C_vec = A @ B

# Output shape drops the shared inner dim: (m, k) @ (k, n) -> (m, n).
assert C_loop.shape == (3, 2)
assert C_vec.shape == (3, 2)
assert np.allclose(C_loop, C_vec), "triple loop must match A @ B"
print("token:", "matmul ok")

Three nested Python loops versus one operator. On real matrices the vectorized call dispatches to a tuned BLAS routine (blocked, cache-aware, multi-threaded) that the triple loop cannot begin to match — this is the single biggest speed win in the chapter.
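
If you want the index bookkeeping of the triple loop without the loop itself, np.einsum lets you write the subscripts of the definition directly. The sketch below is one way to connect the formula to the operator; it is an illustration, not a suggestion to replace A @ B in real code.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))   # (m, k)
B = rng.standard_normal((4, 2))   # (k, n)

# "ip,pj->ij": p appears in both inputs but not in the output, so it is summed away.
C_einsum = np.einsum("ip,pj->ij", A, B)

# A single entry, built as one row of A dotted with one column of B.
c_01 = A[0, :] @ B[:, 1]

assert C_einsum.shape == (3, 2)
assert np.allclose(C_einsum, A @ B)
assert np.isclose(c_01, (A @ B)[0, 1])
print("einsum ok")
```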

L2 norm

Recap. The Euclidean length of $\mathbf{x} \in \mathbb{R}^n$ is $\lVert\mathbf{x}\rVert_2 = \sqrt{\sum_i x_i^2} = \sqrt{\mathbf{x}\cdot\mathbf{x}}$. The loop accumulates the sum of squares; the vectorized form is np.linalg.norm(x) (or np.sqrt(x @ x)).

l2_norm.py

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(500)         # shape (500,)

def norm_loop(v):
    s = 0.0
    for i in range(v.shape[0]):
        s += v[i] ** 2               # accumulate squares
    return s ** 0.5                  # then take the root

loop = norm_loop(x)
vec = np.linalg.norm(x)              # preferred; also np.sqrt(x @ x)

# A norm is a single nonnegative scalar.
print("nonnegative:", vec >= 0.0)    # True
assert np.isclose(loop, vec), "loop and np.linalg.norm must agree"
print("token:", "norm ok")

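np.linalg.norm is faster and states the intent directly, which is reason enough to prefer it over the loop in production. It is not an overflow guard, though: for a 1-D input its default 2-norm is computed as the square root of a plain sum of squares, so entries whose squares exceed the float64 range overflow to inf just as the naive loop does. If your data can be that large, rescale it first.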
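
Overflow is a real concern when entries approach the float64 limit: squaring 1e200 already exceeds the largest representable value. A minimal sketch of one common remedy is to factor out the largest magnitude before squaring. The helper name norm_scaled is hypothetical (it is not a NumPy function), and the sketch assumes a non-empty 1-D float array.

```python
import numpy as np

def norm_scaled(v):
    """L2 norm that factors out the largest magnitude before squaring."""
    scale = np.max(np.abs(v))
    if scale == 0.0:
        return 0.0
    return scale * np.sqrt(np.sum((v / scale) ** 2))

big = np.array([1e200, 1e200])

with np.errstate(over="ignore"):          # silence the expected overflow warning
    naive = np.sqrt(np.sum(big ** 2))     # 1e400 is not representable in float64

safe = norm_scaled(big)
print("naive:", naive)
print("scaled:", safe)
```

After dividing by the largest magnitude every entry lies in [-1, 1], so the squares cannot overflow; the scale is multiplied back in at the end.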

Euclidean distance matrix

Recap. Given $N$ points as rows of $X \in \mathbb{R}^{N\times D}$, the pairwise distance matrix $D_{\text{mat}}$ has shape $(N, N)$ with $D_{\text{mat}}[i, j] = \lVert \mathbf{x}_i - \mathbf{x}_j \rVert_2$. The loop fills each entry with a norm; the vectorized form builds the differences with broadcasting (X[:, None, :] - X[None, :, :] has shape $(N, N, D)$), then sums the squares over the last axis and takes the root.

distance_matrix.py

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))      # 6 points in R^3: shape (N, D) = (6, 3)
N = X.shape[0]

def dist_loop(X):
    N = X.shape[0]
    dist = np.zeros((N, N))          # named dist so it is not confused with the feature count D
    for i in range(N):
        for j in range(N):
            diff = X[i] - X[j]       # shape (D,), here (3,)
            dist[i, j] = np.sqrt(diff @ diff)
    return dist

# Vectorized: broadcast to an (N, N, D) block of differences, reduce last axis.
diff = X[:, None, :] - X[None, :, :]     # (N, 1, D) - (1, N, D) -> (N, N, D)
D_vec = np.sqrt((diff ** 2).sum(axis=2)) # sum over D -> (N, N)
D_loop = dist_loop(X)

assert diff.shape == (N, N, 3)
assert D_vec.shape == (N, N)
assert np.allclose(D_loop, D_vec), "loop and broadcast distances must agree"
assert np.allclose(np.diag(D_vec), 0.0)  # distance to self is 0
assert np.allclose(D_vec, D_vec.T)       # distance is symmetric
print("token:", "distmat ok")

The [:, None, :] / [None, :, :] trick is the workhorse pattern for turning a per-pair loop into one broadcast. It trades $N^2$ Python-level iterations (each doing $O(D)$ work) for a single C sweep, at the cost of materializing the $(N, N, D)$ block, so watch memory for large $N$.
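
It helps to look at the three shapes involved one at a time, and to count the bytes the intermediate block occupies. The sketch below does both for the same small X as above; the indices 2 and 5 are arbitrary picks used only to spot-check one pair.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))          # (N, D) = (6, 3)
N, D = X.shape

left = X[:, None, :]                     # (N, 1, D): point i varies along axis 0
right = X[None, :, :]                    # (1, N, D): point j varies along axis 1
diff = left - right                      # (N, N, D): every pair (i, j)

assert left.shape == (N, 1, D)
assert right.shape == (1, N, D)
assert diff.shape == (N, N, D)
assert np.allclose(diff[2, 5], X[2] - X[5])   # entry (i, j) is x_i - x_j

# The block holds N * N * D float64 values at 8 bytes each.
assert diff.nbytes == N * N * D * 8
print("block bytes:", diff.nbytes)       # 864 for N=6, D=3
```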

Cosine similarity

Recap. Cosine similarity measures orientation, ignoring magnitude: $\cos\theta = \dfrac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert}$. It reuses the dot product and the norm, so the vectorized form is a one-liner, and a positively scaled copy of a vector must return $1$ up to floating-point rounding.

cosine.py

import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(256)         # e.g. two embeddings, shape (256,)
b = rng.standard_normal(256)

def cosine_loop(x, y):
    dot = 0.0
    nx = 0.0
    ny = 0.0
    for i in range(x.shape[0]):
        dot += x[i] * y[i]
        nx += x[i] ** 2
        ny += y[i] ** 2
    return dot / (nx ** 0.5 * ny ** 0.5)

def cosine_vec(x, y):
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

loop = cosine_loop(a, b)
vec = cosine_vec(a, b)

assert np.isclose(loop, vec), "loop and vectorized cosine must agree"
# Scale invariance: a positive rescale does not change direction.
assert np.isclose(cosine_vec(a, 3.0 * a), 1.0)
print("token:", "cosine ok")

Cosine is the default similarity for embeddings precisely because of that scale invariance — it compares direction, not length. Note the vectorized version is built entirely from the two operations we just implemented; complex ML formulas are almost always compositions of a handful of primitives.
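
The same composition extends from one pair to a whole batch. As a sketch, assuming a matrix E of embeddings stored as rows with no all-zero row, normalizing each row to unit length turns every pairwise cosine into a plain matrix product. The array E and its sizes are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.standard_normal((5, 16))     # 5 hypothetical embeddings of dimension 16

norms = np.linalg.norm(E, axis=1, keepdims=True)   # (5, 1): one length per row
E_unit = E / norms                                  # each row now has length 1
S = E_unit @ E_unit.T                               # (5, 5) cosine similarities

def cosine_vec(x, y):
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

assert S.shape == (5, 5)
assert np.allclose(np.diag(S), 1.0)                 # every vector matches itself
assert np.allclose(S, S.T)                          # similarity is symmetric
assert np.isclose(S[1, 3], cosine_vec(E[1], E[3]))  # agrees with the pairwise form
print("cosine matrix ok")
```

The keepdims=True is what keeps norms at shape (5, 1) so the division broadcasts across each row rather than failing on a (5, 16) against (5,) mismatch.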

A linear layer: y = Wx + b for a batch

Recap. A dense layer maps each input $\mathbf{x} \in \mathbb{R}^D$ to $\mathbf{y} = W\mathbf{x} + \mathbf{b} \in \mathbb{R}^H$, with $W$ of shape $(H, D)$ and $\mathbf{b}$ of shape $(H,)$. For a whole batch $X$ of shape $(N, D)$, the clean vectorized form is $Y = X W^\top + \mathbf{b}$, giving shape $(N, H)$, with the bias broadcast across all $N$ rows. This is the arithmetic core of every neural network.

linear_layer.py

import numpy as np

rng = np.random.default_rng(0)
N, D, H = 5, 4, 3
X = rng.standard_normal((N, D))      # batch of inputs (N, D)
W = rng.standard_normal((H, D))      # weights (H, D)
b = rng.standard_normal(H)           # bias (H,), NOT (H, 1)

def layer_loop(X, W, b):
    N, D = X.shape
    H, D2 = W.shape
    assert D == D2, f"W input dim {D2} must match X feature dim {D}"
    Y = np.zeros((N, H))
    for n in range(N):               # each example
        for h in range(H):           # each output unit
            acc = b[h]
            for d in range(D):       # weighted sum of features (a dot product)
                acc += W[h, d] * X[n, d]
            Y[n, h] = acc
    return Y

# Vectorized: one matmul plus a broadcast add. b of shape (H,) broadcasts
# across all N rows of X @ W.T (shape (N, H)).
Y_vec = X @ W.T + b
Y_loop = layer_loop(X, W, b)

assert Y_vec.shape == (N, H)
assert Y_loop.shape == (N, H)
assert np.allclose(Y_loop, Y_vec), "triple loop must match X @ W.T + b"
print("token:", "layer ok")

Three nested loops collapse to X @ W.T + b. Note the bias is (H,), not (H, 1): broadcasting left-pads it to (1, H) and stretches it down all $N$ rows automatically. Give it shape (H, 1) by mistake and NumPy tries to broadcast (N, H) against (H, 1): when $N \neq H$ (and $N \neq 1$) that raises a ValueError, but when $N = H$ it runs and, worse, silently adds the bias along the wrong axis, so the shape looks right while the values are wrong.
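
Both outcomes of a mis-shaped bias are easy to reproduce. The sketch below builds a deliberately wrong (H, 1) bias, here called b_col (a name chosen for this illustration), and tries it against a batch with N different from H and a batch with N equal to H.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 4, 3
W = rng.standard_normal((H, D))
b = rng.standard_normal(H)           # correct shape (H,)
b_col = b[:, None]                   # mistaken shape (H, 1)

# Case 1: N != H. (N, H) + (H, 1) cannot broadcast, so NumPy raises.
X5 = rng.standard_normal((5, D))
try:
    X5 @ W.T + b_col
    raised = False
except ValueError:
    raised = True

# Case 2: N == H. The shapes happen to line up, so the bug is silent.
X3 = rng.standard_normal((H, D))
Y_right = X3 @ W.T + b               # bias b[h] added to column h
Y_wrong = X3 @ W.T + b_col           # bias b[n] added to row n instead

print("raised for N=5:", raised)
print("same shape:", Y_right.shape == Y_wrong.shape)
print("same values:", np.allclose(Y_right, Y_wrong))
```

The second case is the dangerous one: a shape assertion alone passes, which is why it is worth also comparing against the loop version on a small example.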

Numerical differentiation (central difference)

Recap. When you cannot (or will not) differentiate by hand, approximate the derivative with a central difference: $f'(x) \approx \dfrac{f(x+h) - f(x-h)}{2h}$ for a small step $h$. Its truncation error shrinks like $O(h^2)$, versus $O(h)$ for the one-sided difference, so it is far more accurate at the same step size. The loop evaluates it point by point; the vectorized form applies $f$ to the entire grid at once.

central_difference.py

import numpy as np

def f(x):
    return x ** 3                    # analytic derivative: 3 x^2

xs = np.linspace(-2.0, 2.0, 9)       # grid of points, shape (9,)
h = 1e-5

def deriv_loop(f, xs, h):
    g = np.empty_like(xs)
    for i in range(xs.shape[0]):
        g[i] = (f(xs[i] + h) - f(xs[i] - h)) / (2 * h)
    return g

# Vectorized: f is applied element-wise, so no explicit loop is needed.
g_vec = (f(xs + h) - f(xs - h)) / (2 * h)
g_loop = deriv_loop(f, xs, h)
analytic = 3 * xs ** 2               # ground truth, shape (9,)

assert g_vec.shape == xs.shape
assert np.allclose(g_loop, g_vec)
assert np.allclose(g_vec, analytic, atol=1e-4)   # matches calculus
print("token:", "deriv ok")

Because f is written in terms of array operations (x ** 3), the vectorized central difference is the same formula with the loop deleted — NumPy evaluates f on all nine points in one shot. Matching the analytic $3x^2$ to atol=1e-4 confirms the approximation is doing its job.
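
The accuracy advantage of the central difference over a one-sided one can be checked numerically. The sketch below compares the two on the same f at the single point x = 1 for two step sizes, one half the other; the helper names forward_diff and central_diff are introduced here for illustration. Halving the step should roughly halve a first-order error and roughly quarter a second-order one.

```python
import numpy as np

def f(x):
    return x ** 3                    # analytic derivative at x = 1 is 3

def forward_diff(f, x, h):
    return (f(x + h) - f(x)) / h

def central_diff(f, x, h):
    return (f(x + h) - f(x - h)) / (2 * h)

x0 = 1.0
hs = np.array([1e-2, 5e-3])          # the second step is half the first

# Both helpers are vectorized over h because f uses array operations.
err_fwd = np.abs(forward_diff(f, x0, hs) - 3.0)
err_cen = np.abs(central_diff(f, x0, hs) - 3.0)

ratio_fwd = err_fwd[0] / err_fwd[1]  # close to 2 for a first-order method
ratio_cen = err_cen[0] / err_cen[1]  # close to 4 for a second-order method
print("forward errors:", err_fwd, "ratio:", ratio_fwd)
print("central errors:", err_cen, "ratio:", ratio_cen)
```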

Feature normalization (standardization)

Recap. Before training, features are usually standardized per column to zero mean and unit variance: $Z = \dfrac{X - \boldsymbol{\mu}}{\boldsymbol{\sigma}}$, where $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ are the per-column mean and standard deviation. The axis matters enormously: statistics are taken over axis=0 (down the examples), giving vectors of shape $(D,)$ that then broadcast across every row. Guard against a zero standard deviation (a constant column) so you never divide by zero.

standardize.py

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4))    # batch (N, D) = (100, 4)
X[:, 3] *= 50.0                      # make column 3 wildly larger-scaled

def standardize_loop(X):
    N, D = X.shape
    out = np.empty_like(X, dtype=float)
    for j in range(D):               # one column at a time
        col = X[:, j]
        mu = col.mean()
        sd = col.std()               # population std (ddof=0)
        sd = sd if sd > 0 else 1.0   # guard: never divide by zero
        out[:, j] = (col - mu) / sd
    return out

# Vectorized: axis=0 collapses the N examples, leaving per-feature (D,) stats.
mu = X.mean(axis=0)                  # shape (4,), one mean per feature
sd = X.std(axis=0)                   # shape (4,), one std per feature
sd = np.where(sd > 0, sd, 1.0)      # guard columns with no variation
Z = (X - mu) / sd                    # ((100, 4) - (4,)) / (4,) -> (100, 4)
Z_loop = standardize_loop(X)

assert mu.shape == (4,) and sd.shape == (4,)
assert Z.shape == X.shape
assert np.allclose(Z_loop, Z)
assert np.allclose(Z.mean(axis=0), 0.0)   # each column now centered
assert np.allclose(Z.std(axis=0), 1.0)    # each column now unit-scaled
print("token:", "standardize ok")

axis=0 is load-bearing: X.mean(axis=0) gives one number per feature (shape (D,)), which is what standardization wants. Use axis=1 and you would center each example against its own features — a completely different, and almost always wrong, operation. This single character is one of the most common bugs in preprocessing code.
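
A quick way to internalize the axis argument is to reduce the same array three ways and look at what is left. The sketch below also shows that subtracting a per-example mean does not even broadcast unless the reduced axis is kept with keepdims=True, which is a useful early warning that the wrong axis was chosen.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4))        # (N, D)

per_feature = X.mean(axis=0)             # axis 0 disappears -> (4,)
per_example = X.mean(axis=1)             # axis 1 disappears -> (100,)
overall = X.mean()                       # every axis disappears -> scalar

assert per_feature.shape == (4,)
assert per_example.shape == (100,)
assert np.ndim(overall) == 0

# (100, 4) - (4,) broadcasts; (100, 4) - (100,) does not.
centered_cols = X - per_feature
try:
    X - per_example
    row_centering_ran = True
except ValueError:
    row_centering_ran = False

# keepdims=True retains the reduced axis as size 1, so row-wise centering broadcasts.
centered_rows = X - X.mean(axis=1, keepdims=True)   # (100, 4) - (100, 1)
assert np.allclose(centered_cols.mean(axis=0), 0.0)
assert np.allclose(centered_rows.mean(axis=1), 0.0)
print("row centering without keepdims ran:", row_centering_ran)
```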

Mean squared error

Recap. The mean squared error between targets $\mathbf{y}$ and predictions $\hat{\mathbf{y}}$ (both length $N$) is $\mathrm{MSE} = \dfrac{1}{N}\sum_{i=1}^{N}(y_i - \hat y_i)^2$. The loop sums the squared residuals; the vectorized form is np.mean((y - yhat) ** 2).

mse.py

import numpy as np

rng = np.random.default_rng(0)
y = rng.standard_normal(1000)        # targets, shape (1000,)
yhat = y + 0.1 * rng.standard_normal(1000)   # noisy predictions, shape (1000,)

def mse_loop(y, yhat):
    assert y.shape == yhat.shape, "y and yhat must share shape"
    s = 0.0
    for i in range(y.shape[0]):
        s += (y[i] - yhat[i]) ** 2
    return s / y.shape[0]

loop = mse_loop(y, yhat)

# Vectorized: subtract, square, average -- all element-wise, no loop.
vec = np.mean((y - yhat) ** 2)

# MSE is a single nonnegative scalar.
assert np.ndim(vec) == 0 and vec >= 0.0
assert np.isclose(loop, vec), "loop and vectorized MSE must agree"
print("token:", "mse ok")

Keep both operands 1-D of shape (N,). The classic trap here is writing y[:, None] - yhat[None, :], which broadcasts to an $(N, N)$ outer difference and averages over $N^2$ cross-pairs instead of the $N$ matched ones — it runs without error and reports nonsense. assert (y - yhat).shape == y.shape catches it instantly.
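
The trap is worth running once so you recognize its signature. The sketch below computes the matched MSE and the accidental cross-pair version on the same data as the cell above; because unmatched targets and predictions are unrelated, the cross-pair average comes out far larger than the true error.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.standard_normal(1000)
yhat = y + 0.1 * rng.standard_normal(1000)

right = np.mean((y - yhat) ** 2)                     # N matched pairs
outer = y[:, None] - yhat[None, :]                   # (N, N) cross-pairs
wrong = np.mean(outer ** 2)                          # averages N * N entries

assert (y - yhat).shape == y.shape                   # the guard from the text
assert outer.shape == (1000, 1000)
print(f"matched MSE: {right:.4f}   cross-pair MSE: {wrong:.4f}")
```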

ML use case: normalize, then run the batched layer

The two most ML-relevant cells above are not incidental — they are the first two lines of a real training step, in order:

Standardize the batch. Raw features arrive on wildly different scales (an age in years next to an income in dollars). Z = (X - mu) / sd with axis=0 statistics puts every feature on a comparable footing, which stabilizes gradient descent and stops large-scale features from dominating the loss. In practice you compute mu and sd on the training set and reuse them on validation and test data — never refit the statistics per split.
Push it through the layer. Y = Z @ W.T + b maps the normalized batch of shape $(N, D)$ to activations of shape $(N, H)$ in a single matmul-plus-bias. Stack a nonlinearity and another such layer and you have a neural network; the arithmetic never gets more exotic than what is in this chapter.

Everything else — the loss (an MSE or cross-entropy over the batch), the gradients, the update — is more of the same: array operations chosen so the shape of the output is exactly what the next step expects. Fluency in these primitives is fluency in ML code.
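
The two steps can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not a training loop: the train/test sizes, the per-feature scales, and the helper names fit_standardizer and linear are all made up for the example, and the weights are random rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)
N_train, N_test, D, H = 80, 20, 4, 3          # illustrative sizes
scales = np.array([1.0, 5.0, 0.1, 50.0])      # features on very different scales
X_train = rng.standard_normal((N_train, D)) * scales
X_test = rng.standard_normal((N_test, D)) * scales
W = rng.standard_normal((H, D))
b = rng.standard_normal(H)

def fit_standardizer(X):
    """Return per-feature mean and guarded std computed over axis 0."""
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    return mu, np.where(sd > 0, sd, 1.0)

def linear(Z, W, b):
    return Z @ W.T + b

mu, sd = fit_standardizer(X_train)            # statistics from training data only
Z_train = (X_train - mu) / sd
Z_test = (X_test - mu) / sd                   # reuse the training statistics

Y_train = linear(Z_train, W, b)
Y_test = linear(Z_test, W, b)

assert mu.shape == (D,) and sd.shape == (D,)
assert Y_train.shape == (N_train, H)
assert Y_test.shape == (N_test, H)
assert np.allclose(Z_train.mean(axis=0), 0.0)
print("pipeline ok")
```

Note that Z_test is generally not exactly zero-mean: it was centered with the training statistics, which is the intended behavior.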

Consolidated warnings

The four shape-and-dtype traps that cause most numerical bugs

1. (n,) vs (n, 1). A 1-D array is neither a row nor a column. Combine it with a 2-D array and it left-pads to (1, n), behaving like a row. A bias (H,) is what a batched layer wants; making it (H, 1) pits (N, H) against (H, 1), which raises when N != H and silently adds the bias along the wrong axis when N == H. When something "runs but gives garbage," print .shape on every operand.

2. Forgetting the axis in mean / std. X.mean() with no axis collapses everything to one scalar. Per-feature standardization needs axis=0 (shape (D,)); axis=1 normalizes each example against itself — almost always wrong. The axis you name is the axis that disappears, so pick it deliberately and assert the resulting shape.

3. Modifying arrays in place. A basic slice like Z[:, j] is a view onto shared memory, so writing through it mutates the original — and reusing a buffer across a loop can silently corrupt earlier results. Prefer building a fresh array (np.empty_like, then fill) unless you intend the aliasing, and never assume an operation copied when it may have returned a view.

4. Integer dtype surprises. np.array([1, 2, 3]) has an integer dtype (int64 on most platforms); then / 2 promotes to float but // 2 stays integer and truncates, and integer arrays overflow modularly with no warning. Statistics on integer data can be wrong before you even start. Cast feature matrices to float64 up front, e.g. X = np.asarray(X, dtype=float), so means, stds, and divisions behave.
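
Traps 3 and 4 are the least visible of the four, so here is a small sketch that makes each one observable. The arrays are toy values chosen so the results can be checked by eye; np.shares_memory is used to confirm when two arrays alias the same buffer.

```python
import numpy as np

# Trap 3: a basic slice is a view, so writing through it changes the original.
X = np.arange(6, dtype=float).reshape(3, 2)
col = X[:, 0]                        # view, no copy
col[:] = -1.0
assert np.shares_memory(col, X)
assert np.all(X[:, 0] == -1.0)       # X changed too

safe_col = X[:, 1].copy()            # explicit copy breaks the aliasing
safe_col[:] = 99.0
assert not np.shares_memory(safe_col, X)
assert np.all(X[:, 1] == np.array([1.0, 3.0, 5.0]))

# Trap 4: integer arithmetic truncates and wraps without warning.
k = np.array([1, 2, 3])
assert np.issubdtype(k.dtype, np.integer)
assert np.issubdtype((k / 2).dtype, np.floating)        # true division promotes
assert np.array_equal(k // 2, np.array([0, 1, 1]))      # floor division truncates

small = np.array([100], dtype=np.int8)
wrapped = small + small              # 200 does not fit in int8
print("int8 100 + 100 ->", wrapped[0])

k_float = np.asarray(k, dtype=float) # cast up front before computing statistics
assert k_float.dtype == np.float64
```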

Summary

Write each operation twice: a Python loop that reads like the definition (for understanding) and a vectorized NumPy expression (for running). Confirm they agree with np.allclose, then keep the vectorized form.
Vectorized code is typically one to two orders of magnitude faster because the inner loop runs in compiled C (matmul dispatches to tuned BLAS) instead of the Python interpreter — and it is shorter once you can read shapes fluently.
Every operation has a predictable output shape: dot product, norm, and MSE give a scalar; matmul $(m,k)\times(k,n)$ gives $(m,n)$; the distance matrix is $(N, N)$; a batched layer $XW^\top + \mathbf{b}$ is $(N, H)$. Assert it.
Broadcasting turns per-pair and per-feature loops into single expressions: X[:, None, :] - X[None, :, :] for distances, X - mu for centering. It fails silently when shapes are accidentally compatible, so assert output shapes.
The recurring bug classes are (n,) vs (n, 1), a missing or wrong axis= in mean/std, in-place mutation through a view, and integer-dtype truncation or overflow. Seed with np.random.default_rng(0) for reproducibility and cast to float64 before computing statistics.

Active recall

Answer from memory before checking the lesson:

Write the dot product of $\mathbf{a}, \mathbf{b} \in \mathbb{R}^n$ as an explicit loop and as one NumPy expression. What shape is the result?
For $A$ of shape (3, 4) and $B$ of shape (4, 2), what shape is A @ B? Which dimension must match, and which one disappears?
In a batched layer X @ W.T + b, what are the shapes of X, W, b, and the output? Why must the bias be (H,) and not (H, 1)?
To standardize a feature matrix X of shape (N, D), over which axis do you take the mean and std, and what shape are they? What goes wrong with the other axis?
An engineer computes MSE as np.mean((y[:, None] - yhat[None, :]) ** 2) and gets a suspiciously large number. What shape is the difference, and what single assertion would have caught the bug?

Exercises

Level A: Recall & basic calculation

Level A | Shape reasoning | ch19-A1

Shape of a matrix product

A has shape (3, 4) and B has shape (4, 2). What is the shape of A @ B? Enter it as a tuple, e.g. (r, c).

Level A | Hand calculation | ch19-A2

Dot product by hand

Compute the dot product a @ b for a = np.array([1.0, 2.0, 3.0]) and b = np.array([4.0, 5.0, 6.0]).

Level A | Hand calculation | ch19-A3

Mean squared error by hand

For targets y = np.array([3.0, 5.0]) and predictions yhat = np.array([1.0, 4.0]), compute the mean squared error $\frac{1}{N}\sum_i (y_i - \hat y_i)^2$.

Level A | Shape reasoning | ch19-A4

Shape of a per-feature mean

X has shape (100, 4) — 100 examples, 4 features. What is the shape of X.mean(axis=0)? Enter as a tuple, e.g. (k,).

Level A | Shape reasoning | ch19-A5

The other axis

Same X of shape (100, 4). What is the shape of X.mean(axis=1)? Enter as a tuple.

Level A | Equation interpretation | ch19-A6

Which expression is the dot product?

For two 1-D arrays a and b of equal length, which expression computes their dot product $\sum_i a_i b_i$ as a single scalar?

Level B: Conceptual understanding

Level B | Shape reasoning | ch19-B1

Shape of a distance matrix

X holds $N$ points as rows, shape (N, D). What is the shape of the pairwise Euclidean distance matrix $D_{\text{mat}}[i,j] = \lVert \mathbf{x}_i - \mathbf{x}_j \rVert$?

Level B | Debugging | ch19-B2

Debug a bias-shaped layer

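In a batched layer, X is (N, D), W is (H, D), and an engineer writes b = np.zeros((H, 1)), then Y = X @ W.T + b. With N = 5 and H = 3 this raises a broadcasting error; with N = H = 3 it runs and returns shape (N, H), even though a nonzero bias would be added along the wrong axis. Explain both outcomes using the broadcasting rule, and give the one-character-idea fix.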

Level B | Shape reasoning | ch19-B3

Shape of the difference block

X has shape (6, 3). What is the shape of X[:, None, :] - X[None, :, :] (the broadcast used to build a distance matrix)? Enter as a tuple, e.g. (a, b, c).

Level B | ML application | ch19-B4

Why guard the standard deviation?

Standardization computes Z = (X - mu) / sd with sd = X.std(axis=0). Explain in one or two sentences what happens if a feature column is constant, and how the guard sd = np.where(sd > 0, sd, 1.0) prevents it without distorting the data.

Level B | Equation interpretation | ch19-B5

Loop vs vectorized: what actually gets faster?

An engineer replaces a Python for-loop dot product with a @ b and sees a large speedup. Which statement best explains why?

Level C: Derivation & implementation

Level C | NumPy implementation | ch19-C1

Implement per-column standardization

Write standardize(X) for a batch X of shape (N, D) that returns (X - mu) / sd using per-column statistics (axis=0), guarding against zero standard deviation. Verify the output has zero mean and unit std per column, assert the output shape equals the input shape, then print ok.

Level C | NumPy implementation | ch19-C2

Batched linear layer: loop vs vectorized

Implement layer_loop(X, W, b) with explicit loops and layer_vec(X, W, b) as X @ W.T + b, for X of shape (N, D), W of shape (H, D), and b of shape (H,). Use a fixed seed, assert both outputs are (N, H) and agree with np.allclose, then print ok.

Level C | Debugging | ch19-C3

Debug a matmul shape error

An engineer has inputs X of shape (N, D) and weights W of shape (H, D) and writes Y = X @ W. It raises ValueError: matmul: ... (N,D) and (H,D). Explain precisely why matmul rejects these shapes, give the corrected expression, and state the resulting shape.

Level C | NumPy implementation | ch19-C4

Vectorized Euclidean distance matrix

Implement dist_matrix(X) for X of shape (N, D) returning the (N, N) matrix of pairwise Euclidean distances, using broadcasting (no Python loop over pairs). Verify the diagonal is zero and the matrix is symmetric, assert the shape, then print ok.

Level D: Research-thinking challenge

Level D | ML application | ch19-D1

When vectorization costs too much memory

The broadcast distance matrix materializes an (N, N, D) difference block. For $N = 100{,}000$ and $D = 128$ in float64, estimate that block's memory, explain why the fully vectorized one-liner becomes impractical, and describe a strategy that keeps most of the vectorization speed without allocating the full block.

Level D | Numerical experiment | ch19-D2

Choosing the step size for a central difference

A central difference $\frac{f(x+h) - f(x-h)}{2h}$ has truncation error $O(h^2)$, which suggests taking $h$ as small as possible. Yet in float64 a tiny $h$ makes the estimate worse. Explain the two competing error sources, why the total error is minimized at an intermediate $h$ (roughly $h \sim \epsilon^{1/3}$, where $\epsilon \approx 2.2\times 10^{-16}$ is machine epsilon), and what this implies for verifying analytic gradients in ML.