Part 5 · NumPy Laboratory · Chapter 18 · 75 min

NumPy Foundations

Arrays, shapes, dtypes, indexing, and broadcasting

Learning objectives

  • Create arrays and reason precisely about shape and dtype
  • Index and slice along axes, including boolean and fancy indexing
  • Predict the result shape of a broadcast before running it
  • Translate summation/product notation directly into vectorized NumPy

Why NumPy is the substrate of ML

Every model you will ever train spends the overwhelming majority of its runtime doing one thing: pushing rectangular blocks of numbers through arithmetic. A batch of images, a table of features, the weights of a layer, the gradients that update them — all of these are n-dimensional arrays. NumPy is the library that stores those arrays in contiguous memory and runs the arithmetic in optimized C. PyTorch, JAX, and TensorFlow all borrow its mental model wholesale: if you can reason fluently about NumPy shapes, you can read a forward pass in any of them.

The single skill that separates people who fight their ML code from people who glide through it is shape reasoning — the ability to look at an operation and predict the shape of its output before running it. This chapter builds that skill from the ground up: array creation, the shape tuple, dtypes, indexing, reductions along an axis, and — the crown jewel — broadcasting.

Intuition: an array is a grid with a shape

Forget "matrix" and "tensor" for a moment. A NumPy array is just a grid of numbers plus a shape — a tuple that says how many numbers sit along each axis.

  • A single number: shape (), a 0-D array (a scalar).
  • A row of n numbers: shape (n,), a 1-D array (a vector).
  • A table with R rows and C columns: shape (R, C), a 2-D array.
  • A batch of N such tables: shape (N, R, C), a 3-D array.

The numbers themselves live in one flat block of memory; the shape is the lens that tells NumPy how to interpret that block as a grid. Two arrays with the same data but different shapes are different objects — and this is exactly where most bugs live. Reading the shape is like reading the type signature of a function: it tells you what an operation will and will not accept.
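The "shape is a lens" idea can be sketched in a few lines. This is a minimal illustration that assumes only that NumPy is installed; the variable names are illustrative. The same six numbers are read under three shapes, and np.shares_memory checks that no data was copied along the way.

```python
import numpy as np

flat = np.arange(6)          # six numbers, shape (6,)
table = flat.reshape(2, 3)   # the same numbers read as 2 rows x 3 columns
tall = flat.reshape(3, 2)    # the same numbers read as 3 rows x 2 columns

print(flat.shape, table.shape, tall.shape)   # (6,) (2, 3) (3, 2)

# Reshaping a contiguous array returns a view: the data block is shared.
print(np.shares_memory(flat, table))         # True

# Writing through one view is visible through the others.
table[0, 0] = 99
print(flat[0], tall[0, 0])                   # 99 99
```

The three names refer to three different array objects with three different shapes, yet all of them interpret one block of memory.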

Formal-ish definitions

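Broadcasting compares two shapes axis by axis, starting from the trailing (rightmost) axis. A pair of sizes is compatible when the two are equal or when one of them is 1; the size-1 axis is stretched, and the output takes the larger size. The rank-mismatch case folds into the same rule: if one array has fewer axes, imagine it padded on the left with axes of size 1 until the ranks match, then apply the compatibility test. That single rule predicts every broadcast you will ever see.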
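The padding-and-compatibility rule is mechanical enough to write down as code. The helper below, predict_broadcast_shape, is a hypothetical function written for illustration and is not part of NumPy. The final line compares it against np.broadcast_shapes, which assumes NumPy 1.20 or newer.

```python
import numpy as np

def predict_broadcast_shape(shape_a, shape_b):
    """Apply the broadcasting rule by hand (hypothetical helper, for illustration)."""
    ndim = max(len(shape_a), len(shape_b))
    # Pad the shorter shape on the left with 1s so the ranks match.
    a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)
    out = []
    for m, n in zip(a, b):
        if m == n or m == 1 or n == 1:
            # The size-1 axis stretches to match the other one.
            out.append(n if m == 1 else m)
        else:
            raise ValueError(f"incompatible axis sizes {m} and {n}")
    return tuple(out)

print(predict_broadcast_shape((3, 1), (1, 4)))   # (3, 4)
print(predict_broadcast_shape((5, 3), (3,)))     # (5, 3)
print(np.broadcast_shapes((5, 3), (3,)))         # (5, 3)
```

Calling the helper with shapes (4, 3) and (4,) raises ValueError, because the trailing sizes 3 and 4 are neither equal nor 1.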

Small examples

Predicting a broadcast. Take a column of shape (3, 1) and a row of shape (1, 4). Align trailing axes: the last axes are 1 and 4 (one is 1, so the output is 4); the first axes are 3 and 1 (one is 1, so the output is 3). Result: (3, 4). The column is stretched across 4 columns, the row is stretched down 3 rows, and NumPy fills in every (i, j) combination: an outer operation with no explicit loop.

A reduction over an axis. For a table A of shape (2, 3), A.sum(axis=0) adds down the rows, collapsing axis 0 and leaving shape (3,), one total per column. A.sum(axis=1) adds across the columns, collapsing axis 1 and leaving shape (2,), one total per row. Rule of thumb: the axis you name is the axis that disappears.

ML use case: batches, normalization, and bias

Three of the most common lines in any training loop are pure NumPy shape mechanics:

  • A batch of inputs is an array X of shape (N, D): N examples stacked along axis 0, each a D-dimensional feature vector along axis 1.
  • Per-feature normalization subtracts a mean vector μ of shape (D,) from every row: X - mu. Broadcasting pads μ to (1, D) and stretches it down all N rows, so one subtraction rule is applied to the whole batch.
  • Adding a bias to a layer's output Z of shape (N, H) is Z + b with b of shape (H,): the bias vector is broadcast across all N rows.

And the linear layer's aggregation, Z.sum(axis=1), is a reduction — summing each row's contributions into one number per example. Batches, normalization, bias, reductions: four ideas, all of them shape reasoning. Let us make them concrete.
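As a rough sketch of those four ideas in one place, the snippet below builds a small random batch and walks through centering, a bias add, and a row-wise reduction. The sizes, the random data, and the weight matrix W are assumptions made for illustration; only X, mu, Z, and b come from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, H = 4, 3, 2                      # illustrative sizes

X = rng.normal(size=(N, D))            # batch: N examples, D features each
mu = X.mean(axis=0)                    # per-feature mean, shape (D,)
X_centered = X - mu                    # (N, D) - (D,) -> (N, D)

W = rng.normal(size=(D, H))            # hypothetical layer weights
b = np.zeros(H)                        # bias, shape (H,)
Z = X_centered @ W + b                 # (N, H) + (H,) -> (N, H)

per_example = Z.sum(axis=1)            # reduction: one number per example

print(X_centered.shape, Z.shape, per_example.shape)   # (4, 3) (4, 2) (4,)
```

After centering, every column of X_centered has a mean of zero up to floating-point error, which is a quick way to confirm the broadcast ran along the intended axis.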

NumPy: creating arrays, shape, dtype

Start with creation and inspection. Notice how .shape, .ndim, and .dtype answer three different questions about the same block of memory.

creation.py
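A minimal sketch of what such a listing might contain, assuming NumPy is imported as np. The particular arrays and variable names are illustrative choices. The default integer dtype depends on the platform and NumPy version, so no dtype is asserted for the integer arrays.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])   # from a nested list: 2 rows, 3 columns
z = np.zeros((2, 3))                   # filled with 0.0, float64 by default
r = np.arange(0, 10, 2)                # like range(): 0, 2, 4, 6, 8
s = np.linspace(0.0, 1.0, 5)           # 5 evenly spaced points, endpoints included
scalar = np.array(7.0)                 # a 0-D array

# .shape: how many entries along each axis
# .ndim:  how many axes there are
# .dtype: what type every element has
for name, arr in [("a", a), ("z", z), ("r", r), ("s", s), ("scalar", scalar)]:
    print(name, arr.shape, arr.ndim, arr.dtype)
```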

Now dtypes, casting, and the two traps that bite hardest — integer overflow and integer vs. true division:

dtype.py
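A minimal sketch of the dtype behaviours named above, assuming NumPy is imported as np. The values are illustrative. The uint8 example adds two arrays of the same dtype so that the wrap-around does not depend on how a particular NumPy version treats Python scalars.

```python
import numpy as np

ints = np.array([1, 2, 3])             # integer dtype inferred from the values
floats = np.array([0.5, 0.5, 0.5])     # float64

# Mixing promotes: integer + float64 gives float64.
mixed = ints + floats
print(mixed.dtype)                     # float64

# True division always returns floats; floor division keeps integers.
true_div = ints / 2                    # values 0.5, 1.0, 1.5
floor_div = ints // 2                  # values 0, 1, 1
neg_floor = np.array([-3]) // 2        # -2: floors toward negative infinity

# Explicit casting with astype drops the fractional part (toward zero).
cast = np.array([1.9, -1.9]).astype(np.int64)   # values 1, -1

# Trap: fixed-width integers wrap around on overflow, with no error.
small = np.array([250, 251], dtype=np.uint8)
wrapped = small + np.array([10, 10], dtype=np.uint8)   # 260 -> 4, 261 -> 5
print(wrapped, wrapped.dtype)
```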

NumPy: indexing, slicing, masks, and fancy indexing

Indexing an n-D array takes one index per axis, separated by commas. Slices (start:stop:step) select ranges; a bare : means "all of this axis". A boolean mask of the same shape selects the elements where it is True; fancy indexing with an integer array selects elements by position, in any order and with repeats.

indexing.py
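A minimal sketch of the four indexing styles, assuming NumPy is imported as np. The 3-by-4 array and the variable names are illustrative. The last lines check which results share memory with the original array and which are independent copies.

```python
import numpy as np

A = np.arange(12).reshape(3, 4)   # rows: 0..3, 4..7, 8..11

element = A[1, 2]                 # one index per axis: row 1, column 2 -> 6
column = A[:, 1]                  # all rows, column 1 -> 1, 5, 9
block = A[::2, 1:3]               # rows 0 and 2, columns 1 and 2

mask = A % 2 == 0                 # boolean array, same shape as A
evens = A[mask]                   # 1-D array of the even elements

picked = A[[2, 0, 2]]             # fancy indexing: rows 2, 0, 2 (repeats allowed)

# Slices are views: writing through one changes A.
column[0] = 100
print(A[0, 1])                    # 100

# Masks and fancy indexing give copies: writing to them leaves A alone.
picked[0, 0] = -1
print(A[2, 0])                    # 8
```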

NumPy: broadcasting, predicted then verified

Here is the headline demo: a column (3, 1) plus a row (1, 4) producing a (3, 4) grid. Read the shape table first, predict the output, then run it.

broadcasting.py
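A minimal sketch of that demo, assuming NumPy 1.20 or newer for np.broadcast_shapes. The specific values (multiples of 10 in the column, 0 to 3 in the row) are chosen only so that each entry of the grid shows which row and column it came from.

```python
import numpy as np

col = np.arange(3).reshape(3, 1) * 10   # values 0, 10, 20 stacked vertically, shape (3, 1)
row = np.arange(4).reshape(1, 4)        # values 0, 1, 2, 3 laid out horizontally, shape (1, 4)

# Shape table, trailing axes aligned:
#   col : (3, 1)
#   row : (1, 4)
#   out : (3, 4)
predicted = np.broadcast_shapes(col.shape, row.shape)

grid = col + row                        # no explicit loop
print(predicted, grid.shape)            # (3, 4) (3, 4)
print(grid)

# Every entry is an (i, j) combination: grid[i, j] == col[i, 0] + row[0, j]
```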

NumPy: reductions along an axis

A reduction collapses an axis into a single value. The axis you name vanishes from the shape; add keepdims=True to leave a size-1 placeholder, which is exactly what you want when the result must broadcast back against the original.

reductions.py
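A minimal sketch of axis reductions with and without keepdims, assuming NumPy is imported as np. The 2-by-3 table and the row-normalization step are illustrative choices.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])                        # shape (2, 3)

col_totals = A.sum(axis=0)                       # axis 0 vanishes: shape (3,)
row_totals = A.sum(axis=1)                       # axis 1 vanishes: shape (2,)
row_totals_kept = A.sum(axis=1, keepdims=True)   # size-1 stub stays: shape (2, 1)

# With keepdims, the result broadcasts back against A:
# (2, 3) / (2, 1) -> (2, 3), each row divided by its own total.
fractions = A / row_totals_kept

# Without keepdims, (2, 3) / (2,) compares trailing sizes 3 and 2 and fails.
try:
    A / row_totals
except ValueError as err:
    print("without keepdims:", err)

print(col_totals.shape, row_totals.shape, row_totals_kept.shape)   # (3,) (2,) (2, 1)
```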

NumPy: reshape, newaxis, and the (n,) vs (n,1) vs (1,n) distinction

The same 3 numbers can wear three different shapes, and they broadcast completely differently. reshape and np.newaxis (equivalently None in an index) are how you move between them.

reshape.py
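A minimal sketch of moving between the three shapes, assuming NumPy is imported as np. The values and variable names are illustrative. The printed shapes show how differently the same three numbers broadcast depending on which shape they wear.

```python
import numpy as np

v = np.array([1, 2, 3])          # shape (3,)
col = v.reshape(3, 1)            # shape (3, 1); v.reshape(-1, 1) also works
col2 = v[:, np.newaxis]          # shape (3, 1), built by indexing instead
row = v[np.newaxis, :]           # shape (1, 3); v[None, :] is equivalent

print(v.shape, col.shape, col2.shape, row.shape)   # (3,) (3, 1) (3, 1) (1, 3)

# Same numbers, different broadcasts:
print((v + v).shape)             # (3,)   elementwise
print((v + row).shape)           # (1, 3) v is padded to (1, 3)
print((v + col).shape)           # (3, 3) an outer sum, often unintended
print((row + col).shape)         # (3, 3)
```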

Warnings

Summary

  • A NumPy array is a flat block of numbers plus a shape tuple; .ndim is the rank, .size the element count. Reading the shape is reading the type.
  • Every array has one dtype. Mixing promotes to the wider type; fixed-width integers wrap around on overflow and // floors toward negative infinity, keeping integer inputs as integers. Both happen silently.
  • Index with one entry per axis; slices give views, boolean masks and fancy (integer-array) indexing give copies.
  • A reduction over axis=k deletes axis k from the shape; keepdims=True leaves a size-1 stub for broadcasting back.
  • Broadcasting: align trailing axes, each pair must be equal or contain a 1, the size-1 axis stretches, and the output takes the larger size. Predict the output shape before running.
  • Shapes (n,), (n, 1), and (1, n) are three different objects that broadcast differently — the (n,)-vs-(n,1) confusion is the classic ML shape bug. A batch is (N, D); normalization and bias are broadcasts; aggregation is a reduction.

Active recall

Answer from memory before checking the lesson:

  1. An array has shape (4, 1) and another has shape (3,). Do they broadcast? If so, what is the output shape?
  2. For A of shape (10, 5), what shape is A.sum(axis=0)? What about A.sum(axis=1)? Which axis "disappears"?
  3. What is np.array([1, 2, 3]) / 2 versus np.array([1, 2, 3]) // 2 — both the values and the dtypes?
  4. Why can adding a (1, N) array to an (N, 1) array be a bug, and what shape does it produce?
  5. How do you turn a shape (n,) array into a shape (n, 1) column, two different ways?

Exercises

Level A · Recall & basic calculation

Level B · Conceptual understanding

Level C · Derivation & implementation

Level D · Research-thinking challenge