NumPy Foundations
Arrays, shapes, dtypes, indexing, and broadcasting
Prerequisites
Learning objectives
- Create arrays and reason precisely about shape and dtype
- Index and slice along axes, including boolean and fancy indexing
- Predict the result shape of a broadcast before running it
- Translate summation/product notation directly into vectorized NumPy
Why NumPy is the substrate of ML
Every model you will ever train spends the overwhelming majority of its runtime
doing one thing: pushing rectangular blocks of numbers through arithmetic. A
batch of images, a table of features, the weights of a layer, the gradients that
update them — all of these are n-dimensional arrays. NumPy is the library
that stores those arrays in contiguous memory and runs the arithmetic in
optimized C. PyTorch, JAX, and TensorFlow all borrow its mental model wholesale:
if you can reason fluently about NumPy shapes, you can read a forward pass in
any of them.
The single skill that separates people who fight their ML code from people who glide through it is shape reasoning — the ability to look at an operation and predict the shape of its output before running it. This chapter builds that skill from the ground up: array creation, the shape tuple, dtypes, indexing, reductions along an axis, and — the crown jewel — broadcasting.
Intuition: an array is a grid with a shape
Forget "matrix" and "tensor" for a moment. A NumPy array is just a grid of numbers plus a shape — a tuple that says how many numbers sit along each axis.
- A single number: shape (), a 0-D array (a scalar).
- A row of numbers: shape (n,), a 1-D array (a vector).
- A table with rows and columns: shape (R, C), a 2-D array.
- A batch of such tables: shape (N, R, C), a 3-D array.
The numbers themselves live in one flat block of memory; the shape is the lens that tells NumPy how to interpret that block as a grid. Two arrays with the same data but different shapes are different objects — and this is exactly where most bugs live. Reading the shape is like reading the type signature of a function: it tells you what an operation will and will not accept.
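The "one flat block, many lenses" idea can be checked directly. The sketch below is illustrative only: the variable names are made up for this example, and it relies on reshape returning a view for a freshly created contiguous array.

```python
import numpy as np

flat = np.arange(6)          # six numbers in one flat block: 0..5
grid = flat.reshape(2, 3)    # same block, read as 2 rows of 3
tall = flat.reshape(3, 2)    # same block, read as 3 rows of 2

print(flat.shape, grid.shape, tall.shape)   # (6,) (2, 3) (3, 2)

# For a contiguous array, reshape returns a view: all three share memory.
print(np.shares_memory(flat, grid))         # True

# Writing through one view is visible through the others.
grid[0, 0] = 99
print(flat[0], tall[0, 0])                  # 99 99
```

The data never moved; only the shape used to interpret it changed.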
Formal-ish definitions
The rank-mismatch case folds into the same rule: if one array has fewer axes, imagine it padded on the left with axes of size 1 until the ranks match, then apply the compatibility test (each pair of sizes must be equal, or one of them must be 1). That single sentence predicts every broadcast you will ever see.
| Symbol | Meaning | Type | Shape | Role |
|---|---|---|---|---|
| A.shape | Shape tuple: elements per axis | tuple | rank k | fixed |
| A.ndim | Rank = number of axes | integer | 1 | fixed |
| A.size | Total elements = product of shape | integer | 1 | fixed |
| A.dtype | Machine type of every element | type | n/a | fixed |
| X | A batch: N examples, D features | matrix | N×D | variable |
Small examples
Predicting a broadcast. Take a column of shape (3, 1) and a row of shape (1, 4). Align trailing axes: the last axes are 1 and 4 (one is 1, so the output is 4); the first axes are 3 and 1 (one is 1, so the output is 3). Result: (3, 4). The column is stretched across 4 columns, the row is stretched down 3 rows, and NumPy fills in every combination: an outer operation, with no explicit loop.
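A hand prediction like this can be checked without building any arrays. The sketch below uses np.broadcast_shapes, which is available in NumPy 1.20 and later; the extra shapes and the variable names are illustrative choices, not part of the lesson.

```python
import numpy as np

# Column (3, 1) against row (1, 4): every axis pair contains a 1.
grid_shape = np.broadcast_shapes((3, 1), (1, 4))
print(grid_shape)       # (3, 4)

# Rank mismatch: (7, 1) is padded on the left to (1, 7, 1) first.
padded_shape = np.broadcast_shapes((8, 1, 6), (7, 1))
print(padded_shape)     # (8, 7, 6)

# Incompatible: trailing axes 3 and 4 differ and neither is 1.
try:
    np.broadcast_shapes((2, 3), (4,))
    compatible = True
except ValueError:
    compatible = False
print(compatible)       # False
```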
A reduction over an axis. For a table A of shape (R, C), A.sum(axis=0)
adds down the rows, collapsing axis 0 and leaving shape (C,): one total
per column. A.sum(axis=1) adds across the columns, collapsing axis 1 and
leaving shape (R,): one total per row. Rule of thumb: the axis you name is
the axis that disappears.
ML use case: batches, normalization, and bias
Three of the most common lines in any training loop are pure NumPy shape mechanics:
- A batch of inputs is an array X of shape (N, D): N examples stacked along axis 0, each a D-dimensional feature vector along axis 1.
- Per-feature normalization subtracts a mean vector mu of shape (D,) from every row: X - mu. Broadcasting pads (D,) to (1, D) and stretches it down all N rows, so one subtraction rule is applied to the whole batch.
- Adding a bias to a layer's output Z of shape (N, K) is Z + b with b of shape (K,): the bias vector is broadcast across all N rows.
And the linear layer's aggregation, Z.sum(axis=1), is a reduction — summing
each row's contributions into one number per example. Batches, normalization,
bias, reductions: four ideas, all of them shape reasoning. Let us make them
concrete.
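As a rough sketch of those four ideas in one place, the snippet below builds a small random batch, centers it per feature, applies a linear layer with a bias, and reduces each row. The sizes N, D, K, the weight matrix W, and the random data are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded so the run is repeatable
N, D, K = 4, 3, 2                # hypothetical sizes for illustration

X = rng.normal(size=(N, D))      # batch: N examples, D features
mu = X.mean(axis=0)              # shape (D,): one mean per feature
X_centered = X - mu              # (N, D) - (D,) -> (N, D)

W = rng.normal(size=(D, K))      # hypothetical weight matrix
b = np.zeros(K)                  # bias, shape (K,)
Z = X_centered @ W + b           # (N, K) + (K,) -> (N, K)

row_totals = Z.sum(axis=1)       # reduction: one number per example

print(X_centered.shape, Z.shape, row_totals.shape)   # (4, 3) (4, 2) (4,)
print(np.allclose(X_centered.mean(axis=0), 0.0))     # True
```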
NumPy: creating arrays, shape, dtype
Start with creation and inspection. Notice how .shape, .ndim, and .dtype
answer three different questions about the same block of memory.
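A minimal sketch of creation and inspection follows. The array names are arbitrary, and the integer dtype printed for a depends on the platform and NumPy version (int64 is typical on 64-bit systems).

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])   # from a nested list
z = np.zeros((2, 3))                   # all zeros, float64 by default
r = np.arange(0, 10, 2)                # 0, 2, 4, 6, 8
pts = np.linspace(0.0, 1.0, 5)         # 5 evenly spaced points, endpoints included

# .shape: how many elements per axis; .ndim: how many axes; .dtype: element type
for name, arr in [("a", a), ("z", z), ("r", r), ("pts", pts)]:
    print(name, arr.shape, arr.ndim, arr.dtype)
```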
Now dtypes, casting, and the two traps that bite hardest — integer overflow and integer vs. true division:
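The following sketch illustrates promotion, explicit casting, and both traps. It sets dtypes explicitly so the results do not depend on the platform default; the variable names are illustrative.

```python
import numpy as np

ints = np.array([1, 2, 3], dtype=np.int64)
floats = np.array([1.0, 2.0, 3.0])       # float64 by default

mixed = ints + floats
print(mixed.dtype)                       # float64: int64 promoted to the wider type
print(floats.astype(np.int32))           # [1 2 3]: explicit cast to int32

# Trap 1: fixed-width integers wrap around silently in array arithmetic.
small = np.array([127], dtype=np.int8)
wrapped = small + np.int8(1)
print(wrapped)                           # [-128]

# Trap 2: / is true division, // is floor division.
true_div = ints / 2
floor_div = ints // 2
print(true_div, true_div.dtype)          # [0.5 1.  1.5] float64
print(floor_div, floor_div.dtype)        # [0 1 1] int64
print(np.array([-1]) // 2)               # [-1]: rounds toward negative infinity
```

Note that // rounds down rather than toward zero, which only shows up with negative values.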
NumPy: indexing, slicing, masks, and fancy indexing
Indexing an n-D array takes one index per axis, separated by commas.
Slices (start:stop:step) select ranges; a bare : means "all of this axis". A
boolean mask of the same shape selects the elements where it is True;
fancy indexing with an integer array selects elements by position, in any
order and with repeats.
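Here is a short sketch of the four indexing styles on one small array. The array and the particular indices are arbitrary choices for illustration.

```python
import numpy as np

A = np.arange(12).reshape(3, 4)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

print(A[0, 3])          # 3: row 0, column 3
print(A[:, 1])          # [1 5 9]: all rows, column 1
print(A[0:2, ::2])      # rows 0-1, every other column: [[0 2], [4 6]]

mask = A % 2 == 1       # boolean array, same shape as A
odds = A[mask]
print(odds)             # [ 1  3  5  7  9 11]: a mask result is always 1-D

repeated = A[[2, 2, 0]] # fancy indexing: rows 2, 2, 0, repeats allowed
print(repeated.shape)   # (3, 4)

# Basic indexing and slices give views; masks and fancy indexing give copies.
view = A[0]
view[0] = -1
print(A[0, 0])          # -1: the write went through to A
picked = A[[0]]
picked[0, 0] = 100
print(A[0, 0])          # still -1: picked is a copy
```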
NumPy: broadcasting, predicted then verified
Here is the headline demo: a column plus a row producing a grid. Read the shape table first, predict the output, then run it.
| Symbol | Meaning | Type | Shape | Role |
|---|---|---|---|---|
| column | np.arange(3).reshape(3, 1) | array | (3, 1) | input |
| row | np.arange(4).reshape(1, 4) | array | (1, 4) | input |
| last axis | 1 vs 4 → one is 1 → stretch to 4 | rule | 4 | derived |
| first axis | 3 vs 1 → one is 1 → stretch to 3 | rule | 3 | derived |
| result | column + row (outer sum) | array | (3, 4) | output |
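The prediction from the table can be verified with a few lines. The names col, row, grid, and looped are chosen for this sketch, and the explicit double loop is included only as a comparison for the broadcast result.

```python
import numpy as np

col = np.arange(3).reshape(3, 1)   # [[0], [1], [2]]
row = np.arange(4).reshape(1, 4)   # [[0, 1, 2, 3]]

grid = col + row                   # no loop: every (i, j) combination
print(grid.shape)                  # (3, 4)
print(grid)
# [[0 1 2 3]
#  [1 2 3 4]
#  [2 3 4 5]]

# The same result written as an explicit double loop, for comparison.
looped = np.empty((3, 4), dtype=grid.dtype)
for i in range(3):
    for j in range(4):
        looped[i, j] = col[i, 0] + row[0, j]
print(np.array_equal(grid, looped))   # True
```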
NumPy: reductions along an axis
A reduction collapses an axis into a single value. The axis you name
vanishes from the shape; add keepdims=True to leave a size-1 placeholder,
which is exactly what you want when the result must broadcast back against the
original.
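A small sketch of reductions with and without keepdims follows. The array and the choice of row-centering as the broadcast-back example are illustrative.

```python
import numpy as np

A = np.arange(6).reshape(2, 3)
# [[0 1 2]
#  [3 4 5]]

col_sums = A.sum(axis=0)
row_sums = A.sum(axis=1)
print(col_sums, col_sums.shape)       # [3 5 7] (3,): axis 0 is gone
print(row_sums, row_sums.shape)       # [ 3 12] (2,): axis 1 is gone
print(A.sum())                        # 15: no axis means reduce everything

kept = A.sum(axis=1, keepdims=True)
print(kept.shape)                     # (2, 1): size-1 placeholder kept

# (2, 3) - (2, 1) broadcasts row by row. Without keepdims the shapes
# would be (2, 3) and (2,), which do not broadcast.
centered = A - A.mean(axis=1, keepdims=True)
print(centered)
# [[-1.  0.  1.]
#  [-1.  0.  1.]]
```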
NumPy: reshape, newaxis, and the (n,) vs (n,1) vs (1,n) distinction
The same 3 numbers can wear three different shapes, and they broadcast
completely differently. reshape and np.newaxis (equivalently None in an
index) are how you move between them.
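The sketch below dresses one vector in all three shapes and shows how differently they combine. The values and names are arbitrary.

```python
import numpy as np

v = np.array([10, 20, 30])      # shape (3,)
col = v[:, np.newaxis]          # shape (3, 1); v[:, None] is equivalent
row = v[np.newaxis, :]          # shape (1, 3)
col2 = v.reshape(-1, 1)         # shape (3, 1); -1 means "infer this size"

print(v.shape, col.shape, row.shape, col2.shape)   # (3,) (3, 1) (1, 3) (3, 1)

print((v + v).shape)            # (3,): element-wise
print((v + col).shape)          # (3, 3): (3,) pads to (1, 3), giving an outer sum
print((col + row).shape)        # (3, 3): outer sum again
print((col + col).shape)        # (3, 1): element-wise on columns
```

The second line of sums is the one to watch: mixing a (3,) vector with a (3, 1) column raises no error and quietly produces a grid.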
Warnings
Summary
- A NumPy array is a flat block of numbers plus a shape tuple; .ndim is the rank, .size the element count. Reading the shape is reading the type.
- Every array has one dtype. Mixing promotes to the wider type; integers overflow modularly and // rounds down (floor division), both silently.
- Index with one entry per axis; slices give views, boolean masks and fancy (integer-array) indexing give copies.
- A reduction over axis=k deletes axis k from the shape; keepdims=True leaves a size-1 stub for broadcasting back.
- Broadcasting: align trailing axes, each pair must be equal or contain a 1, the size-1 axis stretches, and the output takes the larger size. Predict the output shape before running.
- Shapes (n,), (n, 1), and (1, n) are three different objects that broadcast differently; the (n,)-vs-(n, 1) confusion is the classic ML shape bug. A batch is (N, D); normalization and bias are broadcasts; aggregation is a reduction.
Active recall
Answer from memory before checking the lesson:
- An array has shape (4, 1) and another has shape (3,). Do they broadcast? If so, what is the output shape?
- For A of shape (10, 5), what shape is A.sum(axis=0)? What about A.sum(axis=1)? Which axis "disappears"?
- What is np.array([1, 2, 3]) / 2 versus np.array([1, 2, 3]) // 2, both the values and the dtypes?
- Why can adding a (1, N) array to an (N, 1) array be a bug, and what shape does it produce?
- How do you turn a shape (n,) array into a shape (n, 1) column, two different ways?
Exercises
Level A: Recall & basic calculation
Read a shape off a nested list
What is the .shape of np.array([[1, 2, 3], [4, 5, 6]])? Enter it as a tuple, e.g. (2, 3).
Rank of a 3-D array
An array has shape (2, 3, 4). What is its .ndim (its rank, the number of axes)?
Default integer dtype
On 64-bit Linux or macOS (and on 64-bit Windows from NumPy 2.0 onward), what dtype does np.array([1, 2, 3]) get? Enter the dtype name, e.g. int64.
Broadcast a column and a row
What is the output shape of np.arange(3).reshape(3, 1) + np.arange(4).reshape(1, 4)? Enter it as a tuple, e.g. (3, 4).
Shape after a sum over axis 0
For A with shape (2, 3), what is the shape of A.sum(axis=0)? Enter it as a tuple, e.g. (3,).
Integer floor division
What is np.array([1, 2, 3]) // 2? Enter the three resulting values as a, b, c.
Index a 2-D array
Let A = np.arange(12).reshape(3, 4). What single value is A[1, 2]?
Level B: Conceptual understanding
Broadcast a matrix column against a vector
What is the output shape of an operation between an array of shape (4, 1) and an array of shape (3,)? Enter it as a tuple, e.g. (4, 3).
Which pair fails to broadcast?
Which of the following pairs of shapes cannot be broadcast together?
The (n,) vs (n,1) trap
You have A of shape (5, 3) and a vector v of shape (3,). What does A - v do, and what shape is the result?
Dtype of a mixed sum
a = np.array([1, 2, 3]) (int64) and b = np.array([1.0, 2.0, 3.0]) (float64). What is the dtype of a + b?
Integer overflow
What is the value of (np.array([127], dtype=np.int8) + np.int8(1))[0]?
Level C: Derivation & implementation
Per-feature standardization by broadcasting
Given a batch X of shape (N, D), implement standardize(X) that returns (X - mu) / sigma, where mu and sigma are the per-feature (per-column) mean and standard deviation. Use axis and broadcasting, with no Python loops. Verify the output has per-feature mean 0 and std 1, then print ok.
Row-normalize with keepdims
Given a non-negative array A of shape (N, D), implement row_normalize(A) so that each row sums to 1. Use keepdims=True so the row sums broadcast back onto A. Assert each row sums to 1, then print ok.
Predict-then-verify a broadcast shape
Write broadcast_shape(s1, s2) that returns the broadcast output shape (as a tuple) of two shapes, or raises ValueError if they are incompatible — implementing the rule by hand (do not call np.broadcast_shapes). Test that broadcast_shape((2, 1, 3), (4, 3)) == (2, 4, 3) and that (3, 4) with (3,) raises, then print ok.
Boolean mask and fancy indexing
Given A = np.arange(12).reshape(3, 4), use a boolean mask to extract all entries greater than 5 into a 1-D array, and separately use fancy indexing to build a new array from rows [2, 0] in that order. Assert the mask result equals [6, 7, 8, 9, 10, 11] and the fancy result has shape (2, 4), then print ok.
Level D: Research-thinking challenge
Diagnose a silent broadcasting bug
An engineer wants element-wise squared errors between predictions pred and targets true, both length-1000 vectors, and writes err = pred[:, None] - true[None, :], then mse = (err ** 2).mean(). The code runs but the reported MSE is wrong and memory usage spikes. Explain what shape err actually has, why it runs without error, what the correct one-liner is, and what single assertion would have caught the bug immediately.
float32 vs float64 in a training pipeline
Deep-learning frameworks default model weights and activations to float32, while NumPy defaults to float64. Give two concrete reasons float32 is preferred for large-scale training, one concrete risk it introduces, and one place in a pipeline where you would deliberately switch back to float64.