NumPy Foundations

Why NumPy is the substrate of ML

Every model you will ever train spends the overwhelming majority of its runtime doing one thing: pushing rectangular blocks of numbers through arithmetic. A batch of images, a table of features, the weights of a layer, the gradients that update them — all of these are n-dimensional arrays. NumPy is the library that stores those arrays in contiguous memory and runs the arithmetic in optimized C. PyTorch, JAX, and TensorFlow all borrow its mental model wholesale: if you can reason fluently about NumPy shapes, you can read a forward pass in any of them.

The single skill that separates people who fight their ML code from people who glide through it is shape reasoning — the ability to look at an operation and predict the shape of its output before running it. This chapter builds that skill from the ground up: array creation, the shape tuple, dtypes, indexing, reductions along an axis, and — the crown jewel — broadcasting.

Intuition: an array is a grid with a shape

Forget "matrix" and "tensor" for a moment. A NumPy array is just a grid of numbers plus a shape — a tuple that says how many numbers sit along each axis.

A single number: shape () — a 0-D array (a scalar).
A row of $n$ numbers: shape (n,) — a 1-D array (a vector).
A table with $R$ rows and $C$ columns: shape (R, C) — a 2-D array.
A batch of $N$ such tables: shape (N, R, C) — a 3-D array.

The numbers themselves live in one flat block of memory; the shape is the lens that tells NumPy how to interpret that block as a grid. Two arrays with the same data but different shapes are different objects — and this is exactly where most bugs live. Reading the shape is like reading the type signature of a function: it tells you what an operation will and will not accept.
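
The "flat block plus a lens" picture can be checked directly. The sketch below is illustrative only (the variable names are made up for this example): it views one block of six numbers through several shapes and confirms that a reshape of a contiguous array shares memory with the original.

```python
import numpy as np

# One flat block of six numbers.
flat = np.arange(6)                  # shape (6,)

# Other lenses on the same block; reshaping a contiguous array returns a view.
table = flat.reshape(2, 3)           # shape (2, 3)
batch = flat.reshape(1, 2, 3)        # shape (1, 2, 3)
scalar = np.array(7)                 # shape (): a 0-D array

print(flat.shape, table.shape, batch.shape, scalar.shape)
print("shares memory:", np.shares_memory(flat, table))   # True

# Writing through one lens is visible through the others.
table[0, 0] = 99
print("flat[0] after write:", flat[0])                   # 99
```

Same data, three shapes, one block of memory: only the interpretation changes.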

Formal-ish definitions

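Two shapes are broadcast-compatible when, comparing axes from the last one backward, each pair of sizes is either equal or contains a $1$; a size-$1$ axis is stretched, and the output takes the larger size. The rank-mismatch case folds into the same rule: if one array has fewer axes, imagine it padded on the left with axes of size $1$ until the ranks match, then apply the compatibility test. Those two sentences predict the output shape of every broadcast.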
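
The compatibility test (pad the shorter shape on the left with 1s, then require each pair of sizes to be equal or to contain a 1) can be tried on bare shape tuples without allocating any data. A minimal sketch, assuming NumPy 1.20 or newer, where np.broadcast_shapes is available; the example shapes are arbitrary:

```python
import numpy as np

# Compatible: (3,) is padded on the left to (1, 3), then stretched to (2, 3).
ok_2d = np.broadcast_shapes((2, 3), (3,))
print(ok_2d)                          # (2, 3)

# Compatible across three axes: (7, 1) is padded to (1, 7, 1).
ok_3d = np.broadcast_shapes((8, 1, 6), (7, 1))
print(ok_3d)                          # (8, 7, 6)

# Incompatible: trailing sizes 3 and 2 differ and neither is 1.
try:
    np.broadcast_shapes((2, 3), (2,))
    compatible = True
except ValueError as exc:
    compatible = False
    print("cannot broadcast:", exc)
```

Predicting each result by hand before reading the printed tuple is the habit this chapter is trying to build.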

Symbol	Meaning	Type	Shape	Role
$(d_0, d_1, \ldots)$	Shape tuple: elements per axis	tuple	rank k	fixed
$\texttt{ndim}$	Rank = number of axes	integer	()	fixed
$\texttt{size}$	Total elements = product of shape	integer	()	fixed
$\texttt{dtype}$	Machine type of every element	type	—	fixed
$(N, D)$	A batch: N examples, D features	matrix	N×D	variable

Small examples

Predicting a broadcast. Take a column of shape $(3, 1)$ and a row of shape $(1, 4)$. Align trailing axes: the last axes are $1$ and $4$ (one is $1$, so the output is $4$); the first axes are $3$ and $1$ (one is $1$, so the output is $3$). Result: $(3, 4)$. The column is stretched across $4$ columns, the row is stretched down $3$ rows, and NumPy fills in every $(i, j)$ combination: an outer operation, with no explicit loop.

A reduction over an axis. For a table $A$ of shape $(2, 3)$, A.sum(axis=0) adds down the rows, collapsing axis $0$ and leaving shape $(3,)$: one total per column. A.sum(axis=1) adds across the columns, collapsing axis $1$ and leaving shape $(2,)$: one total per row. Rule of thumb: the axis you name is the axis that disappears.

ML use case: batches, normalization, and bias

Three of the most common lines in any training loop are pure NumPy shape mechanics:

A batch of inputs is an array $X$ of shape $(N, D)$: $N$ examples stacked along axis $0$, each a $D$-dimensional feature vector along axis $1$.
Per-feature normalization subtracts a mean vector $\mu$ of shape $(D,)$ from every row: X - mu. Broadcasting pads $\mu$ to $(1, D)$ and stretches it down all $N$ rows, so one subtraction rule applies to the whole batch.
Adding a bias to a layer's output $Z$ of shape $(N, H)$ is Z + b with $b$ of shape $(H,)$: the bias vector is broadcast across all $N$ rows.

A per-example aggregation such as Z.sum(axis=1) is a reduction: it sums each row's entries into one number per example. Batches, normalization, bias, reductions: four ideas, all of them shape reasoning. Let us make them concrete.
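
As a first concrete instance, here is a small sketch of the bias add and the per-example reduction. The sizes N=4 and H=2, the random stand-in for Z, and the bias values are all made up for illustration; no real layer is being computed.

```python
import numpy as np

rng = np.random.default_rng(0)
N, H = 4, 2                            # hypothetical batch size and layer width
Z = rng.standard_normal((N, H))        # stand-in for a layer's output, shape (N, H)
b = np.array([10.0, 20.0])             # bias, shape (H,)

out = Z + b                            # (4, 2) + (2,): b padded to (1, 2), then stretched
per_example = Z.sum(axis=1)            # axis 1 disappears, leaving shape (4,)

print("out shape        :", out.shape)            # (4, 2)
print("per_example shape:", per_example.shape)    # (4,)
print("same bias on every row:", np.allclose(out - Z, b))
```

The bias is never copied N times in user code; broadcasting applies the same (H,) vector to each row.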

NumPy: creating arrays, shape, dtype

Start with creation and inspection. Notice how .shape, .ndim, and .dtype answer three different questions about the same block of memory.

creation.py

import numpy as np

# From a Python list. Nesting depth becomes rank.
v = np.array([1, 2, 3])              # 1-D
M = np.array([[1, 2, 3],
              [4, 5, 6]])            # 2-D

print("v.shape:", v.shape)           # (3,)
print("v.ndim :", v.ndim)            # 1
print("M.shape:", M.shape)           # (2, 3)
print("M.ndim :", M.ndim)            # 2
print("M.size :", M.size)            # 6

# Common constructors -- shape is passed as a tuple.
zeros = np.zeros((2, 4))             # all 0.0
ones  = np.ones((3,))                # a length-3 vector of 1.0
seq   = np.arange(6)                 # [0 1 2 3 4 5]
grid  = np.arange(6).reshape(2, 3)   # same data, shape (2, 3)

assert zeros.shape == (2, 4)
assert grid.shape == (2, 3)
assert grid.size == v.size * 2

print("token:", "creation ok")

Now dtypes, casting, and the two traps that bite hardest — integer overflow and integer vs. true division:

dtype.py

import numpy as np

ints   = np.array([1, 2, 3])         # inferred int64 (int32 on Windows before NumPy 2.0)
floats = np.array([1.0, 2, 3])       # one float makes the whole array float64
print("ints  dtype:", ints.dtype)    # int64
print("floats dtype:", floats.dtype) # float64

# Mixing dtypes promotes to the wider type.
mixed = ints + floats                # int64 + float64 -> float64
print("mixed dtype:", mixed.dtype)   # float64

# Overflow: int8 wraps around at 127 -> -128 (modular arithmetic, no error!).
small = np.array([127], dtype=np.int8)
print("overflow:", (small + np.int8(1))[0])   # -128

# True division always yields float; floor division keeps integers.
print("true  :", ints / 2)           # [0.5 1.  1.5]  float64
print("floor :", ints // 2)          # [0 1 1]        int64

# Explicit casting truncates toward zero for float -> int.
print("cast  :", np.array([1.9, -1.9]).astype(np.int64))  # [ 1 -1]

assert mixed.dtype == np.float64
assert (small + np.int8(1))[0] == -128
print("token:", "dtype ok")
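
One detail the listing above does not exercise: floor division and a float-to-int cast agree on positive numbers but differ on negative ones, because // rounds toward negative infinity while astype truncates toward zero. A short sketch with made-up values:

```python
import numpy as np

signed = np.array([-3, 3])

floored   = signed // 2                      # floor: rounds toward negative infinity
truncated = (signed / 2).astype(np.int64)    # true division, then cast toward zero

print("floor division:", floored)            # [-2  1]
print("divide + cast :", truncated)          # [-1  1]
```

If a pipeline mixes the two idioms, negative inputs are where the results diverge.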

NumPy: indexing, slicing, masks, and fancy indexing

Indexing an $n$-D array takes one index per axis, separated by commas. Slices (start:stop:step) select ranges; a bare : means "all of this axis". A boolean mask of the same shape selects the elements where it is True; fancy indexing with an integer array selects elements by position, in any order and with repeats.

indexing.py

import numpy as np

A = np.arange(12).reshape(3, 4)
# A =
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

# One index per axis: A[row, col].
print("A[1, 2]:", A[1, 2])           # 6
print("row 0  :", A[0])              # [0 1 2 3]
print("col 3  :", A[:, 3])           # [ 3  7 11]  shape (3,)

# Slicing along axes. A[rows, cols].
block = A[0:2, 1:3]                  # rows 0-1, cols 1-2
print("block shape:", block.shape)   # (2, 2)

# Boolean mask: same-shape True/False grid, returns a 1-D array of matches.
mask = A % 2 == 0                    # even entries
print("evens:", A[mask])             # [ 0  2  4  6  8 10]

# Fancy (integer-array) indexing: pick rows 2, 0, 0 in that order.
print("fancy rows:", A[[2, 0, 0]].shape)   # (3, 4), row 0 repeated

assert A[:, 3].shape == (3,)
assert A[mask].sum() == 30
print("token:", "index ok")
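
The indexing styles also differ in what they return: a basic slice is a view onto the original memory, while boolean-mask and fancy indexing produce independent copies. The sketch below (variable names are illustrative) checks this with np.shares_memory and with a write through each result.

```python
import numpy as np

A = np.arange(12).reshape(3, 4)

window = A[0:2, 1:3]                 # basic slice: a view onto A's memory
picked = A[[0, 1]]                   # fancy indexing: an independent copy
large  = A[A > 5]                    # boolean mask: an independent copy

print("slice shares memory:", np.shares_memory(A, window))   # True
print("fancy shares memory:", np.shares_memory(A, picked))   # False
print("mask  shares memory:", np.shares_memory(A, large))    # False

window[0, 0] = -1                    # writes through to A[0, 1]
picked[0, 0] = 100                   # leaves A untouched
print("A[0, 1]:", A[0, 1])           # -1
print("A[0, 0]:", A[0, 0])           # 0
```

When a slice must not alias its source, calling .copy() on it is the usual way to detach it.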

NumPy: broadcasting, predicted then verified

Here is the headline demo: a column $(3, 1)$ plus a row $(1, 4)$ producing a $(3, 4)$ grid. Read the shape table first, predict the output, then run it.

Symbol	Meaning	Type	Shape	Role
$\text{column}$	np.arange(3).reshape(3, 1)	array	(3, 1)	input
$\text{row}$	np.arange(4).reshape(1, 4)	array	(1, 4)	input
$\text{last axis}$	1 vs 4 → one is 1 → stretch to 4	rule	4	derived
$\text{first axis}$	3 vs 1 → one is 1 → stretch to 3	rule	3	derived
$\text{result}$	column + row (outer sum)	array	(3, 4)	output

broadcasting.py

import numpy as np

col = np.arange(3).reshape(3, 1)     # shape (3, 1): [[0],[1],[2]]
row = np.arange(4).reshape(1, 4)     # shape (1, 4): [[0,1,2,3]]

# Predict: trailing axes 1 vs 4 -> 4; leading axes 3 vs 1 -> 3  =>  (3, 4).
out = col + row
print("out shape:", out.shape)       # (3, 4)
print(out)
# [[0 1 2 3]
#  [1 2 3 4]
#  [2 3 4 5]]  -- entry [i, j] = i + j

# ML flavor: per-feature normalization of a batch (N=4, D=3).
rng = np.random.default_rng(0)
X   = rng.standard_normal((4, 3))    # batch: (N, D) = (4, 3)
mu  = X.mean(axis=0)                 # per-feature mean, shape (3,)
Xc  = X - mu                         # (4,3) - (3,) -> mu padded to (1,3), stretched
print("X shape :", X.shape)          # (4, 3)
print("mu shape:", mu.shape)         # (3,)
print("centered mean ~0:", np.allclose(Xc.mean(axis=0), 0.0))

assert out.shape == (3, 4)
assert Xc.shape == (4, 3)
print("token:", "bcast ok")

NumPy: reductions along an axis

A reduction collapses an axis into a single value. The axis you name vanishes from the shape; add keepdims=True to leave a size-$1$ placeholder, which is exactly what you want when the result must broadcast back against the original.

reductions.py

import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])            # shape (2, 3)

total = A.sum()                      # no axis: everything -> scalar
down  = A.sum(axis=0)                # collapse axis 0 (rows) -> shape (3,)
across = A.sum(axis=1)               # collapse axis 1 (cols) -> shape (2,)

print("total  :", total)             # 21
print("axis=0 :", down, down.shape)  # [5 7 9] (3,)
print("axis=1 :", across, across.shape)  # [ 6 15] (2,)

# keepdims leaves a size-1 axis so the result broadcasts back onto A.
row_sums = A.sum(axis=1, keepdims=True)   # shape (2, 1)
print("keepdims shape:", row_sums.shape)  # (2, 1)
frac = A / row_sums                        # (2,3) / (2,1) -> row-normalized
print("rows sum to 1:", np.allclose(frac.sum(axis=1), 1.0))

assert down.shape == (3,)
assert across.shape == (2,)
assert row_sums.shape == (2, 1)
print("token:", "reduce ok")
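
To see why keepdims matters, it helps to drop it on purpose. In this sketch (the arrays are made up for illustration) the non-square case fails loudly, while the square case broadcasts without complaint and quietly divides by the wrong sums.

```python
import numpy as np

# Non-square case: forgetting keepdims raises, because 3 vs 2 is incompatible.
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])              # shape (2, 3)
try:
    A / A.sum(axis=1)                        # (2, 3) / (2,)
    raised = False
except ValueError as exc:
    raised = True
    print("non-square:", exc)

# Square case: the shapes are compatible, so the mistake is silent.
S = np.array([[1.0, 1.0],
              [2.0, 6.0]])                   # shape (2, 2), row sums 2 and 8
wrong = S / S.sum(axis=1)                    # (2, 2) / (2,): column j divided by row sum j
right = S / S.sum(axis=1, keepdims=True)     # (2, 2) / (2, 1): row i divided by row sum i

print("wrong row sums:", wrong.sum(axis=1))  # 0.625 and 1.75, not 1
print("right row sums:", right.sum(axis=1))  # both 1
```

An assertion on the reduced shape, such as checking that it is (N, 1), catches the silent variant before it reaches a loss function.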

NumPy: reshape, newaxis, and the (n,) vs (n,1) vs (1,n) distinction

The same 3 numbers can wear three different shapes, and they broadcast completely differently. reshape and np.newaxis (equivalently None in an index) are how you move between them.

reshape.py

import numpy as np

v = np.arange(3)                     # shape (3,)   -- a bare 1-D array
col = v[:, np.newaxis]               # shape (3, 1) -- a column
row = v[np.newaxis, :]               # shape (1, 3) -- a row

print("v  :", v.shape)               # (3,)
print("col:", col.shape)             # (3, 1)
print("row:", row.shape)             # (1, 3)

# The famous trap: column + row is an OUTER sum, shape (3, 3).
outer = col + row                    # (3,1) + (1,3) -> (3, 3)
print("col + row shape:", outer.shape)   # (3, 3)

# But 1-D + 1-D is an ELEMENT-WISE sum, shape (3,).
elemwise = v + v                     # (3,) + (3,) -> (3,)
print("v + v shape    :", elemwise.shape)  # (3,)

# reshape(-1, 1) infers the missing dimension: turn any vector into a column.
print("reshape:", v.reshape(-1, 1).shape)  # (3, 1)

assert col.shape == (3, 1)
assert row.shape == (1, 3)
assert outer.shape == (3, 3)
assert elemwise.shape == (3,)
print("token:", "reshape ok")
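
The same shape distinction explains a common silent bug: subtracting a (n,) target vector from a (n, 1) prediction column. The sketch below uses hypothetical pred and y arrays with n=4; in real code the column shape typically comes from a model that returns one output per row.

```python
import numpy as np

n = 4
pred = np.arange(n, dtype=np.float64).reshape(n, 1)   # hypothetical model output, shape (4, 1)
y    = np.arange(n, dtype=np.float64)                  # targets, shape (4,)

bad  = pred - y                       # (4, 1) - (4,) -> (4, 4): every pred against every target
good = pred.reshape(-1) - y           # (4,) - (4,) -> (4,): element-wise

print("bad shape :", bad.shape)       # (4, 4)
print("good shape:", good.shape)      # (4,)
print("bad  MSE:", (bad ** 2).mean())     # 2.5, although pred equals y
print("good MSE:", (good ** 2).mean())    # 0.0
```

Nothing raises in the bad case, which is why checking pred.shape == y.shape before computing a loss is a cheap safeguard.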

Warnings

Summary

A NumPy array is a flat block of numbers plus a shape tuple; .ndim is the rank, .size the element count. Reading the shape is reading the type.
Every array has one dtype. Mixing promotes to the wider type; fixed-width integers overflow modularly, // floors (rounding toward negative infinity), and a float-to-int cast truncates toward zero, all silently.
Index with one entry per axis; slices give views, boolean masks and fancy (integer-array) indexing give copies.
A reduction over axis=k deletes axis $k$ from the shape; keepdims=True leaves a size-$1$ stub for broadcasting back.
Broadcasting: align trailing axes, each pair must be equal or contain a $1$, the size-$1$ axis stretches, and the output takes the larger size. Predict the output shape before running.
Shapes (n,), (n, 1), and (1, n) are three different objects that broadcast differently — the (n,)-vs-(n,1) confusion is the classic ML shape bug. A batch is (N, D); normalization and bias are broadcasts; aggregation is a reduction.

Active recall

Answer from memory before checking the lesson:

An array has shape (4, 1) and another has shape (3,). Do they broadcast? If so, what is the output shape?
For $A$ of shape (10, 5), what shape is A.sum(axis=0)? What about A.sum(axis=1)? Which axis "disappears"?
What is np.array([1, 2, 3]) / 2 versus np.array([1, 2, 3]) // 2 — both the values and the dtypes?
Why can adding a (1, N) array to an (N, 1) array be a bug, and what shape does it produce?
How do you turn a shape (n,) array into a shape (n, 1) column, two different ways?

Exercises

Level A: Recall & basic calculation

Level A · Shape reasoning · ch18-A1

Read a shape off a nested list

What is the .shape of np.array([[1, 2, 3], [4, 5, 6]])? Enter it as a tuple, e.g. (2, 3).

Level A · Shape reasoning · ch18-A2

Rank of a 3-D array

An array has shape (2, 3, 4). What is its .ndim (its rank, the number of axes)?

Level A · Equation interpretation · ch18-A3

Default integer dtype

On a 64-bit platform, what dtype does np.array([1, 2, 3]) get? Enter the dtype name, e.g. int64.

Level A · Shape reasoning · ch18-A4

Broadcast a column and a row

What is the output shape of np.arange(3).reshape(3, 1) + np.arange(4).reshape(1, 4)? Enter it as a tuple, e.g. (3, 4).

Level A · Shape reasoning · ch18-A5

Shape after a sum over axis 0

For A with shape (2, 3), what is the shape of A.sum(axis=0)? Enter it as a tuple, e.g. (3,).

Level A · Hand calculation · ch18-A6

Integer floor division

What is np.array([1, 2, 3]) // 2? Enter the three resulting values as a, b, c.

Level A · Hand calculation · ch18-A7

Index a 2-D array

Let A = np.arange(12).reshape(3, 4). What single value is A[1, 2]?

Level B: Conceptual understanding

Level B · Shape reasoning · ch18-B1

Broadcast a matrix column against a vector

What is the output shape of an operation between an array of shape (4, 1) and an array of shape (3,)? Enter it as a tuple, e.g. (4, 3).

Level B · Shape reasoning · ch18-B2

Which pair fails to broadcast?

Which of the following pairs of shapes cannot be broadcast together?

Level B · Shape reasoning · ch18-B3

The (n,) vs (n,1) trap

You have A of shape (5, 3) and a vector v of shape (3,). What does A - v do, and what shape is the result?

Level B · Equation interpretation · ch18-B4

Dtype of a mixed sum

a = np.array([1, 2, 3]) (int64) and b = np.array([1.0, 2.0, 3.0]) (float64). What is the dtype of a + b?

Level B · Equation interpretation · ch18-B5

Integer overflow

What is the value of (np.array([127], dtype=np.int8) + np.int8(1))[0]?

Level C: Derivation & implementation

Level C · NumPy implementation · ch18-C1

Per-feature standardization by broadcasting

Given a batch X of shape (N, D), implement standardize(X) that returns (X - mu) / sigma, where mu and sigma are the per-feature (per-column) mean and standard deviation. Use axis and broadcasting, with no Python loops. Verify the output has per-feature mean $\approx 0$ and std $\approx 1$, then print ok.

Level C · NumPy implementation · ch18-C2

Row-normalize with keepdims

Given a non-negative array A of shape (N, D), implement row_normalize(A) so that each row sums to 1. Use keepdims=True so the row sums broadcast back onto A. Assert each row sums to 1, then print ok.

Level C · NumPy implementation · ch18-C3

Predict-then-verify a broadcast shape

Write broadcast_shape(s1, s2) that returns the broadcast output shape (as a tuple) of two shapes, or raises ValueError if they are incompatible — implementing the rule by hand (do not call np.broadcast_shapes). Test that broadcast_shape((2, 1, 3), (4, 3)) == (2, 4, 3) and that (3, 4) with (3,) raises, then print ok.

Level C · NumPy implementation · ch18-C4

Boolean mask and fancy indexing

Given A = np.arange(12).reshape(3, 4), use a boolean mask to extract all entries greater than 5 into a 1-D array, and separately use fancy indexing to build a new array from rows [2, 0] in that order. Assert the mask result equals [6, 7, 8, 9, 10, 11] and the fancy result has shape (2, 4), then print ok.

Level D: Research-thinking challenge

Level D · Debugging · ch18-D1

Diagnose a silent broadcasting bug

An engineer wants element-wise squared errors between predictions pred and targets true, both length-1000 vectors, and writes err = pred[:, None] - true[None, :], then mse = (err ** 2).mean(). The code runs but the reported MSE is wrong and memory usage spikes. Explain what shape err actually has, why it runs without error, what the correct one-liner is, and what single assertion would have caught the bug immediately.

Level D · ML application · ch18-D2

float32 vs float64 in a training pipeline

Deep-learning frameworks default model weights and activations to float32, while NumPy defaults to float64. Give two concrete reasons float32 is preferred for large-scale training, one concrete risk it introduces, and one place in a pipeline where you would deliberately switch back to float64.