Mathematical Foundations for Machine Learning and AI

Level A | Hand calculation | ch01-A1

Evaluate with precedence

Evaluate the expression $10 - 2 \times 3^2$ by hand, applying operator precedence.

Level A | Hand calculation | ch01-A2

The unary minus trap

Evaluate $-3^2$ . (This is the classic precedence trap — the exponent binds tighter than the leading minus.)
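
Python's `**` operator follows the same convention as the written notation, so it can serve as a quick sanity check on a hand answer. The snippet below is an illustrative sketch only; the variable names are made up for this example.

```python
# Exponentiation binds tighter than a leading unary minus.
without_parens = -3 ** 2     # parsed as -(3 ** 2)
with_parens = (-3) ** 2      # here the minus sign is part of the base

print(without_parens)  # -9
print(with_parens)     # 9
```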

Level A | Hand calculation | ch01-A3

Subtract two fractions

Compute $\dfrac{2}{3} - \dfrac{1}{6}$ and give the result as a decimal.

Level A | Equation interpretation | ch01-A4

Smallest set containing a number

What is the smallest of the sets $\mathbb{N}, \mathbb{Z}, \mathbb{Q}, \mathbb{R}$ that contains the number $-\dfrac{7}{2}$ ?

Level A | Shape reasoning | ch01-A5

Infer the NumPy dtype

In NumPy, what is the dtype of np.array([1, 2, 3])? Assume a typical 64-bit platform, since the default integer dtype is platform-dependent. Answer with the exact dtype name (e.g. int64 or float64).

Level A | Hand calculation | ch01-A6

Floor division

What does the floor-division expression 17 // 5 evaluate to in Python/NumPy?

Level B | ML application | ch01-B1

Which quantity must be a float?

In a training loop, which of the following quantities should be stored as a float rather than an integer?

Level B | ML application | ch01-B2

Integers cannot hold weights

Explain, in one or two sentences, why initializing a weight array with dtype=int and then applying a fractional gradient update leaves the weights unchanged.

Level B | Shape reasoning | ch01-B3

Default dtype of zeros

What is the dtype of the array produced by np.zeros(3) (with no dtype argument)?

Level B | Equation interpretation | ch01-B4

Reading a subscripted sum

In the expression $\displaystyle\sum_{i=1}^{n} w_i x_i$ , the index $i$ and the symbols $w_i, x_i$ come from different number worlds. Which set does the index $i$ range over, and which set do the values $w_i, x_i$ belong to?

Level C | NumPy implementation | ch01-C1

Fix the accuracy bug

A metric function reports $0$ accuracy when it should report $0.75$ . The buggy line is acc = correct // total with correct = 3, total = 4. Write a corrected accuracy(correct, total) that returns a float, verify it gives 0.75 for these inputs, and print ok.
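
One possible fix is sketched below, assuming plain Python numbers as inputs. It replaces floor division with true division, which always returns a float in Python 3. The names `buggy` and `fixed` are illustrative only.

```python
def accuracy(correct, total):
    """Return the fraction of correct predictions as a float."""
    return correct / total   # true division, not floor division

buggy = 3 // 4               # floor division discards the fractional part
fixed = accuracy(3, 4)

assert buggy == 0
assert fixed == 0.75
print("ok")
```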

Level C | NumPy implementation | ch01-C2

Hand-evaluate, then check in NumPy

By hand, evaluate $E = 2 + 3 \times 4^2 - \dfrac{10}{2}$ . Then confirm the value in NumPy, print the dtype you observe for the result, and print ok.

Level C | Numerical experiment | ch01-C3

When 0.1 + 0.2 is not 0.3

Reals in $\mathbb{R}$ are exact, but float64 is not. Write NumPy code showing that 0.1 + 0.2 == 0.3 is False, then show that np.isclose(0.1 + 0.2, 0.3) is True, and print ok. In a comment, explain in one line why exact equality fails.
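
A minimal sketch of the experiment, assuming NumPy is installed. The variable names are illustrative, and the default tolerances of `np.isclose` are used.

```python
import numpy as np

total = 0.1 + 0.2
# 0.1, 0.2 and 0.3 have no finite binary expansion, so each is rounded to
# the nearest float64 and the rounding errors do not cancel.
exact_equal = (total == 0.3)
close_enough = bool(np.isclose(total, 0.3))

assert exact_equal is False
assert close_enough is True
print("ok")
```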

Level D | Paper-reading practice | ch01-D1

Why dtype choice is a modeling decision

Modern deep learning increasingly trains in float16 / bfloat16 (16-bit) rather than float64. Give one concrete benefit of the lower-precision float, one concrete risk it introduces, and explain why an integer dtype is nonetheless still the right choice for token IDs and class labels. Tie each point back to the $\mathbb{Z}$-labels-versus-$\mathbb{R}$-measures distinction.

Level A | Hand calculation | ch02-A1

Solve a two-step linear equation

Solve $3x + 5 = 20$ for $x$ .

Level A | Hand calculation | ch02-A2

Variables on both sides

Solve $5x - 3 = 2x + 9$ for $x$ .

Level A | Hand calculation | ch02-A3

Solve a linear inequality

Solve $2x - 6 > 0$ and give the threshold value of $x$ (the boundary of the solution interval).

Level A | Hand calculation | ch02-A4

The sign-flip rule

Solve $-2x > 6$ and give the boundary value of $x$ . (Remember what happens to the relation when you divide by a negative.)

Level A | Equation interpretation | ch02-A5

Rearrange the line for the intercept

The line is $y = mx + b$ . Solve for the intercept $b$ in terms of $y$ , $m$ , and $x$ .

Level A | Equation interpretation | ch02-A6

Formula from ax + b = 0

For the general linear equation $ax + b = 0$ with $a \neq 0$ , what is the unique solution $x$ ?

Level B | Equation interpretation | ch02-B1

When does the relation flip?

Which single operation, applied to both sides of an inequality, reverses its direction (e.g. turns $<$ into $>$ )?

Level B | ML application | ch02-B2

Constraints as inequalities

In machine learning, a learning rate must satisfy $\eta > 0$ and a predicted probability must satisfy $0 \le p \le 1$ . In one or two sentences, explain why these are written as inequalities rather than equations, and what would go wrong if a value violated its constraint.

Level B | Shape reasoning | ch02-B3

Inequality becomes a boolean mask

In NumPy, p is a 1-D array of shape $(6,)$ . What does the expression p >= 0.5 evaluate to?

Level B | ML application | ch02-B4

Translate a word problem

A dataset has $n$ examples. Each epoch processes the whole dataset once, and you want to run enough epochs that the model sees at least $10{,}000$ examples total. Write an inequality for the number of epochs $E$ in terms of $n$, then solve it for $E$.

Level C | NumPy implementation | ch02-C1

Verify a hand solution in NumPy

You solved $7x - 4 = 3x + 12$ by hand and got $x = 4$ . Write NumPy code that substitutes $x = 4$ into both sides, asserts they agree with np.isclose, and prints ok.

Level C | Derivation | ch02-C2

Rearrange and derive: solve for the parameter

The standardization (z-score) formula is $z = \dfrac{x - \mu}{\sigma}$ with $\sigma > 0$ . Derive the formula that recovers the original value $x$ from a given $z$ , showing each balanced move.
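
One way to lay out the balanced moves is sketched below. Each step applies the same operation to both sides, and multiplying by $\sigma$ is safe because $\sigma > 0$.

```latex
\begin{align*}
z &= \frac{x - \mu}{\sigma} \\
\sigma z &= x - \mu && \text{multiply both sides by } \sigma > 0 \\
\sigma z + \mu &= x && \text{add } \mu \text{ to both sides}
\end{align*}
```

Substituting $x = \mu + \sigma z$ back into the original formula returns $z$, which confirms the rearrangement.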

Level C | NumPy implementation | ch02-C3

Enforce a probability constraint with a mask

Generate 8 values with a fixed seed via rng = np.random.default_rng(0) and raw = rng.standard_normal(8) (these can fall outside $[0,1]$ ). Build a boolean mask of which entries already satisfy $0 \le \text{raw} \le 1$ , then use np.clip to force all entries into $[0,1]$ . Assert every clipped value satisfies the constraint and print ok.
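
The steps in this exercise could be sketched as follows. The names `inside` and `clipped` are illustrative, and the second assertion is an extra check that the exercise does not ask for.

```python
import numpy as np

rng = np.random.default_rng(0)
raw = rng.standard_normal(8)

# Chained comparisons (0 <= raw <= 1) do not work on arrays,
# so combine two element-wise masks with &.
inside = (raw >= 0) & (raw <= 1)

# Force every entry into [0, 1].
clipped = np.clip(raw, 0.0, 1.0)

assert np.all((clipped >= 0) & (clipped <= 1))
# Extra check: entries that already satisfied the constraint are untouched.
assert np.array_equal(clipped[inside], raw[inside])
print("ok")
```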

Level D | ML application | ch02-D1

A system of constraints defines a feasible region

A tiny training budget imposes two simultaneous constraints on the number of training steps $s$ : memory allows $s \le 1000$ , and you need enough steps to converge, $s \ge 200$ . (1) Describe the set of valid $s$ as an interval. (2) Now suppose someone also requires $s \ge 1200$ . Explain what happens to the feasible region and what that means practically. (3) Relate this to why an optimization problem can be infeasible.

Level A | Hand calculation | ch03-A1

Evaluate a logarithm

Evaluate $\log_{2} 8$ . (Read it as: ' $2$ to what power gives $8$ ?')

Level A | Hand calculation | ch03-A2

Product law of exponents

Use $b^{m} b^{n} = b^{m+n}$ to evaluate $2^{3} \cdot 2^{4}$ as a single integer.

Level A | Hand calculation | ch03-A3

Root as a fractional power

Write $\sqrt[3]{x^{2}}$ as a single power $x^{p}$ . Enter the exponent $p$ as a decimal.

Level A | Hand calculation | ch03-A4

A base-10 logarithm

Evaluate $\log_{10} 1000$ .

Level A | Hand calculation | ch03-A5

exp and ln are inverses

Evaluate $\ln\!\left(e^{5}\right)$ .

Level A | Hand calculation | ch03-A6

Negative exponent

Evaluate $2^{-3}$ as a decimal.

Level B | Equation interpretation | ch03-B1

Exponential to log form

The statement $2^{5} = 32$ is written in exponential form. Which is its correct logarithmic form?

Level B | Equation interpretation | ch03-B2

Spot the invalid log identity

Which of the following is not a valid logarithm identity?

Level B | Hand calculation | ch03-B3

Change of base, numerically

Using change of base $\log_{b} x = \dfrac{\ln x}{\ln b}$ with $\ln 10 \approx 2.302585$ and $\ln 2 \approx 0.693147$ , evaluate $\log_{2} 10$ . Give three decimals.

Level B | ML application | ch03-B4

Why maximize log-likelihood?

Training maximizes the log-likelihood $\sum_i \ln p_i$ rather than the raw likelihood $\prod_i p_i$ . Give two distinct reasons this substitution is safe (same optimum) and better.

Level B | Equation interpretation | ch03-B5

Reading the log-likelihood equation

A model reports a total log-likelihood of $-23000$ over a dataset. Which statement is the best interpretation?

Level C | NumPy implementation | ch03-C1

Implement a stable log-likelihood

Write log_likelihood(p) that returns $\sum_i \ln p_i$ for a 1-D array of probabilities. Show that for $5000$ tiny probabilities the naive product np.prod(p) underflows to 0.0 while your log-space sum stays finite, then print ok.
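
A minimal sketch of the comparison follows. The choice of $10^{-3}$ for every probability is an assumption made for illustration; any values small enough that their product falls below the float64 range would show the same effect.

```python
import numpy as np

def log_likelihood(p):
    """Sum of natural logs of the probabilities in a 1-D array."""
    p = np.asarray(p, dtype=np.float64)
    return np.sum(np.log(p))

p = np.full(5000, 1e-3)      # hypothetical tiny probabilities

naive = np.prod(p)           # 1e-15000 is far below the smallest float64
stable = log_likelihood(p)   # 5000 * ln(1e-3), comfortably representable

assert naive == 0.0
assert np.isfinite(stable)
assert np.isclose(stable, 5000 * np.log(1e-3))
print("ok")
```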

Level C | Derivation | ch03-C2

Derive the quotient rule

Derive the quotient rule $\log_{b}\!\left(\tfrac{x}{y}\right) = \log_{b} x - \log_{b} y$ starting from an exponent law, for $x, y > 0$ .

Level C | NumPy implementation | ch03-C3

Numerically stable log-sum-exp

Implement logsumexp(z) computing $\ln\sum_k e^{z_k}$ with the max-subtraction trick, and show it agrees with the naive formula on safe inputs but does not overflow on large logits like z = [1000, 1001, 1002]. Print ok.
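
The max-subtraction trick might look like the sketch below. The "safe" input vector is made up for the check, and `np.errstate` is used only to silence the overflow warning that the naive formula triggers.

```python
import numpy as np

def logsumexp(z):
    """ln(sum_k exp(z_k)), computed after subtracting the maximum."""
    z = np.asarray(z, dtype=np.float64)
    m = np.max(z)
    # After the shift the largest exponent is exp(0) = 1, so nothing overflows.
    return m + np.log(np.sum(np.exp(z - m)))

safe = np.array([0.5, -1.0, 2.0])
assert np.isclose(logsumexp(safe), np.log(np.sum(np.exp(safe))))

big = np.array([1000.0, 1001.0, 1002.0])
with np.errstate(over="ignore"):
    naive = np.log(np.sum(np.exp(big)))   # exp(1000) overflows to inf

assert np.isinf(naive)
assert np.isfinite(logsumexp(big))
print("ok")
```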

Level D | Paper-reading practice | ch03-D1

Why subtract the max in log-sum-exp?

The log-sum-exp identity $\ln\sum_k e^{z_k} = m + \ln\sum_k e^{z_k - m}$ holds for any constant $m$ . Prove the identity algebraically, then explain why the specific choice $m = \max_k z_k$ is the numerically safe one — addressing both overflow and underflow.

Level D | Paper-reading practice | ch03-D2

Cross-entropy as negative log-likelihood

For a classifier that outputs probability $p_i$ on the correct class of example $i$ , the average cross-entropy loss is $\mathcal{L} = -\tfrac{1}{N}\sum_i \ln p_i$ . Explain why minimizing $\mathcal{L}$ is the same as maximizing the likelihood $\prod_i p_i$ , and interpret the per-example loss $-\ln p_i$ as an amount of 'surprise' — including what happens as $p_i \to 0$ and as $p_i \to 1$ .

Level A | Hand calculation | ch04-A1

Expand and evaluate a sum

Expand and compute $\sum_{i=1}^{4} (2i - 1)$ .

Level A | Hand calculation | ch04-A2

Count the terms

How many terms does $\sum_{i=3}^{9} f(i)$ have?

Level A | Hand calculation | ch04-A3

Evaluate a double sum

For $A = \begin{pmatrix} 2 & 1 \\ 0 & 3 \end{pmatrix}$ , compute $\sum_{i=1}^{2}\sum_{j=1}^{2} A_{ij}$ .

Level A | Hand calculation | ch04-A4

Evaluate a product

Compute $\prod_{i=1}^{4} i$ .

Level A | Equation interpretation | ch04-A5

argmax returns an index

A classifier outputs probabilities $p = (0.2, 0.5, 0.3)$ over classes indexed $0, 1, 2$ . What is $\arg\max_k p_k$ (use 0-based indexing)?

Level A | Equation interpretation | ch04-A6

Read a superscript index

In the standard ML convention, what does $x^{(3)}$ denote?

Level A | Equation interpretation | ch04-A7

Cardinality of a training set

A training set is written $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{50}$ . What is $|\mathcal{D}|$ ?

Level B | Equation interpretation | ch04-B1

Reading a conditional sum

What does $\sum_{i \,:\, y_i = 1} x_i$ compute?

Level B | Equation interpretation | ch04-B2

argmin gives parameters, not the loss

Training is written $\theta^\star = \arg\min_{\theta} L(\theta)$ . What is $\theta^\star$ ?

Level B | Equation interpretation | ch04-B3

Superscript: example or power?

You read $x^{(2)}_j$ in a paper. Which statement is correct?

Level B | Equation interpretation | ch04-B4

Which index does softmax sum over?

In $p_k = \dfrac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$ , which is true about the indices $k$ and $j$ ?

Level B | Equation interpretation | ch04-B5

Drop the constants: Big-O

An algorithm runs in $\tfrac{1}{2}n^2 + 5n + 40$ operations. Its running time is:

Level C | NumPy implementation | ch04-C1

Implement MSE from its sum

Translate $L = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2$ into a function mse(yhat, y) using NumPy. Verify it on $\hat{y} = (2, 0, 3)$ , $y = (1, 0, 5)$ (which should give $5/3$ ), then print ok.
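
A direct translation of the sum into NumPy could look like this sketch. `np.mean` supplies both the sum and the $\frac{1}{n}$ factor, and the conversion to float64 is a defensive assumption so that list inputs also work.

```python
import numpy as np

def mse(yhat, y):
    """Mean squared error: (1/n) * sum_i (yhat_i - y_i)^2."""
    yhat = np.asarray(yhat, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    return np.mean((yhat - y) ** 2)

value = mse([2, 0, 3], [1, 0, 5])   # squared errors 1, 0, 4, so the mean is 5/3
assert np.isclose(value, 5 / 3)
print("ok")
```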

Level C | NumPy implementation | ch04-C2

Softmax then argmax

Implement softmax(z) returning $p_k = e^{z_k}/\sum_j e^{z_j}$ , confirm the output sums to $1$ , and report the predicted class via np.argmax. Use $z = (2, 0, 1)$ and print ok.
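
One possible implementation is sketched below. Subtracting the maximum before exponentiating is an added stability step that the exercise does not require; it leaves the result unchanged because the common factor cancels in the ratio.

```python
import numpy as np

def softmax(z):
    """p_k = exp(z_k) / sum_j exp(z_j) for a 1-D array of scores."""
    z = np.asarray(z, dtype=np.float64)
    e = np.exp(z - np.max(z))   # shift by the max (optional stability step)
    return e / np.sum(e)

p = softmax([2, 0, 1])
predicted = int(np.argmax(p))   # index of the largest probability

assert np.isclose(p.sum(), 1.0)
assert predicted == 0           # z_0 = 2 is the largest score
print("ok")
```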

Level C | NumPy implementation | ch04-C3

A coupled double sum is a matmul

The layer score is $z_i = \sum_{j=1}^{d} W_{ij} x_j$ . Implement it two ways — an explicit double loop and W @ x — for a random $W$ of shape $(4, 3)$ and $x$ of length $3$ (fixed seed). Assert they agree and print match. In a comment, state the Big-O cost.

Level C | Derivation | ch04-C4

Pull a constant out of a sum

Prove the linearity fact $\sum_{i=1}^{n} c\,a_i = c \sum_{i=1}^{n} a_i$ for any scalar constant $c$ , and explain why this justifies writing MSE with the $\frac{1}{n}$ outside the sum.

Level D | ML application | ch04-D1

Reason about scaling from Big-O

A paper reports that a dense layer costs $O(n \cdot d)$ (for $n$ tokens of dimension $d$ ) while self-attention costs $O(n^2 \cdot d)$ . For a fixed $d$ , if you double the sequence length $n$ , by what factor does each cost grow? Then explain, using only Big-O reasoning, why attention becomes the bottleneck for long sequences even though both are 'polynomial'.

Level D | Paper-reading practice | ch04-D2

Decode an unfamiliar loss

Using only the notation from this chapter, decode the (binary) cross-entropy loss $L = -\frac{1}{n}\sum_{i=1}^{n}\bigl[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\bigr]$ . Identify each symbol and its type, expand the summand for a single example with label $y_i = 1$ , and state in one sentence what the loss rewards.

Level A | Hand calculation | ch05-A1

Evaluate a function

Let $f(x) = 3x - 4$ . Compute $f(5)$ .

Level A | Hand calculation | ch05-A2

Compose at a point

Let $f(x) = x + 3$ and $g(x) = 2x$ . Compute $(g \circ f)(4)$ .

Level A | Hand calculation | ch05-A3

Order matters

For $f(x) = x + 3$ and $g(x) = 2x$ , compute $(f \circ g)(4)$ . (Compare with A2, where $(g \circ f)(4) = 14$ .)

Level A | Hand calculation | ch05-A4

Find an inverse value

The function $f(x) = 2x + 1$ has inverse $f^{-1}(y) = \dfrac{y - 1}{2}$ . Compute $f^{-1}(9)$ .

Level A | Equation interpretation | ch05-A5

Domain of a reciprocal

For $f(x) = \dfrac{1}{x - 2}$ over the real numbers, which single value of $x$ must be excluded from the domain?

Level A | Equation interpretation | ch05-A6

Range of a square

For $g(x) = x^2$ with domain all real numbers, what is the smallest value in the range?

Level B | Graph interpretation | ch05-B1

Is it a function?

A relation is graphed in the plane. Which single test decides whether it defines $y$ as a function of $x$ ?

Level B | Equation interpretation | ch05-B2

Codomain versus range

Explain in one or two sentences the difference between the codomain and the range of a function $f: X \to Y$ , using $f(x) = x^2$ with $X = Y = \mathbb{R}$ as an example.

Level B | Equation interpretation | ch05-B3

Which function is invertible?

Each function below has domain all of $\mathbb{R}$ . Which one is invertible (one-to-one)?

Level B | ML application | ch05-B4

Inputs of a model and a loss

A model is written $\hat{y} = f(x; \theta)$ and its loss $L(\theta)$ . In one or two sentences, say which quantity is held fixed and which varies (a) at prediction time and (b) at training time, and why $L$ is written as a function of $\theta$ alone.

Level C | Derivation | ch05-C1

Compose two functions symbolically

Let $f(x) = 3x - 1$ and $g(x) = x^2 + 2$ . Derive closed-form expressions for $(g \circ f)(x)$ and $(f \circ g)(x)$ , and confirm they are different functions.

Level C | Derivation | ch05-C2

Derive an inverse

The function $f(x) = \dfrac{x - 4}{3}$ is one-to-one on $\mathbb{R}$ . Derive its inverse $f^{-1}(y)$ and verify $f^{-1}(f(x)) = x$ .

Level C | NumPy implementation | ch05-C3

Composition on a grid in NumPy

Implement $f(x) = x + 1$ and $g(x) = x^2$ as Python functions. Using np.linspace(-3, 3, 7), evaluate $g \circ f$ and $f \circ g$ on the grid, confirm with np.allclose that they are not equal everywhere, and print ok.

Level D | Paper-reading practice | ch05-D1

Why deep networks are compositions

A network layer is a function $f_\ell$, and an $L$-layer network is the composition $f_L \circ \cdots \circ f_1$. (a) Explain why stacking only linear layers $f_\ell(x) = W_\ell x$ gains no expressive power over a single linear layer. (b) Explain what a nonlinear activation inserted between layers changes. (c) Connect the composition structure to why the chain rule is central to training.

Level A | Hand calculation | ch06-A1

Sigmoid at zero

Compute $\sigma(0)$ for the sigmoid $\sigma(x) = \dfrac{1}{1 + e^{-x}}$ .

Level A | Hand calculation | ch06-A2

ReLU of a negative input

Compute $\mathrm{ReLU}(-3) = \max(0, -3)$ .

Level A | Hand calculation | ch06-A3

ReLU of a positive input

Compute $\mathrm{ReLU}(5) = \max(0, 5)$ .

Level A | Equation interpretation | ch06-A4

Range of tanh

What is the range of $\tanh(x)$ ?

Level A | Hand calculation | ch06-A5

Leaky ReLU on a negative input

Using the course's Leaky ReLU with leak $\alpha = 0.1$ , so $\mathrm{LReLU}(x) = x$ for $x \ge 0$ and $\alpha x$ for $x < 0$ , compute $\mathrm{LReLU}(-4)$ .

Level A | Equation interpretation | ch06-A6

Maximum slope of the sigmoid

The sigmoid derivative is $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ . What is its maximum value?

Level B | Equation interpretation | ch06-B1

Which activation is zero-centered?

Which of these hidden-layer activations is zero-centered (outputs symmetric about $0$ )?

Level B | Graph interpretation | ch06-B2

Match the graph to the function

A plotted curve is $0$ for all negative inputs, then rises as a straight line of slope $1$ for positive inputs, with a sharp corner at the origin. Which function is it?

Level B | Shape reasoning | ch06-B3

Range of softplus

What is the range of softplus $\zeta(x) = \ln(1 + e^{x})$ ?

Level B | ML application | ch06-B4

Saturation and vanishing gradients

In a deep network of sigmoid layers, explain why pushing many neurons into their saturated regions (large $|x|$ ) makes the early layers train slowly. Reference the chain rule and the size of $\sigma'(x)$ .

Level B | ML application | ch06-B5

Sigmoid output, ReLU hidden

A binary classifier commonly uses ReLU in its hidden layers but a sigmoid on the final output neuron. Give the reason for each choice in one sentence.

Level C | NumPy implementation | ch06-C1

Implement ReLU and Leaky ReLU

Implement vectorized relu(x) and leaky_relu(x, alpha=0.1) for a 1-D NumPy array. Verify on x = np.array([-2.0, 0.0, 3.0]) that ReLU gives [0, 0, 3] and Leaky ReLU gives [-0.2, 0, 3], then print ok.

Level C | NumPy implementation | ch06-C2

Numerically stable softplus

Implement softplus(x) that does not overflow for large inputs, and confirm softplus(1000.0) is finite (and $\approx 1000$ ) while the naive np.log(1 + np.exp(1000.0)) is inf. Also check numerically that the derivative of softplus equals the sigmoid. Print ok.
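
One way to avoid the overflow is to lean on `np.logaddexp`, since $\ln(1 + e^{x}) = \ln(e^{0} + e^{x})$. The sketch below assumes that approach. The derivative check uses a central finite difference on a small grid, and the step size and tolerance are illustrative choices.

```python
import numpy as np

def softplus(x):
    # logaddexp(0, x) = ln(e^0 + e^x) = ln(1 + e^x), evaluated without overflow
    return np.logaddexp(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

with np.errstate(over="ignore"):
    naive = np.log(1 + np.exp(1000.0))   # exp(1000) overflows to inf

assert np.isinf(naive)
assert np.isfinite(softplus(1000.0))
assert np.isclose(softplus(1000.0), 1000.0)

# Central finite difference approximates d/dx softplus(x).
x = np.linspace(-5, 5, 11)
h = 1e-5
numeric = (softplus(x + h) - softplus(x - h)) / (2 * h)
assert np.allclose(numeric, sigmoid(x), atol=1e-6)
print("ok")
```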

Level C | Derivation | ch06-C3

Derive the sigmoid's derivative

Show that $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$ for $\sigma(x) = (1 + e^{-x})^{-1}$ .

Level C | Numerical experiment | ch06-C4

Simulate a vanishing gradient

Empirically show gradient vanishing: multiply together the sigmoid derivatives $\sigma'(z)$ across a stack of layers whose pre-activations are all saturated (e.g. $z = 6$ ). Compare the product for a 20-layer sigmoid stack against a 20-layer ReLU stack (per-layer derivative $1$ for $z > 0$ ). Print both products, assert the sigmoid product is far smaller, and print ok.
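
A small sketch of the experiment using only the standard library. The threshold in the final assertion is an arbitrary illustrative choice for "far smaller", and every layer is assumed to sit at the same pre-activation $z = 6$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

layers = 20
z = 6.0   # saturated pre-activation, assumed identical in every layer

# Chain rule: the gradient reaching the first layer carries one
# activation-derivative factor per layer.
sigmoid_product = sigmoid_prime(z) ** layers
relu_product = 1.0 ** layers   # ReLU derivative is 1 for z > 0

print(sigmoid_product)
print(relu_product)
assert sigmoid_product < 1e-40 * relu_product
print("ok")
```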

Level D | Paper-reading practice | ch06-D1

Why did ReLU help deep networks?

Before ~2011, deep networks used sigmoid/tanh and were notoriously hard to train past a few layers; switching to ReLU was a turning point. Explain the main mechanism by which ReLU improved trainability of deep nets, name at least one additional practical benefit, and state one drawback ReLU introduced (with a named remedy).

Level D | Paper-reading practice | ch06-D2

Choosing a positivity-preserving output activation

You are designing a network whose scalar output must be strictly positive (e.g. it predicts the variance $v$ of a Gaussian, or a rate parameter). Compare using $\mathrm{ReLU}$ , $\exp$ , and softplus $\zeta$ as the final activation. Which would you choose and why? Address the range, gradient behavior, and numerical stability of each.

Level A | Shape reasoning | ch18-A1

Read a shape off a nested list

What is the .shape of np.array([[1, 2, 3], [4, 5, 6]])? Enter it as a tuple, e.g. (2, 3).

Level A | Shape reasoning | ch18-A2

Rank of a 3-D array

An array has shape (2, 3, 4). What is its .ndim (its rank, the number of axes)?

Level A | Equation interpretation | ch18-A3

Default integer dtype

On a 64-bit platform, what dtype does np.array([1, 2, 3]) get? Enter the dtype name, e.g. int64.

Level A | Shape reasoning | ch18-A4

Broadcast a column and a row

What is the output shape of np.arange(3).reshape(3, 1) + np.arange(4).reshape(1, 4)? Enter it as a tuple, e.g. (3, 4).

Level A | Shape reasoning | ch18-A5

Shape after a sum over axis 0

For A with shape (2, 3), what is the shape of A.sum(axis=0)? Enter it as a tuple, e.g. (3,).

Level A | Hand calculation | ch18-A6

Integer floor division

What is np.array([1, 2, 3]) // 2? Enter the three resulting values as a, b, c.

Level A | Hand calculation | ch18-A7

Index a 2-D array

Let A = np.arange(12).reshape(3, 4). What single value is A[1, 2]?

Level B | Shape reasoning | ch18-B1

Broadcast a matrix column against a vector

What is the output shape of an operation between an array of shape (4, 1) and an array of shape (3,)? Enter it as a tuple, e.g. (4, 3).

Level B | Shape reasoning | ch18-B2

Which pair fails to broadcast?

Which of the following pairs of shapes cannot be broadcast together?

Level B | Shape reasoning | ch18-B3

The (n,) vs (n,1) trap

You have A of shape (5, 3) and a vector v of shape (3,). What does A - v do, and what shape is the result?

Level B | Equation interpretation | ch18-B4

Dtype of a mixed sum

a = np.array([1, 2, 3]) (int64) and b = np.array([1.0, 2.0, 3.0]) (float64). What is the dtype of a + b?

Level B | Equation interpretation | ch18-B5

Integer overflow

What is the value of (np.array([127], dtype=np.int8) + np.int8(1))[0]?

Level C | NumPy implementation | ch18-C1

Per-feature standardization by broadcasting

Given a batch X of shape (N, D), implement standardize(X) that returns (X - mu) / sigma, where mu and sigma are the per-feature (per-column) mean and standard deviation. Use axis and broadcasting — no Python loops. Verify the output has per-feature mean $\approx 0$ and std $\approx 1$ , then print ok.
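
A loop-free version might look like the sketch below. It assumes every feature has nonzero spread and uses the population standard deviation (NumPy's default, `ddof=0`). The test batch is random data invented for the check.

```python
import numpy as np

def standardize(X):
    mu = X.mean(axis=0)        # shape (D,): one mean per column
    sigma = X.std(axis=0)      # shape (D,): population std (ddof=0)
    return (X - mu) / sigma    # (N, D) against (D,) broadcasts across rows

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))   # hypothetical batch
Z = standardize(X)

assert Z.shape == X.shape
assert np.allclose(Z.mean(axis=0), 0.0)
assert np.allclose(Z.std(axis=0), 1.0)
print("ok")
```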

Level C | NumPy implementation | ch18-C2

Row-normalize with keepdims

Given a non-negative array A of shape (N, D), implement row_normalize(A) so that each row sums to 1. Use keepdims=True so the row sums broadcast back onto A. Assert each row sums to 1, then print ok.

Level C | NumPy implementation | ch18-C3

Predict-then-verify a broadcast shape

Write broadcast_shape(s1, s2) that returns the broadcast output shape (as a tuple) of two shapes, or raises ValueError if they are incompatible — implementing the rule by hand (do not call np.broadcast_shapes). Test that broadcast_shape((2, 1, 3), (4, 3)) == (2, 4, 3) and that (3, 4) with (3,) raises, then print ok.
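
The rule can be written in plain Python, as in the sketch below: right-align the two shapes, pad the shorter one with leading 1s, then compare axis by axis. The error message and the `raised` flag are illustrative choices.

```python
def broadcast_shape(s1, s2):
    """Broadcast two shapes by hand: align on the right, pad with 1s."""
    n = max(len(s1), len(s2))
    a = (1,) * (n - len(s1)) + tuple(s1)
    b = (1,) * (n - len(s2)) + tuple(s2)
    out = []
    for da, db in zip(a, b):
        if da == db or db == 1:
            out.append(da)      # equal sizes, or b stretches to match a
        elif da == 1:
            out.append(db)      # a stretches to match b
        else:
            raise ValueError(f"cannot broadcast {s1} with {s2}")
    return tuple(out)

assert broadcast_shape((2, 1, 3), (4, 3)) == (2, 4, 3)

try:
    broadcast_shape((3, 4), (3,))   # trailing axes 4 and 3 are incompatible
except ValueError:
    raised = True
else:
    raised = False
assert raised
print("ok")
```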

Level C | NumPy implementation | ch18-C4

Boolean mask and fancy indexing

Given A = np.arange(12).reshape(3, 4), use a boolean mask to extract all entries greater than 5 into a 1-D array, and separately use fancy indexing to build a new array from rows [2, 0] in that order. Assert the mask result equals [6, 7, 8, 9, 10, 11] and the fancy result has shape (2, 4), then print ok.

Level D | Debugging | ch18-D1

Diagnose a silent broadcasting bug

An engineer wants element-wise squared errors between predictions pred and targets true, both length-1000 vectors, and writes err = pred[:, None] - true[None, :], then mse = (err ** 2).mean(). The code runs but the reported MSE is wrong and memory usage spikes. Explain what shape err actually has, why it runs without error, what the correct one-liner is, and what single assertion would have caught the bug immediately.
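
The shapes involved can be inspected directly. The sketch below uses random stand-in vectors (an assumption, since the exercise supplies no data) to contrast the buggy expression with a plain element-wise difference, followed by a shape assertion of the kind the exercise asks about.

```python
import numpy as np

rng = np.random.default_rng(0)
pred = rng.standard_normal(1000)   # stand-in predictions
true = rng.standard_normal(1000)   # stand-in targets

# (1000, 1) minus (1, 1000) broadcasts to every pair (i, j).
err_bug = pred[:, None] - true[None, :]
# (1000,) minus (1000,) pairs entry i with entry i only.
err = pred - true

assert err_bug.shape == (1000, 1000)
assert err.shape == pred.shape     # fails immediately for err_bug
mse = (err ** 2).mean()
print("ok")
```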

Level D | ML application | ch18-D2

float32 vs float64 in a training pipeline

Deep-learning frameworks default model weights and activations to float32, while NumPy defaults to float64. Give two concrete reasons float32 is preferred for large-scale training, one concrete risk it introduces, and one place in a pipeline where you would deliberately switch back to float64.

Level A | Hand calculation | ch07-A1

Compute a dot product

Let $\mathbf{a} = (2, -1, 3)$ and $\mathbf{b} = (4, 5, -2)$ . Compute $\mathbf{a} \cdot \mathbf{b}$ .

Level A | Hand calculation | ch07-A2

Vector addition

Compute $\mathbf{a} + \mathbf{b}$ for $\mathbf{a} = (1, 2)$ and $\mathbf{b} = (3, -5)$ . Enter the result as x, y.

Level A | Hand calculation | ch07-A3

Scalar multiplication

Compute $-2\,\mathbf{v}$ for $\mathbf{v} = (3, -1, 0)$ . Enter as x, y, z.

Level A | Hand calculation | ch07-A4

Length of a vector

Compute the Euclidean length $\lVert \mathbf{a} \rVert = \sqrt{\mathbf{a}\cdot\mathbf{a}}$ for $\mathbf{a} = (3, 4)$ .

Level A | Hand calculation | ch07-A5

A linear combination

With $\mathbf{v}_1 = (1, 0)$ and $\mathbf{v}_2 = (0, 1)$ , compute $3\mathbf{v}_1 + (-2)\mathbf{v}_2$ . Enter as x, y.

Level A | Equation interpretation | ch07-A6

Orthogonality check

Two vectors are orthogonal when their dot product is which value?

Level B | Equation interpretation | ch07-B1

Sign of the dot product

Two nonzero vectors have a negative dot product. Which is true about the angle $\theta$ between them?

Level B | ML application | ch07-B2

Why the weighted sum?

In a neuron $z = \mathbf{w}^\top\mathbf{x} + b$ , explain in one or two sentences what a large positive weight $w_i$ means about feature $x_i$ 's influence on $z$ , and what a weight near zero means.

Level B | Shape reasoning | ch07-B3

Shape of a dot product

If $\mathbf{a}, \mathbf{b} \in \mathbb{R}^{768}$ (say, two word embeddings), what is the shape of $\mathbf{a}\cdot\mathbf{b}$ ?

Level B | Proof-style reasoning | ch07-B4

Commutativity of the dot product

Show that the dot product is commutative: $\mathbf{a}\cdot\mathbf{b} = \mathbf{b}\cdot\mathbf{a}$ for all $\mathbf{a},\mathbf{b}\in\mathbb{R}^n$ .

Level C | NumPy implementation | ch07-C1

Implement cosine similarity

Implement cosine_similarity(a, b) for two 1-D NumPy arrays, returning $\dfrac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert}$. Verify it returns approximately 1.0 for two parallel vectors, then print ok.

Level C | NumPy implementation | ch07-C2

Loop vs vectorized dot product

Write dot_loop(a, b) (an explicit Python loop) and dot_vec(a, b) (using @). Generate a, b of length 100000 with a fixed seed, confirm the results agree with np.isclose, and print match. In a comment, state which is faster and why.

Level C | Derivation | ch07-C3

Derive the projection formula

The projection of $\mathbf{a}$ onto $\mathbf{b}$ is the vector $\mathbf{p} = t\,\mathbf{b}$ such that the error $\mathbf{a} - \mathbf{p}$ is orthogonal to $\mathbf{b}$ . Derive $t$ .

Level D | Paper-reading practice | ch07-D1

Why cosine, not dot product, for similarity?

Retrieval systems and embedding papers usually rank by cosine similarity rather than the raw dot product. Give a concrete example of vectors where the raw dot product is misleading but cosine is not, explain what property cosine adds, then name one situation where the raw dot product is the right choice.

Level A | Hand calculation | ch12-A1

Compute an L2 norm

Compute the L2 (Euclidean) norm $\lVert\mathbf{x}\rVert_2$ of $\mathbf{x} = (3, 4)$ .

Level A | Hand calculation | ch12-A2

Compute an L1 norm

Compute the L1 (Manhattan) norm $\lVert\mathbf{x}\rVert_1$ of $\mathbf{x} = (3, -4)$ .

Level A | Hand calculation | ch12-A3

Compute an L∞ norm

Compute the L∞ (max) norm $\lVert\mathbf{x}\rVert_\infty$ of $\mathbf{x} = (-2, 5, -7, 1)$ .

Level A | Hand calculation | ch12-A4

Euclidean distance between two points

Compute the Euclidean distance $d(\mathbf{a}, \mathbf{b})$ between $\mathbf{a} = (1, 2)$ and $\mathbf{b} = (4, 6)$ .

Level A | Hand calculation | ch12-A5

Cosine of orthogonal vectors

Compute the cosine similarity of $\mathbf{a} = (1, 0)$ and $\mathbf{b} = (0, 3)$ .

Level A | Hand calculation | ch12-A6

Cosine of parallel vectors

Compute the cosine similarity of $\mathbf{a} = (2, 1)$ and $\mathbf{b} = (6, 3)$ .

Level B | Equation interpretation | ch12-B1

Ordering of the three norms

For any vector $\mathbf{x}$ , which chain of inequalities among its L1, L2, and L∞ norms always holds?

Level B | ML application | ch12-B2

Which penalty produces sparsity?

You want a regression model that sets many weights to exactly zero for feature selection. Do you add an L1 or an L2 penalty to the loss, and what is the resulting method called?

Level B | Equation interpretation | ch12-B3

Distance versus difference of norms

A colleague computes the 'distance' between $\mathbf{a} = (3, 0)$ and $\mathbf{b} = (0, 3)$ as $\lVert\mathbf{a}\rVert_2 - \lVert\mathbf{b}\rVert_2$ and gets $0$ , concluding the points coincide. In one or two sentences, explain the error and give the correct Euclidean distance.

Level B | ML application | ch12-B4

Why cosine for retrieval

A document embedding $\mathbf{d}$ and the same document repeated twice, embedded as roughly $2\mathbf{d}$ , should be judged equally relevant to a query $\mathbf{q}$ . Explain in one or two sentences why cosine similarity gives the same score for $\mathbf{d}$ and $2\mathbf{d}$ but the raw dot product does not.

Level B | ML application | ch12-B5

Norm versus squared norm in regularization

Ridge regression penalizes $\lVert\mathbf{w}\rVert_2^2$ , not $\lVert\mathbf{w}\rVert_2$ . Give the main reason the squared norm is preferred as the penalty term.

Level C | NumPy implementation | ch12-C1

Implement the general Lp norm

Implement lp_norm(x, p) for a 1-D NumPy array, returning $\left(\sum_i |x_i|^p\right)^{1/p}$ , and handle p = np.inf as the max norm. Verify against np.linalg.norm(x, ord=p) for $p = 1, 2, \infty$ on $\mathbf{x} = (3, -4)$ , then print ok.

Level C | NumPy implementation | ch12-C2

Cosine similarity with a zero-vector guard

Implement cosine(a, b) returning $\dfrac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert_2\,\lVert\mathbf{b}\rVert_2}$, but return 0.0 if either vector is the zero vector (so it never emits nan). Clamp the result to $[-1, 1]$. Verify it gives 1.0 for parallel vectors, 0.0 for orthogonal ones, and 0.0 for a zero input, then print ok.
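
A guarded version could be sketched as follows. The test vectors are borrowed from the earlier hand-calculation exercises in this chapter, and converting inputs to float64 is a defensive assumption.

```python
import numpy as np

def cosine(a, b):
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    if na == 0.0 or nb == 0.0:
        return 0.0               # guard: avoids 0/0, which would give nan
    # Rounding can push the ratio slightly outside [-1, 1], so clamp it.
    return float(np.clip(a @ b / (na * nb), -1.0, 1.0))

assert np.isclose(cosine([2, 1], [6, 3]), 1.0)   # parallel
assert cosine([1, 0], [0, 3]) == 0.0             # orthogonal
assert cosine([0, 0], [1, 2]) == 0.0             # zero-vector guard
print("ok")
```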

Level C | Derivation | ch12-C3

Derive cosine similarity from the dot product

Starting from the geometric form of the dot product, $\mathbf{a}\cdot\mathbf{b} = \lVert\mathbf{a}\rVert_2\,\lVert\mathbf{b}\rVert_2\cos\theta$ , derive the cosine similarity formula and explain why the result must lie in $[-1, 1]$ and why it is scale-invariant.

Level D | Paper-reading practice | ch12-D1

Why L1 produces sparsity but L2 does not

Ridge (L2) and Lasso (L1) both shrink weights, yet only Lasso drives many of them to exactly zero. Using the geometry of the L1 and L2 unit balls in 2-D, explain why the L1 constraint favors solutions on the axes (sparse) while the L2 constraint does not. Then name one concrete situation where you would deliberately prefer L2 over L1.

Level A | Shape reasoning | ch08-A1

Shape of a matrix product

$\mathbf{A}$ has shape $(2, 3)$ and $\mathbf{B}$ has shape $(3, 4)$ . What is the shape of $\mathbf{A}\mathbf{B}$ ? Enter as (rows, cols).

Level A | Hand calculation | ch08-A2

Matrix–vector product by hand

Let $\mathbf{A} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$ and $\mathbf{x} = (1, 1)$ . Compute $\mathbf{A}\mathbf{x}$ . Enter as y1, y2.

Level A | Hand calculation | ch08-A3

Indexing an entry

For $\mathbf{A} = \begin{bmatrix} 2 & 4 & 6 \\ 8 & 10 & 12 \\ 14 & 16 & 18 \end{bmatrix}$ , what is A[1][2] using 0-based indexing (as in NumPy: row first, column second)?

Level A | Shape reasoning | ch08-A4

Shape after transpose

$\mathbf{A}$ has shape $(5, 2)$ . What is the shape of $\mathbf{A}^\top$ ? Enter as (rows, cols).

Level A | Hand calculation | ch08-A5

Matrix addition

Compute $\mathbf{A} + \mathbf{B}$ for $\mathbf{A} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$ and $\mathbf{B} = \begin{bmatrix} 10 & 20 \\ 30 & 40 \end{bmatrix}$ . Enter the entries row by row: a11, a12, a21, a22.

Level A | Hand calculation | ch08-A6

Trace of a matrix

Compute $\operatorname{tr}(\mathbf{A})$ for $\mathbf{A} = \begin{bmatrix} 3 & 1 & 0 \\ 2 & 5 & 7 \\ 1 & 0 & 4 \end{bmatrix}$ .

Level B | Shape reasoning | ch08-B1

Tracking shapes through a chain

With $\mathbf{A}$ of shape $(3, 5)$ , $\mathbf{B}$ of shape $(5, 2)$ , and $\mathbf{C}$ of shape $(2, 7)$ , what is the shape of $\mathbf{A}\mathbf{B}\mathbf{C}$ ? Enter as (rows, cols).

Level BEquation interpretationch08-B2

Why AB is not BA

Let $\mathbf{A}$ have shape $(2, 3)$ and $\mathbf{B}$ have shape $(3, 2)$ . Which statement is correct?

Level BHand calculationch08-B3

Ax as a combination of columns

Using the column view, compute $\mathbf{A}\mathbf{x}$ for $\mathbf{A} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$ and $\mathbf{x} = (2, 1)$ : form $x_1\mathbf{a}_1 + x_2\mathbf{a}_2$ . Enter as y1, y2.

Level BEquation interpretationch08-B4

The identity does nothing

For $\mathbf{A}$ of shape $(4, 3)$ and $\mathbf{I}_3$ the $3 \times 3$ identity, what is $\mathbf{A}\mathbf{I}_3$ ?

Level BML applicationch08-B5

The batch layer shape

A dense layer has weights $\mathbf{W}$ of shape $(20, 10)$ (20 outputs, 10 inputs). A batch of data $\mathbf{X}$ has shape $(64, 10)$ (64 examples, each with 10 features). What is the shape of $\mathbf{X}\mathbf{W}^\top$ ? Enter as (rows, cols).

Level CNumPy implementationch08-C1

Implement matmul: loop vs @

Write matmul_loop(A, B) using an explicit triple loop over $i, j, k$ (the entry formula $c_{ij} = \sum_k a_{ik} b_{kj}$ ), and confirm it agrees with A @ B on random matrices with a fixed seed. Assert the output shape is (A.shape[0], B.shape[1]), then print ok.
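
One possible sketch of the triple loop, assuming float inputs and using seed 0 and shapes (3, 4) and (4, 2) as arbitrary illustrative choices. The innermost loop accumulates the entry formula $c_{ij} = \sum_k a_{ik} b_{kj}$ directly.

```python
import numpy as np

def matmul_loop(A, B):
    """Matrix product via the entry formula c_ij = sum_k a_ik * b_kj."""
    n, m = A.shape
    m2, p = B.shape
    assert m == m2, "inner dimensions must match"
    C = np.zeros((n, p))
    for i in range(n):          # rows of A
        for j in range(p):      # columns of B
            for k in range(m):  # shared inner index
                C[i, j] += A[i, k] * B[k, j]
    return C

rng = np.random.default_rng(0)  # fixed seed; the value 0 is an arbitrary choice
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 2))
C = matmul_loop(A, B)
assert C.shape == (A.shape[0], B.shape[1])
assert np.allclose(C, A @ B)
print("ok")
```

The loop version is only for understanding; for real work `A @ B` is both shorter and far faster.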

Level CNumPy implementationch08-C2

Matrix–vector product, both views

Implement matvec_rows(A, x) (stack of row dot products) and matvec_cols(A, x) (weighted sum of columns), and confirm both equal A @ x. Use a fixed seed with $\mathbf{A}$ of shape $(3, 4)$ and $\mathbf{x}$ of length $4$ . Print ok.
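
A minimal sketch of the two views, assuming float arrays and an arbitrary seed of 0. The row view builds each output entry as a dot product; the column view accumulates scaled columns.

```python
import numpy as np

def matvec_rows(A, x):
    # Row view: entry i is the dot product of row i of A with x.
    return np.array([A[i] @ x for i in range(A.shape[0])])

def matvec_cols(A, x):
    # Column view: a weighted sum of the columns of A, with weights x_j.
    y = np.zeros(A.shape[0])
    for j in range(A.shape[1]):
        y += x[j] * A[:, j]
    return y

rng = np.random.default_rng(0)  # fixed seed; the value is an arbitrary choice
A = rng.normal(size=(3, 4))
x = rng.normal(size=4)
assert np.allclose(matvec_rows(A, x), A @ x)
assert np.allclose(matvec_cols(A, x), A @ x)
print("ok")
```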

Level CNumPy implementationch08-C3

Transpose reverses a product

Verify numerically that $(\mathbf{A}\mathbf{B})^\top = \mathbf{B}^\top \mathbf{A}^\top$ (and that the naive $\mathbf{A}^\top \mathbf{B}^\top$ generally is not equal, and may even be a shape error). Use a fixed seed with $\mathbf{A}$ of shape $(2, 3)$ and $\mathbf{B}$ of shape $(3, 4)$ . Print ok.
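
One way to run this check, sketched with an arbitrary seed of 0. With these shapes the naive order $\mathbf{A}^\top \mathbf{B}^\top$ is a $(3, 2)$ by $(4, 3)$ product, so NumPy is expected to raise a `ValueError` rather than return a wrong answer.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed; the value is an arbitrary choice
A = rng.normal(size=(2, 3))
B = rng.normal(size=(3, 4))

lhs = (A @ B).T   # (2, 4) transposed -> (4, 2)
rhs = B.T @ A.T   # (4, 3) @ (3, 2)   -> (4, 2)
assert lhs.shape == rhs.shape == (4, 2)
assert np.allclose(lhs, rhs)

# Naive order: (3, 2) @ (4, 3) has inner dimensions 2 and 4, which disagree.
naive_failed = False
try:
    A.T @ B.T
except ValueError:
    naive_failed = True
assert naive_failed
print("ok")
```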

Level CNumPy implementationch08-C4

A Gram matrix is symmetric

For any data matrix $\mathbf{X}$ of shape $(N, d)$ , show numerically that the Gram matrix $\mathbf{G} = \mathbf{X}^\top \mathbf{X}$ is square, has shape $(d, d)$ , and is symmetric ( $\mathbf{G} = \mathbf{G}^\top$ ). Use a fixed seed with $N = 5$ , $d = 3$ , and print ok.

Level DShape reasoningch08-D1

Attention, purely by shape

Self-attention computes $\operatorname{softmax}(\mathbf{Q}\mathbf{K}^\top)\mathbf{V}$ , where $\mathbf{Q}, \mathbf{K}, \mathbf{V}$ each have shape $(N, d)$ (here $N$ = sequence length, $d$ = head dimension). Softmax is applied row-wise and does not change shape. Working purely from the inner/outer rule, (a) give the shape of the score matrix $\mathbf{Q}\mathbf{K}^\top$ , (b) give the shape of the final output, and (c) explain in one sentence why the score matrix is $N \times N$ regardless of $d$ .

Level DError identificationch08-D2

Debug a stacked-layer shape chain

An engineer stacks two dense layers on a batch. The data is $\mathbf{X}$ of shape $(64, 784)$ ; layer 1 has weights $\mathbf{W}_1$ of shape $(128, 784)$ ; layer 2 has weights $\mathbf{W}_2$ of shape $(10, 128)$ . They write Z = X @ W1 @ W2 and get a shape error on the first product. Explain why it fails, write the corrected expression using transposes, and give the shape after each matmul.
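
The shape chain can be checked mechanically. The sketch below uses zero-filled placeholder arrays (only the shapes matter here) and a hypothetical flag `first_product_ok` to record whether the original expression's first product runs.

```python
import numpy as np

X = np.zeros((64, 784))    # batch of 64 examples, 784 features each
W1 = np.zeros((128, 784))  # layer 1: 128 outputs, 784 inputs
W2 = np.zeros((10, 128))   # layer 2: 10 outputs, 128 inputs

# X @ W1 pairs inner dimensions 784 and 128, so NumPy raises ValueError.
try:
    X @ W1
    first_product_ok = True
except ValueError:
    first_product_ok = False

H = X @ W1.T   # (64, 784) @ (784, 128) -> (64, 128)
Z = H @ W2.T   # (64, 128) @ (128, 10)  -> (64, 10)
print(first_product_ok, H.shape, Z.shape)
```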

Level AHand calculationch09-A1

Solve a 2×2 system by elimination

Solve the system

$$\begin{aligned} 2x + y &= 5 \\ x - y &= 1. \end{aligned}$$

Enter the solution as x, y.

Level AHand calculationch09-A2

Another 2×2 solve

Solve

$$\begin{aligned} x + y &= 4 \\ 2x - y &= 5. \end{aligned}$$

Enter as x, y.

Level AEquation interpretationch09-A3

Write a system in matrix form

The system $3x + 2y = 12,\ x - y = 1$ is written as $A\mathbf{x} = \mathbf{b}$ . What is the right-hand side vector $\mathbf{b}$ ? Enter as b1, b2.

Level AShape reasoningch09-A4

Count the free variables

A consistent system in $n = 3$ unknowns has $\operatorname{rank}(A) = 2$ . How many free variables does its solution set have?

Level AEquation interpretationch09-A5

Reading an inconsistent row

After elimination, a row of the augmented matrix becomes $[\,0\ 0\ 0 \mid 5\,]$ . How many solutions does the system have?

Level AEquation interpretationch09-A6

Determinant and uniqueness

For a square system $A\mathbf{x} = \mathbf{b}$ , a unique solution for every $\mathbf{b}$ is guaranteed exactly when $\det A$ satisfies which condition?

Level BGraph interpretationch09-B1

Classify: parallel lines

The two equations of a 2×2 system plot as parallel but distinct lines. How many solutions does the system have?

Level BEquation interpretationch09-B2

Rank decides the outcome

A system has $n = 3$ unknowns with $\operatorname{rank}(A) = \operatorname{rank}([A \mid \mathbf{b}]) = 3$ . Which outcome holds?

Level BShape reasoningch09-B3

Shapes in the normal equations

In least squares the design matrix is $X \in \mathbb{R}^{m \times n}$ with $m$ examples and $n$ features. In the normal equations $X^\top X\,\boldsymbol{\theta} = X^\top \mathbf{y}$ , what is the shape of the coefficient matrix $X^\top X$ ?

Level BML applicationch09-B4

When is XᵀX singular?

Explain, in one or two sentences, why $X^\top X$ becomes singular when two feature columns of $X$ are exactly proportional (perfectly collinear), and what this means for the regression weights.

Level CNumPy implementationch09-C1

Solve a 3×3 system in NumPy

Use np.linalg.solve to solve

$$\begin{aligned} 2x + y - z &= 8 \\ -3x - y + 2z &= -11 \\ -2x + y + 2z &= -3. \end{aligned}$$

Verify the solution with np.allclose(A @ x, b), then print ok.
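
A short sketch of the setup: the coefficient matrix `A` and right-hand side `b` are read row by row from the three equations, with the unknowns ordered as $(x, y, z)$.

```python
import numpy as np

# Coefficients of (x, y, z), one row per equation.
A = np.array([[ 2.0,  1.0, -1.0],
              [-3.0, -1.0,  2.0],
              [-2.0,  1.0,  2.0]])
b = np.array([8.0, -11.0, -3.0])

x = np.linalg.solve(A, b)
assert np.allclose(A @ x, b)  # residual check rather than trusting the solver blindly
print(x)
print("ok")
```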

Level CDerivationch09-C2

Elimination to RREF by hand

Solve

$$\begin{aligned} x + 2y &= 4 \\ 3x + 4y &= 10 \end{aligned}$$

by Gaussian elimination on the augmented matrix, showing the forward elimination step and back-substitution. Give the final solution.

Level CNumPy implementationch09-C3

Detect a singular system in NumPy

Build the singular system $\begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix}\mathbf{x} = \begin{bmatrix} 3 \\ 7 \end{bmatrix}$ . Attempt np.linalg.solve inside a try/except np.linalg.LinAlgError, report the rank of $A$ versus the number of unknowns, and print ok.
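
A sketch of one way to structure the check. The names `solved`, `rank`, and `rank_aug` are illustrative. Comparing the rank of $A$ with the rank of the augmented matrix is an addition beyond what the exercise asks for, included because it separates "no solution" from "infinitely many".

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 4.0]])  # second row is twice the first
b = np.array([3.0, 7.0])

try:
    x = np.linalg.solve(A, b)
    solved = True
except np.linalg.LinAlgError:
    solved = False

rank = np.linalg.matrix_rank(A)
rank_aug = np.linalg.matrix_rank(np.column_stack([A, b]))
n_unknowns = A.shape[1]
print(f"solved={solved}, rank(A)={rank}, rank([A|b])={rank_aug}, unknowns={n_unknowns}")
assert rank < n_unknowns
print("ok")
```

For matrices that are only nearly singular the exception may not fire, so the rank (or the condition number) is usually the more dependable signal.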

Level DPaper-reading practicech09-D1

Why not just invert XᵀX?

Textbooks write the least-squares weights as $\boldsymbol{\theta} = (X^\top X)^{-1} X^\top \mathbf{y}$ , yet mature libraries avoid forming that inverse; they call a solver such as np.linalg.lstsq. Explain (a) why explicitly inverting $X^\top X$ is numerically risky, referencing the condition number, and (b) what regularization (ridge, $X^\top X + \lambda I$ ) does to solvability and conditioning.

Level AHand calculationch10-A1

Are these two vectors independent?

Are $\mathbf{v}_1 = (2, 3)$ and $\mathbf{v}_2 = (4, 6)$ linearly independent? Enter 1 for independent or 0 for dependent.

Level AHand calculationch10-A2

Read the rank off the columns

The matrix $A = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix}$ has how many linearly independent columns, i.e. what is $\operatorname{rank}(A)$ ?

Level AEquation interpretationch10-A3

Where does e_2 land?

For $A = \begin{bmatrix} 2 & -1 \\ 1 & 3 \end{bmatrix}$ , compute $A\mathbf{e}_2$ where $\mathbf{e}_2 = (0,1)$ . Enter as x, y.

Level AHand calculationch10-A4

Rank–nullity arithmetic

A matrix $A$ has 5 columns and $\operatorname{rank}(A) = 3$ . What is $\dim \operatorname{Null}(A)$ ?

Level AEquation interpretationch10-A5

Dimension of a span

What is the dimension of $\operatorname{span}\{(1,0,0),\ (0,1,0),\ (1,1,0)\}$ in $\mathbb{R}^3$ ?

Level AEquation interpretationch10-A6

Definition of the null space

The null space of $A$ is the set of vectors $\mathbf{x}$ satisfying which equation?

Level BShape reasoningch10-B1

Column space vs. null space — which space?

Let $A$ be a $3 \times 5$ matrix. The column space $\operatorname{Col}(A)$ is a subspace of which space, and the null space $\operatorname{Null}(A)$ is a subspace of which space?

Level BShape reasoningch10-B2

Maximum possible rank

What is the largest value $\operatorname{rank}(A)$ can take for a $4 \times 7$ matrix $A$ ?

Level BEquation interpretationch10-B3

Independent but not orthogonal

True or false: 'If two vectors are linearly independent, they must be orthogonal.' Enter 1 for true or 0 for false, and be ready to justify.

Level BML applicationch10-B4

Why low rank loses information

A linear layer $\mathbf{y} = W\mathbf{x}$ has $W \in \mathbb{R}^{1000 \times 1000}$ but $\operatorname{rank}(W) = 10$ . In one or two sentences, explain what this implies about the outputs the layer can produce and about information lost from the input.

Level BEquation interpretationch10-B5

When does Ax = b have a solution?

For a fixed matrix $A$ , the system $A\mathbf{x} = \mathbf{b}$ has at least one solution exactly when:

Level CNumPy implementationch10-C1

Compute rank and test independence in NumPy

Build the matrix with columns $(1,2,3)$ , $(2,4,6)$ , and $(0,1,0)$ using np.column_stack. Print its rank with np.linalg.matrix_rank, decide whether the three columns are independent (via rank == n_cols), and print ok.
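
A minimal sketch, with `independent` as an illustrative variable name. Note that the second column is twice the first, which is what the rank test should detect.

```python
import numpy as np

# Each tuple becomes one column of A.
A = np.column_stack([(1, 2, 3), (2, 4, 6), (0, 1, 0)])
rank = np.linalg.matrix_rank(A)
n_cols = A.shape[1]
independent = (rank == n_cols)
print(rank, independent)
print("ok")
```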

Level CDerivationch10-C2

Find a basis for a column space

The matrix $A$ has columns $\mathbf{a}_1 = (1,1,0)$ , $\mathbf{a}_2 = (2,2,0)$ , $\mathbf{a}_3 = (0,0,1)$ . Identify a basis for $\operatorname{Col}(A)$ and state $\operatorname{rank}(A)$ .

Level CDerivationch10-C3

Find a null-space vector by reasoning

For $A = \begin{bmatrix} 1 & 2 & 3 \\ 0 & 1 & 1 \end{bmatrix}$ , find a nonzero vector $\mathbf{x} = (x_1, x_2, x_3)$ with $A\mathbf{x} = \mathbf{0}$ , and explain why the null space must be nonzero here.

Level DPaper-reading practicech10-D1

Why low-rank adapters work (LoRA)

A LoRA adapter replaces a weight update $\Delta W \in \mathbb{R}^{d\times d}$ with a product $BA$ where $B \in \mathbb{R}^{d\times r}$ , $A \in \mathbb{R}^{r\times d}$ , and $r \ll d$ . (a) Prove $\operatorname{rank}(BA) \le r$ . (b) Count the parameters saved for $d = 4096, r = 8$ . (c) State the empirical hypothesis that makes this a good trade, and one situation where it would fail.
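
The arithmetic for part (b) can be sketched as follows, assuming the full update is a dense $d \times d$ matrix and counting only the entries of $B$ and $A$ (no biases or scaling factors).

```python
d, r = 4096, 8

full_params = d * d          # dense d x d update
lora_params = d * r + r * d  # B is (d, r), A is (r, d)
saved = full_params - lora_params

print(full_params, lora_params, saved, full_params // lora_params)
# prints: 16777216 65536 16711680 256
```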

Level DML applicationch10-D2

Rank collapse through stacked layers

Consider a purely linear network $\mathbf{y} = W_3 W_2 W_1 \mathbf{x}$ with each $W_i \in \mathbb{R}^{d\times d}$ . (a) If $\operatorname{rank}(W_2) = k < d$ , what can you say about $\operatorname{rank}(W_3 W_2 W_1)$ ? (b) What does this imply about information flowing through the network, and (c) why do real networks insert nonlinearities between layers?

Level AHand calculationch11-A1

Verify an eigenvector

Let $A = \begin{bmatrix} 4 & 1 \\ 2 & 3 \end{bmatrix}$ and $\mathbf{v} = (1, 1)$ . Compute $A\mathbf{v}$ ; it should equal $\lambda\mathbf{v}$ for some scalar. Enter the eigenvalue $\lambda$ .

Level AHand calculationch11-A2

Eigenvalues of a triangular matrix

Give the larger eigenvalue of the upper-triangular matrix $A = \begin{bmatrix} 5 & 2 \\ 0 & 1 \end{bmatrix}$ .

Level AHand calculationch11-A3

Sum of eigenvalues = trace

Without solving for them individually, give the sum of the two eigenvalues of $A = \begin{bmatrix} 6 & 2 \\ 2 & 3 \end{bmatrix}$ .

Level AHand calculationch11-A4

Explained variance ratio

A 2-D PCA yields covariance eigenvalues $\lambda_1 = 8$ and $\lambda_2 = 2$ . What fraction of the total variance is explained by the first principal component?

Level AEquation interpretationch11-A5

Eigenvectors of a symmetric matrix

A covariance matrix is symmetric. What is guaranteed about the eigenvectors belonging to its distinct eigenvalues?

Level AEquation interpretationch11-A6

What an eigenvalue means in PCA

In PCA, the eigenvalue $\lambda_k$ of the covariance matrix (for principal component $k$ ) equals which quantity?

Level BEquation interpretationch11-B1

Geometric meaning of an eigenvector

Which statement best describes an eigenvector $\mathbf{v}$ of a matrix $A$ geometrically?

Level BML applicationch11-B2

Why center before PCA?

Explain in a sentence or two why the data must be centered (mean subtracted) before forming the covariance matrix for PCA. What can the first component pick up if you forget?

Level BError identificationch11-B3

Eigenvector sign ambiguity

You run the same PCA twice. The first principal component returns as $(0.71, 0.71)$ one time and $(-0.71, -0.71)$ the next. Which explanation is correct?

Level BShape reasoningch11-B4

Shape of the covariance matrix

Your data matrix $X$ has shape $(n, d) = (1000, 50)$ — 1000 samples, 50 features. What is the shape of the covariance matrix $\Sigma = \frac{1}{n} X_c^\top X_c$ ?

Level BML applicationch11-B5

Zero covariance vs independence

The off-diagonal entry of a covariance matrix is $0$ . Explain what this does and does not guarantee about the two features, and give a concrete example where covariance is zero yet the features are perfectly dependent.

Level CHand calculationch11-C1

Eigenvalues from the characteristic equation

Use the characteristic equation $\det(A - \lambda I) = 0$ to find both eigenvalues of $A = \begin{bmatrix} 4 & 2 \\ 1 & 3 \end{bmatrix}$ . Enter them as larger, smaller.

Level CNumPy implementationch11-C2

Implement PCA projection from scratch

Write pca_project(X, k) that centers X (shape (n, d)), forms the covariance $\frac{1}{n} X_c^\top X_c$ , eigen-decomposes it with np.linalg.eigh, sorts components by descending eigenvalue, and returns the projection of the centered data onto the top k components (shape (n, k)). Test on a correlated 2-D dataset with k=1, assert the projected variance equals the top eigenvalue, and print ok.
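
One possible sketch, under stated assumptions: the correlated test data (200 points, seed 0, slope 2 plus small noise) is invented for illustration, and the top eigenvalue is recomputed outside the function with `np.linalg.eigvalsh` so that `pca_project` can return only the projection.

```python
import numpy as np

def pca_project(X, k):
    Xc = X - X.mean(axis=0)                 # center each feature
    cov = Xc.T @ Xc / X.shape[0]            # (d, d), using the 1/n convention
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]       # indices in descending order
    return Xc @ eigvecs[:, order[:k]]       # (n, k)

rng = np.random.default_rng(0)  # arbitrary fixed seed
t = rng.normal(size=200)
X = np.column_stack([t, 2.0 * t + 0.3 * rng.normal(size=200)])

Z = pca_project(X, k=1)
Xc = X - X.mean(axis=0)
top_eigval = np.linalg.eigvalsh(Xc.T @ Xc / X.shape[0])[-1]  # largest eigenvalue

assert Z.shape == (200, 1)
assert np.isclose(Z.var(), top_eigval)  # np.var defaults to the same 1/n convention
print("ok")
```

The variance check relies on both the covariance and `np.var` dividing by $n$; mixing $1/n$ with $1/(n-1)$ would make the assertion fail by a factor of $n/(n-1)$.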

Level CDerivationch11-C3

Derive: top eigenvector maximizes variance

Show that the unit direction $\mathbf{w}$ maximizing the projected variance $\mathbf{w}^\top \Sigma \mathbf{w}$ subject to $\lVert\mathbf{w}\rVert^2 = 1$ is an eigenvector of $\Sigma$ , and that the maximum value equals the largest eigenvalue.

Level CNumPy implementationch11-C4

Explained-variance ratio in NumPy

Given eigenvalues of a covariance matrix, write code that computes the cumulative explained-variance ratio and returns the smallest number of components needed to retain at least 90% of the variance. Test on eigenvalues [10.0, 4.0, 1.0, 0.5, 0.5], assert the answer is 3, and print ok.
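
A sketch of the cumulative-ratio computation, with `components_needed` and its `threshold` parameter as hypothetical names. It assumes the threshold is strictly below 1, so that at least one cumulative ratio reaches it.

```python
import numpy as np

def components_needed(eigvals, threshold=0.90):
    vals = np.sort(np.asarray(eigvals, dtype=float))[::-1]  # descending
    cum_ratio = np.cumsum(vals) / vals.sum()
    # argmax on a boolean array returns the first True position.
    return int(np.argmax(cum_ratio >= threshold)) + 1

k = components_needed([10.0, 4.0, 1.0, 0.5, 0.5])
assert k == 3  # cumulative ratios: 0.625, 0.875, 0.9375, ...
print("ok")
```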

Level DPaper-reading practicech11-D1

When does PCA fail?

PCA finds the best linear subspace. Describe a concrete dataset whose true structure is 1-dimensional but which PCA cannot compress to one component without large error, explain geometrically why PCA fails, and name one family of methods designed to handle it.

Level DPaper-reading practicech11-D2

Max variance is not always max usefulness

PCA keeps the directions of largest variance. Argue, with a concrete scenario, why the highest-variance direction can be the wrong thing to keep for a downstream classification task, and contrast PCA's objective with what a supervised method (e.g. LDA) optimizes instead.

Level AHand calculationch13-A1

Limit by direct substitution

Evaluate $\displaystyle\lim_{x \to 2} (3x + 1)$ . (The function is a polynomial, so it is continuous everywhere.)

Level AHand calculationch13-A2

A 0/0 limit by factoring

Evaluate $\displaystyle\lim_{x \to 3} \frac{x^2 - 9}{x - 3}$ . (Substitution gives $0/0$ ; factor first.)

Level AEquation interpretationch13-A3

One-sided limits and existence

For a piecewise function, $\lim_{x\to a^-} f(x) = 4$ and $\lim_{x\to a^+} f(x) = 4$ , but $f(a) = 9$ . What is $\lim_{x\to a} f(x)$ ?

Level AHand calculationch13-A4

Difference quotient of a linear function

For $f(x) = 5x + 2$ , the difference quotient $\frac{f(a+h)-f(a)}{h}$ simplifies to a constant. What is $f'(a)$ (the limit as $h\to 0$ )?

Level AEquation interpretationch13-A5

Classify the discontinuity

A step function has $\lim_{x\to a^-} f(x) = 0$ and $\lim_{x\to a^+} f(x) = 1$ . Which type of discontinuity is at $a$ ?

Level AEquation interpretationch13-A6

Continuity of ReLU at zero

Is $\mathrm{ReLU}(x) = \max(0, x)$ continuous at $x = 0$ ? Enter $1$ for yes, $0$ for no.

Level BEquation interpretationch13-B1

Limit vs. value

Which statement best captures why $\lim_{x\to a} f(x)$ can exist even when $f(a)$ is undefined?

Level BML applicationch13-B2

Why ReLU is not differentiable at 0

Explain, in terms of one-sided limits of the difference quotient, why $\mathrm{ReLU}$ is not differentiable at $0$ even though it is continuous there. What do frameworks do at that point?

Level BEquation interpretationch13-B3

The indeterminate form in the derivative

At $h = 0$ , the difference quotient $\frac{f(a+h)-f(a)}{h}$ has which form, and how is a finite derivative recovered from it?

Level BML applicationch13-B4

Why the finite-difference step can't be too small

A colleague validates a gradient with $\frac{f(a+h)-f(a)}{h}$ and reasons: 'smaller $h$ is always closer to the true limit, so I'll use $h = 10^{-15}$ .' Explain why this makes the numerical estimate worse, not better.

Level CDerivationch13-C1

Derive the derivative of a cubic

Using the limit definition $f'(a) = \lim_{h\to 0}\frac{f(a+h)-f(a)}{h}$ , derive $f'(a)$ for $f(x) = x^3$ . Show the $\frac{0}{0}$ cancellation explicitly.

Level CNumPy implementationch13-C2

Numerically approach a limit

Write numeric_derivative(f, a, h) returning the forward difference quotient $\frac{f(a+h)-f(a)}{h}$ . For $f(x)=x^2$ at $a=3$ , print the estimate for $h = 10^{-1}, 10^{-2}, 10^{-4}, 10^{-6}$ , assert the $h=10^{-6}$ estimate is within $10^{-3}$ of the exact value $6$ , then print ok.

Level CNumPy implementationch13-C3

Detect a discontinuity numerically

For the step function $f(x) = 0$ if $x < 0$ else $1$ , estimate the left and right limits at $a=0$ by evaluating $f$ at $a-h$ and $a+h$ for a shrinking sequence of $h$ . Print both estimates, assert they differ (confirming a jump discontinuity), and print ok.

Level DPaper-reading practicech13-D1

Subgradients and where non-differentiability bites

Deep networks are trained with gradient descent, yet ReLU, max-pooling, and the L1 penalty $|w|$ are all non-differentiable at isolated points. Explain why training still works (invoke measure-zero and subgradients), then give one concrete situation where a non-differentiable point does cause real trouble and how practitioners mitigate it.

Level AHand calculationch14-A1

Differentiate a polynomial, evaluate at a point

Let $f(x) = 3x^4 - 5x^2 + 7$ . Using the power, sum, and constant-multiple rules, compute $f'(1)$ .

Level AHand calculationch14-A2

Power rule at a point

Let $f(x) = x^5$ . Compute $f'(2)$ .

Level AHand calculationch14-A3

Derivative of the exponential

Let $f(x) = e^x$ . Compute $f'(0)$ . (Recall $e^0 = 1$ .)

Level AHand calculationch14-A4

Derivative of the natural log

Let $f(x) = \ln x$ . Compute $f'(2)$ .

Level AHand calculationch14-A5

Constant multiple rule

Let $f(x) = 4x^3$ . Compute $f'(2)$ .

Level AEquation interpretationch14-A6

Derivative of a constant

What is $\frac{d}{dx}(7)$ — the derivative of the constant function $f(x) = 7$ ?

Level AHand calculationch14-A7

Power rule with a fractional exponent

Let $f(x) = \sqrt{x} = x^{1/2}$ . Compute $f'(4)$ .

Level BHand calculationch14-B1

Apply the product rule

Let $h(x) = x^2 e^x$ . Using the product rule, compute $h'(1)$ . (Use $e \approx 2.71828$ .)

Level BHand calculationch14-B2

Apply the quotient rule

Let $h(x) = \dfrac{x}{x+1}$ . Using the quotient rule, compute $h'(1)$ .

Level BML applicationch14-B3

Sensitivity and the descent direction

A loss $L(w)$ has derivative $\frac{dL}{dw}\big|_{w=2} = +3$ . To decrease the loss, should you increase or decrease $w$ , and roughly how much does $L$ change if you nudge $w$ by $-0.01$ ?

Level BHand calculationch14-B4

A second derivative

Let $f(x) = x^3$ . Compute the second derivative $f''(2)$ , and state whether the curve is locally cupping upward or downward there.

Level BEquation interpretationch14-B5

Why central beats forward

The forward difference $\frac{f(x+h) - f(x)}{h}$ has error $O(h)$ ; the central difference $\frac{f(x+h) - f(x-h)}{2h}$ has error $O(h^2)$ . For a small $h$ , which is more accurate, and by roughly what factor does the central-difference error shrink when you halve $h$ ?

Level CDerivationch14-C1

Differentiate x³ from first principles

Using only the limit definition $f'(x) = \lim_{h\to0}\frac{f(x+h)-f(x)}{h}$ , derive the derivative of $f(x) = x^3$ . Show the cancellation of $h$ before taking the limit.

Level CNumPy implementationch14-C2

Implement the central-difference derivative

Implement numerical_derivative(f, x, h=1e-5) using the central difference $\frac{f(x+h)-f(x-h)}{2h}$ . Verify it against the analytic derivative for $f(x) = x^2$ (which is $2x$ ) and $f(x) = e^x$ at a couple of points using np.isclose, then print ok.
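
A minimal sketch; the test points 0.5 and 2.0 are arbitrary choices standing in for "a couple of points".

```python
import numpy as np

def numerical_derivative(f, x, h=1e-5):
    # Central difference: symmetric around x, truncation error O(h^2).
    return (f(x + h) - f(x - h)) / (2 * h)

for x in (0.5, 2.0):
    assert np.isclose(numerical_derivative(lambda t: t**2, x), 2 * x)
    assert np.isclose(numerical_derivative(np.exp, x), np.exp(x))
print("ok")
```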

Level CDerivationch14-C3

Derive the quotient rule from the product rule

Assuming the product rule $(fg)' = f'g + fg'$ and the chain-rule fact $\frac{d}{dx}\big[g(x)^{-1}\big] = -g^{-2}g'$ , derive the quotient rule $\left(\frac{f}{g}\right)' = \frac{f'g - fg'}{g^2}$ .

Level CNumerical experimentch14-C4

Sweep h to find the error minimum

For $f(x) = e^x$ at $x = 1$ , compute the central-difference error $\big|\text{numeric} - e\big|$ across $h = 10^{-1}, 10^{-2}, \ldots, 10^{-12}$ . Print each error, confirm with an assert that the minimum occurs at an interior $h$ (not the smallest one), then print ok.
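
One way to set up the sweep, with `central` as an illustrative helper name. The exact errors printed depend on floating-point details, so no specific values are claimed here; the assertion only checks that the best $h$ is neither the largest nor the smallest tried.

```python
import numpy as np

def central(f, x, h):
    return (f(x + h) - f(x - h)) / (2 * h)

hs = [10.0 ** (-p) for p in range(1, 13)]  # 1e-1 down to 1e-12
errors = [abs(central(np.exp, 1.0, h) - np.e) for h in hs]

for h, err in zip(hs, errors):
    print(f"h={h:.0e}  error={err:.3e}")

best = int(np.argmin(errors))
assert 0 < best < len(hs) - 1  # minimum is interior: truncation vs. round-off trade-off
print("ok")
```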

Level DPaper-reading practicech14-D1

Designing a gradient check

You want to gradient-check a hand-written backprop by comparing its analytic gradient $g_a$ against a numerical gradient $g_n$ . (a) Why do practitioners compare the relative error $\frac{|g_a - g_n|}{|g_a| + |g_n| + \epsilon}$ rather than the absolute error? (b) Why use the central difference rather than the forward difference here? (c) Name one failure mode where a correct analytic gradient still fails a naive gradient check, and how to avoid it.

Level DPaper-reading practicech14-D2

Why h ≈ cube-root of machine epsilon

For the central difference, the total error is roughly $E(h) \approx C_1 h^2 + \frac{C_2 \varepsilon}{h}$ , where the first term is truncation and the second is round-off ( $\varepsilon$ is machine epsilon, $\approx 2.2\times10^{-16}$ ). Minimize $E$ over $h$ to explain why the optimal step is on the order of $\varepsilon^{1/3} \approx 10^{-5}$ , and state what that makes the best achievable error.

Level AHand calculationch15-A1

Chain rule at a point

Let $y = (3x + 2)^2$ . Use the chain rule to compute $\dfrac{dy}{dx}$ at $x = 1$ .

Level AHand calculationch15-A2

Forward pass of the graph

For $y = (wx + b)^2$ decomposed as $u = wx$ , $z = u + b$ , $y = z^2$ , run the forward pass with $w = 1$ , $x = 4$ , $b = -2$ and report $y$ .

Level AHand calculationch15-A3

Local derivative of the square node

The square node computes $y = z^2$ . What is its local derivative $\dfrac{dy}{dz}$ evaluated at $z = 5$ ?

Level AHand calculationch15-A4

Gradient with respect to the bias

For $y = (wx + b)^2$ , the backward pass gives $\dfrac{dy}{db} = 2(wx+b)$ . Evaluate it at $w = 2$ , $x = 3$ , $b = 1$ .

Level AHand calculationch15-A5

Gradient through an add node

In the graph $z = u + b$ , the gradient arriving at $z$ is $\dfrac{dy}{dz} = 6$ . What gradient does the add node pass back to $u$ ?

Level AHand calculationch15-A6

Gradient through a multiply node

In the graph $u = wx$ , the gradient arriving at $u$ is $\dfrac{dy}{du} = 14$ and $x = 3$ . What gradient does the multiply node pass back to $w$ ?

Level BEquation interpretationch15-B1

Why local derivatives multiply

For $y = f(g(x))$ with $u = g(x)$ , why is $\dfrac{dy}{dx} = \dfrac{dy}{du}\cdot\dfrac{du}{dx}$ a product of the two local rates rather than a sum?

Level BHand calculationch15-B2

Backward pass with new inputs

For $y = (wx + b)^2$ run forward with $w = -1$ , $x = 2$ , $b = 3$ , then use the backward pass to compute $\dfrac{dy}{dw}$ . (Recall $\dfrac{dy}{dw} = 2(wx+b)\,x$ .)

Level BML applicationch15-B3

Diagnosing a vanishing gradient

A 20-layer network uses an activation whose local derivative is at most $0.25$ everywhere. Roughly what happens to the gradient reaching the first layer, and why?

Level BML applicationch15-B4

Why cache the intermediates?

Backpropagation stores intermediate values (like $z$ in $y = z^2$ ) during the forward pass instead of recomputing them during the backward pass. In one or two sentences, explain why this matters for cost.

Level BEquation interpretationch15-B5

A variable used on two paths

In a graph, the variable $x$ feeds two different downstream nodes. When you run the backward pass, how do you combine the gradients arriving at $x$ from the two paths?

Level CNumPy implementationch15-C1

Implement backprop for (wx+b)²

Implement forward(w, x, b) returning $y=(wx+b)^2$ plus a cache, and backward(cache) returning $(\partial y/\partial w,\ \partial y/\partial x,\ \partial y/\partial b)$ by multiplying local derivatives back through $u=wx$ , $z=u+b$ , $y=z^2$ . Verify at $w=2,x=3,b=1$ that the gradients are $(42, 28, 14)$ , then print ok.
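
A sketch of the forward and backward passes for this graph. The choice to cache the tuple `(w, x, z)` is an assumption; any cache holding what the local derivatives need would work.

```python
def forward(w, x, b):
    u = w * x
    z = u + b
    y = z ** 2
    cache = (w, x, z)  # values the backward pass will need
    return y, cache

def backward(cache):
    w, x, z = cache
    dy_dz = 2 * z        # square node: d(z^2)/dz
    dy_du = dy_dz * 1.0  # add node passes the gradient through unchanged
    dy_db = dy_dz * 1.0
    dy_dw = dy_du * x    # multiply node: d(wx)/dw = x
    dy_dx = dy_du * w    # multiply node: d(wx)/dx = w
    return dy_dw, dy_dx, dy_db

y, cache = forward(2.0, 3.0, 1.0)
grads = backward(cache)
assert y == 49.0
assert grads == (42.0, 28.0, 14.0)
print("ok")
```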

Level CDerivationch15-C2

Backprop a three-input product

Consider $y = (ab + c)^2$ with graph $u = ab$ , $z = u + c$ , $y = z^2$ . Derive $\dfrac{\partial y}{\partial a}$ , $\dfrac{\partial y}{\partial b}$ , and $\dfrac{\partial y}{\partial c}$ by the backward pass, then evaluate them at $a = 2$ , $b = 3$ , $c = -1$ .

Level CNumPy implementationch15-C3

Finite-difference gradient check

Write a central finite-difference checker for $y = (wx+b)^2$ . Use $\dfrac{\partial y}{\partial \theta} \approx \dfrac{y(\theta+\varepsilon) - y(\theta-\varepsilon)}{2\varepsilon}$ with $\varepsilon = 10^{-6}$ to approximate $\partial y/\partial w$ , $\partial y/\partial x$ , $\partial y/\partial b$ at $w=2,x=3,b=1$ , assert they match the analytic gradients $(42,28,14)$ with np.allclose, then print ok.

Level CNumerical experimentch15-C4

Simulate a vanishing-gradient product

The gradient at the first of $L$ layers is a product of $L$ local derivatives. Numerically compute the product of $L = 50$ factors each equal to $0.5$ , and separately each equal to $1.1$ , print both, and assert the first is below $10^{-10}$ and the second is above $100$ . End by printing ok.
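
A minimal sketch of the experiment, with `shrinking` and `growing` as illustrative names for the two products.

```python
import numpy as np

L = 50
shrinking = np.prod(np.full(L, 0.5))  # 0.5 ** 50
growing = np.prod(np.full(L, 1.1))    # 1.1 ** 50

print(shrinking, growing)
assert shrinking < 1e-10
assert growing > 100
print("ok")
```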

Level DPaper-reading practicech15-D1

Why do gradients vanish, and how is it fixed?

Explain, using the chain rule, why deep networks with sigmoid activations suffer vanishing gradients. Then explain mechanistically how two of {ReLU activation, residual (skip) connections, batch/layer normalization} address it — pointing to what each does to the per-layer local derivative.

Level DPaper-reading practicech15-D2

Reverse-mode vs forward-mode autodiff

Backpropagation is reverse-mode automatic differentiation. For a function $f:\mathbb{R}^n \to \mathbb{R}$ (many inputs, scalar loss — the neural-network case), explain why reverse mode computes all $n$ input gradients in roughly one backward sweep, whereas forward-mode autodiff would need about $n$ passes. What does this asymmetry cost, and when would forward mode actually be preferable?

Level AHand calculationch17-A1

Gradient of the worked example

For $f(x, y) = x^2 + xy + y^2$ we found $\nabla f = (2x + y,\ x + 2y)$ . Evaluate the gradient at the point $(0, 3)$ . Enter as x, y.

Level AHand calculationch17-A2

A single partial derivative

Let $f(x, y) = 3x^2 + 2y$ . Compute $\dfrac{\partial f}{\partial x}$ and evaluate it at $x = 2$ (its value does not depend on $y$ ).

Level AHand calculationch17-A3

Partial of a product with powers

Let $f(x, y) = x^2 y^3$ . Compute $\dfrac{\partial f}{\partial y}$ and evaluate it at the point $(2, 1)$ .

Level AHand calculationch17-A4

Gradient of a linear function

Let $f(x, y, z) = x + 2y + 3z$ . Give the gradient $\nabla f$ . Enter as x, y, z.

Level AHand calculationch17-A5

The held variable is a constant, not zero

Let $f(x, y) = xy$ . Compute $\dfrac{\partial f}{\partial x}$ and evaluate it at the point $(5, 7)$ .

Level AEquation interpretationch17-A6

What does the gradient point toward?

At a given point, the gradient $\nabla f$ points in the direction of what?

Level BGraph interpretationch17-B1

Gradient and contour lines

On a contour plot of $f$ , what is the angle between the gradient $\nabla f$ at a point and the contour line passing through that same point?

Level BEquation interpretationch17-B2

The sign in the descent step

Gradient descent updates parameters as $\boldsymbol\theta \leftarrow \boldsymbol\theta - \eta\,\nabla L$ . Why is there a minus sign?

Level BShape reasoningch17-B3

Shape of the loss gradient

A model has $p$ parameters collected in $\boldsymbol\theta \in \mathbb{R}^p$ , and $L(\boldsymbol\theta)$ is a scalar loss. What is the shape of $\nabla L$ ?

Level BML applicationch17-B4

Reading a single loss partial

During training you compute $\dfrac{\partial L}{\partial \theta_j} = -0.8$ for one parameter. In one or two sentences, say what this tells you about $\theta_j$ , and which way gradient descent will move it.

Level BError identificationch17-B5

Spot the differentiation error

A student differentiates $f(x, y) = x^2 + xy + y^2$ and writes $\dfrac{\partial f}{\partial x} = 2x$ , reasoning that the $xy$ and $y^2$ terms 'have $y$ in them, so they're zero.' What did they get wrong, and what is the correct $\partial f/\partial x$ ?

Level CNumPy implementationch17-C1

Implement a numerical gradient

Write numerical_gradient(f, v, h=1e-5) that approximates $\nabla f$ at a point v using central differences, one coordinate at a time. Test it on $f(x, y) = x^2 + xy + y^2$ at $(1, 2)$ , confirm it matches the analytic answer $(4, 5)$ with np.allclose, and print ok.
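
One possible sketch. It assumes `f` accepts a 1-D NumPy array and returns a scalar, and it perturbs one coordinate at a time with a temporary step vector `e`.

```python
import numpy as np

def numerical_gradient(f, v, h=1e-5):
    v = np.asarray(v, dtype=float)
    grad = np.zeros_like(v)
    for i in range(v.size):
        e = np.zeros_like(v)
        e[i] = h  # step along coordinate i only
        grad[i] = (f(v + e) - f(v - e)) / (2 * h)
    return grad

def f(v):
    x, y = v
    return x**2 + x * y + y**2

g = numerical_gradient(f, [1.0, 2.0])
assert np.allclose(g, [4.0, 5.0])
print("ok")
```

Each coordinate costs two evaluations of `f`, so the full gradient costs $2p$ evaluations for $p$ parameters.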

Level CDerivationch17-C2

Why steepest ascent

Assume the directional derivative in a unit direction $\mathbf{u}$ is $D_{\mathbf{u}} f = \nabla f \cdot \mathbf{u}$ . Using the geometric form of the dot product, show that $f$ increases fastest when $\mathbf{u}$ points along $\nabla f$ .

Level CHand calculationch17-C3

One gradient-descent step

Minimize $f(x, y) = x^2 + xy + y^2$ by gradient descent, starting at $\boldsymbol\theta_0 = (1, 2)$ with learning rate $\eta = 0.1$ . Recall $\nabla f = (2x + y,\ x + 2y)$ . Compute $\boldsymbol\theta_1 = \boldsymbol\theta_0 - \eta\,\nabla f(\boldsymbol\theta_0)$ . Enter as x, y.

Level DPaper-reading practicech17-D1

Numerical gradients vs. backprop

The central-difference numerical gradient is simple and always available, yet no one trains large networks with it. Explain the cost of the numerical approach for a model with $p$ parameters, why backpropagation is dramatically cheaper, and one situation where the numerical gradient is still the right tool.

Level DPaper-reading practicech17-D2

How big should the step be?

Gradient descent moves $\boldsymbol\theta \leftarrow \boldsymbol\theta - \eta\,\nabla L$ . The gradient gives the best direction, but not how far to go. Describe qualitatively what goes wrong when the learning rate $\eta$ is far too large, and what goes wrong when it is far too small. Why can't we just read the ideal step size off the gradient itself?

Level AHand calculationch16-A1

Find the critical point

Find the critical point of $f(x) = x^2 - 6x + 5$ by solving $f'(x) = 0$ .

Level AHand calculationch16-A2

One gradient-descent step by hand

Minimize $L(x) = x^2$ with gradient descent. Starting from $x = 3$ with learning rate $\eta = 0.1$ , compute the value of $x$ after one update $x \leftarrow x - \eta\, L'(x)$ .

Level AEquation interpretationch16-A3

Classify with the second derivative

At a critical point $x^*$ a function satisfies $f'(x^*) = 0$ and $f''(x^*) = 2$ . What kind of point is it?

Level AHand calculationch16-A4

The update factor for a quadratic

For $L(x) = x^2$ , one gradient-descent step is $x \leftarrow x - \eta(2x) = (1 - 2\eta)\,x$ . Compute the multiplicative factor $1 - 2\eta$ for $\eta = 0.25$ .

Level AEquation interpretationch16-A5

Name the learning rate

In the gradient-descent update $\theta \leftarrow \theta - \eta\,\nabla L(\theta)$, which symbol is the learning rate?

Level AHand calculationch16-A6

Evaluate a gradient

For the loss $L(\theta) = \theta^2$, compute the gradient $\nabla L(\theta) = L'(\theta)$ at $\theta = 4$.

Level BEquation interpretationch16-B1

Diagnose a diverging run

You train a model and the loss increases every step until it prints NaN. Of the following, which single change is the most likely fix?

Level BEquation interpretationch16-B2

What convexity guarantees

The loss $L$ is convex (a single bowl). Which statement is true about running gradient descent with a small enough learning rate?

Level BEquation interpretationch16-B3

Why $f' = 0$ is not enough

Gradient descent slows to a near-stop because the gradient is almost zero, yet the loss is still high. Which explanation is consistent with this?

Level BML applicationch16-B4

The too-small learning rate

A colleague sets $\eta = 10^{-6}$ 'to be safe' and reports that after a full day of training the loss has barely moved, though it is slowly decreasing. In one or two sentences, explain what regime this is and the practical trade-off of a very small $\eta$.

Level BEquation interpretationch16-B5

Why the minus sign?

The update is $\theta \leftarrow \theta - \eta\,\nabla L(\theta)$; note the minus sign. Why do we step along $-\nabla L$ rather than $+\nabla L$?

Level CNumPy implementationch16-C1

Implement 1-D gradient descent

Write gradient_descent(grad, x0, eta, steps) that runs gradient descent in 1-D and returns the final $x$. Use it to minimize $L(x) = (x - 3)^2$ (whose gradient is $2(x-3)$) starting from $x_0 = 0$ with $\eta = 0.1$ for 200 steps, assert the result is within $10^{-3}$ of $3$, then print ok.
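One possible shape for such a function is sketched below in plain Python. The signature follows the exercise; the variable x_final is an illustrative name, and other implementations (for example, one that also records the trajectory) are equally valid.

```python
def gradient_descent(grad, x0, eta, steps):
    """Apply x <- x - eta * grad(x) for `steps` iterations and return the final x."""
    x = float(x0)
    for _ in range(steps):
        x = x - eta * grad(x)
    return x

# L(x) = (x - 3)^2 has gradient 2 * (x - 3)
x_final = gradient_descent(lambda x: 2 * (x - 3), x0=0.0, eta=0.1, steps=200)
assert abs(x_final - 3) < 1e-3
print("ok")
```

Passing the gradient as a callable keeps the loop independent of the particular loss, so the same function can be reused for other 1-D problems.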

Level CDerivationch16-C2

Derive the convergence condition

For $L(x) = x^2$, gradient descent is $x_{t+1} = x_t - \eta(2x_t) = (1 - 2\eta)x_t$. Derive the exact range of learning rates $\eta > 0$ for which $x_t \to 0$, and identify the value of $\eta$ that converges fastest.

Level CDerivationch16-C3

Minimizer of a general quadratic

Using the first- and second-derivative tests, derive the location $\theta^*$ of the minimum of $L(\theta) = a\theta^2 + b\theta + c$ with $a > 0$, and confirm it is a minimum.

Level CNumerical experimentch16-C4

Experiment: convergence vs divergence

Minimize $L(x) = x^2$ from $x_0 = 5$. Run gradient descent for 20 steps with $\eta = 0.3$ and again with $\eta = 1.2$. Print each final $x$, assert that the first converges (near $0$) while the second diverges (magnitude above $1000$), then print ok.
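A minimal sketch of this experiment in plain Python follows. The helper run_gd and the tolerance 1e-3 for "near 0" are assumptions made for illustration; the exercise itself only fixes the learning rates, the start point, and the step count.

```python
def run_gd(eta, x0=5.0, steps=20):
    """Gradient descent on L(x) = x^2, whose gradient is 2x."""
    x = x0
    for _ in range(steps):
        x = x - eta * 2 * x   # equivalent to x = (1 - 2 * eta) * x
    return x

x_small = run_gd(0.3)   # per-step factor 1 - 0.6 = 0.4, shrinks toward 0
x_large = run_gd(1.2)   # per-step factor 1 - 2.4 = -1.4, grows and flips sign
print(x_small, x_large)

assert abs(x_small) < 1e-3
assert abs(x_large) > 1000
print("ok")
```

Each step multiplies x by the same factor, so the two runs differ only in whether that factor has magnitude below or above 1.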

Level DPaper-reading practicech16-D1

Non-convex, yet it works

Classical optimization theory guarantees gradient descent reaches the global minimum only for convex losses. Deep-network losses are highly non-convex, so we should expect descent to get trapped in bad local minima — yet in practice large models train well. Give the honest, current explanation for why, and state one concrete consequence for how practitioners think about training (e.g. what they do or do not worry about).

Level DPaper-reading practicech16-D2

Why saddles, not minima, dominate in high dimensions

A critical point ($\nabla L = 0$) is a local minimum only if the curvature is positive in every direction. Using this fact, argue heuristically why, as the number of parameters $n$ grows, a random critical point is far more likely to be a saddle point than a local minimum. Then explain why this makes escaping such points a matter of finding a downhill direction rather than being truly stuck.

Level AShape reasoningch19-A1

Shape of a matrix product

A has shape (3, 4) and B has shape (4, 2). What is the shape of A @ B? Enter it as a tuple, e.g. (r, c).

Level AHand calculationch19-A2

Dot product by hand

Compute the dot product a @ b for a = np.array([1.0, 2.0, 3.0]) and b = np.array([4.0, 5.0, 6.0]).

Level AHand calculationch19-A3

Mean squared error by hand

For targets y = np.array([3.0, 5.0]) and predictions yhat = np.array([1.0, 4.0]), compute the mean squared error $\frac{1}{N}\sum_i (y_i - \hat y_i)^2$.

Level AShape reasoningch19-A4

Shape of a per-feature mean

X has shape (100, 4) — 100 examples, 4 features. What is the shape of X.mean(axis=0)? Enter as a tuple, e.g. (k,).

Level AShape reasoningch19-A5

The other axis

Same X of shape (100, 4). What is the shape of X.mean(axis=1)? Enter as a tuple.

Level AEquation interpretationch19-A6

Which expression is the dot product?

For two 1-D arrays a and b of equal length, which expression computes their dot product $\sum_i a_i b_i$ as a single scalar?

Level BShape reasoningch19-B1

Shape of a distance matrix

X holds $N$ points as rows, shape (N, D). What is the shape of the pairwise Euclidean distance matrix $D_{\text{mat}}[i,j] = \lVert \mathbf{x}_i - \mathbf{x}_j \rVert$?

Level BDebuggingch19-B2

Debug a bias-shaped layer

In a batched layer, X is (N, D), W is (H, D), and an engineer writes b = np.zeros((H, 1)), then Y = X @ W.T + b. The output shape is not (N, H) as expected. What shape does Y actually have, why, and what is the one-character-idea fix?

Level BShape reasoningch19-B3

Shape of the difference block

X has shape (6, 3). What is the shape of X[:, None, :] - X[None, :, :] (the broadcast used to build a distance matrix)? Enter as a tuple, e.g. (a, b, c).
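The broadcast in this expression can be inspected directly in NumPy. In the sketch below the contents of X are arbitrary placeholder values; only the shapes matter.

```python
import numpy as np

X = np.arange(18, dtype=float).reshape(6, 3)   # placeholder data, shape (6, 3)

left = X[:, None, :]    # a new axis inserted in the middle
right = X[None, :, :]   # a new axis inserted at the front
diff = left - right     # size-1 axes are stretched to match

print(left.shape, right.shape, diff.shape)
```

Entry diff[i, j] holds the vector X[i] - X[j], which is why summing its squares over the last axis gives squared pairwise distances.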

Level BML applicationch19-B4

Why guard the standard deviation?

Standardization computes Z = (X - mu) / sd with sd = X.std(axis=0). Explain in one or two sentences what happens if a feature column is constant, and how the guard sd = np.where(sd > 0, sd, 1.0) prevents it without distorting the data.
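The effect of the guard can be seen on a tiny array with one constant column. This is an illustrative sketch; the data values and the names Z_bad, sd_safe, and Z are made up for the example.

```python
import numpy as np

X = np.array([[1.0, 5.0],
              [2.0, 5.0],
              [3.0, 5.0]])        # second column is constant

mu = X.mean(axis=0)
sd = X.std(axis=0)                # second entry is exactly 0

# Without the guard the constant column becomes 0 / 0
with np.errstate(divide="ignore", invalid="ignore"):
    Z_bad = (X - mu) / sd

# With the guard, zero standard deviations are replaced by 1
sd_safe = np.where(sd > 0, sd, 1.0)
Z = (X - mu) / sd_safe

print(Z_bad[:, 1])   # not-a-number entries
print(Z[:, 1])       # zeros: the column is centered and left unscaled
```

Columns with nonzero spread are divided by their true standard deviation, so the guard changes only the degenerate column.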

Level BEquation interpretationch19-B5

Loop vs vectorized: what actually gets faster?

An engineer replaces a Python for-loop dot product with a @ b and sees a large speedup. Which statement best explains why?

Level CNumPy implementationch19-C1

Implement per-column standardization

Write standardize(X) for a batch X of shape (N, D) that returns (X - mu) / sd using per-column statistics (axis=0), guarding against zero standard deviation. Verify the output has zero mean and unit std per column, assert the output shape equals the input shape, then print ok.

Level CNumPy implementationch19-C2

Batched linear layer: loop vs vectorized

Implement layer_loop(X, W, b) with explicit loops and layer_vec(X, W, b) as X @ W.T + b, for X of shape (N, D), W of shape (H, D), and b of shape (H,). Use a fixed seed, assert both outputs are (N, H) and agree with np.allclose, then print ok.

Level CDebuggingch19-C3

Debug a matmul shape error

An engineer has inputs X of shape (N, D) and weights W of shape (H, D) and writes Y = X @ W. It raises ValueError: matmul: ... (N,D) and (H,D). Explain precisely why matmul rejects these shapes, give the corrected expression, and state the resulting shape.
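The failure and the fix can be reproduced with small placeholder arrays. The sizes N = 5, D = 4, H = 3 below are arbitrary choices for illustration.

```python
import numpy as np

N, D, H = 5, 4, 3
X = np.ones((N, D))
W = np.ones((H, D))

# matmul needs the last axis of X (length D) to match the first axis of W (length H)
try:
    X @ W
except ValueError as err:
    print("ValueError:", err)

# Transposing W gives shape (D, H), so the inner dimensions agree
Y = X @ W.T
print(Y.shape)
```

Note that if D happened to equal H, X @ W would run without error and silently compute the wrong quantity, so checking the output shape is still worthwhile.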

Level CNumPy implementationch19-C4

Vectorized Euclidean distance matrix

Implement dist_matrix(X) for X of shape (N, D) returning the (N, N) matrix of pairwise Euclidean distances, using broadcasting (no Python loop over pairs). Verify the diagonal is zero and the matrix is symmetric, assert the shape, then print ok.

Level DML applicationch19-D1

When vectorization costs too much memory

The broadcast distance matrix materializes an (N, N, D) difference block. For $N = 100{,}000$ and $D = 128$ in float64, estimate that block's memory, explain why the fully vectorized one-liner becomes impractical, and describe a strategy that keeps most of the vectorization speed without allocating the full block.
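One common workaround is to process the rows in blocks, so only a (chunk, N, D) slice of the difference tensor exists at any time. The sketch below is one such approach under stated assumptions: the function name dist_matrix_chunked and the default chunk size are illustrative, and it still stores the full (N, N) result, which for $N = 100{,}000$ in float64 is itself 80 GB. In practice each block would usually be reduced (for example to nearest neighbors) rather than kept.

```python
import numpy as np

# Size of the full (N, N, D) float64 block from the exercise
N_big, D_big = 100_000, 128
block_bytes = N_big * N_big * D_big * 8
print(block_bytes / 1e12, "TB")   # roughly ten terabytes

def dist_matrix_chunked(X, chunk=256):
    """Pairwise Euclidean distances, computed one block of rows at a time."""
    N = X.shape[0]
    out = np.empty((N, N))
    for start in range(0, N, chunk):
        stop = min(start + chunk, N)
        # Only a (stop - start, N, D) block is materialized here
        diff = X[start:stop, None, :] - X[None, :, :]
        out[start:stop] = np.sqrt((diff ** 2).sum(axis=-1))
    return out
```

The inner work is still fully vectorized; the Python loop runs only N / chunk times, so its overhead is small compared with the arithmetic.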

Level DNumerical experimentch19-D2

Choosing the step size for a central difference

A central difference $\frac{f(x+h) - f(x-h)}{2h}$ has truncation error $O(h^2)$, which suggests taking $h$ as small as possible. Yet in float64 a tiny $h$ makes the estimate worse. Explain the two competing error sources, why the total error is minimized at an intermediate $h$ (roughly $h \sim \epsilon^{1/3}$, where $\epsilon \approx 2.2\times 10^{-16}$ is machine epsilon), and what this implies for verifying analytic gradients in ML.
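The trade-off can be observed with a short experiment. The sketch below uses $f(x) = e^x$ at $x = 1$, chosen only because its derivative is known exactly; the function central_diff and the list of step sizes are illustrative choices.

```python
import math

def central_diff(f, x, h):
    """Central-difference estimate of f'(x) with step h."""
    return (f(x + h) - f(x - h)) / (2 * h)

x = 1.0
exact = math.exp(x)   # the derivative of exp is exp

errors = {}
for h in [1e-1, 1e-3, 1e-6, 1e-9, 1e-12]:
    errors[h] = abs(central_diff(math.exp, x, h) - exact)
    print(f"h = {h:.0e}   abs error = {errors[h]:.3e}")
```

Large steps are dominated by truncation error, while for very small steps the subtraction of two nearly equal floats is divided by a tiny $2h$, which amplifies rounding error.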

Level AHand calculationch20-A1

Predict for one example

A linear model has weights $\mathbf{w} = (1, -1)$ and bias $b = 0.5$. For the feature vector $\mathbf{x} = (2, 3)$, compute the prediction $\hat{y} = \mathbf{w}^\top\mathbf{x} + b$.

Level AHand calculationch20-A2

Compute the MSE

Predictions are $\hat{\mathbf{y}} = (3, 5, 4)$ and targets are $\mathbf{y} = (2, 5, 6)$. Compute the mean squared error $L = \frac{1}{n}\sum_i (\hat{y}_i - y_i)^2$.

Level AHand calculationch20-A3

The residual vector

With $\hat{\mathbf{y}} = (2, 4, 6)$ and $\mathbf{y} = (1, 5, 5)$, compute the residual vector $\hat{\mathbf{y}} - \mathbf{y}$. Enter as a, b, c.

Level AHand calculationch20-A4

Bias gradient by hand

For $n = 3$ examples the residuals are $\hat{\mathbf{y}} - \mathbf{y} = (-1, -2, -3)$. Compute the bias gradient $\frac{\partial L}{\partial b} = \frac{2}{n}\sum_i (\hat{y}_i - y_i)$.

Level AHand calculationch20-A5

Weight gradient by hand (one feature)

A single-feature dataset has $X = (1, 2, 3)^\top$ (so $d = 1$, $n = 3$) and residuals $\hat{\mathbf{y}} - \mathbf{y} = (-1, -2, -3)$. Compute the weight gradient $\nabla_{\mathbf{w}} L = \frac{2}{n} X^\top(\hat{\mathbf{y}} - \mathbf{y})$.

Level AEquation interpretationch20-A6

When is the MSE zero?

The mean squared error $L = \frac{1}{n}\sum_i(\hat{y}_i - y_i)^2$ equals exactly $0$ in which situation?

Level BShape reasoningch20-B1

Shape of the weight gradient

The feature matrix $X$ has shape $100 \times 5$ ($n = 100$ examples, $d = 5$ features). What is the shape of $\nabla_{\mathbf{w}} L = \frac{2}{n} X^\top(\hat{\mathbf{y}} - \mathbf{y})$?

Level BShape reasoningch20-B2

Why the transpose?

In the weight gradient we multiply the length-$n$ residual by $X^\top$, not by $X$ (where $X$ is $n \times d$). Which multiplication produces a length-$d$ result?

Level BML applicationch20-B3

What a large learning rate does

During gradient descent you notice the loss increasing each epoch, eventually reaching inf. In one or two sentences, explain the most likely cause and the first fix to try.

Level BEquation interpretationch20-B4

Reading the sign of the bias gradient

At some point during training $\frac{\partial L}{\partial b} = \frac{2}{n}\sum_i(\hat{y}_i - y_i)$ is positive. What does this say about the current predictions, and which way will the update $b \leftarrow b - \eta\,\frac{\partial L}{\partial b}$ move $b$?

Level BML applicationch20-B5

Why GD and the closed form agree

For linear regression, well-tuned gradient descent converges to the same weights as the normal equations $\mathbf{w} = (X^\top X)^{-1}X^\top\mathbf{y}$. What property of the MSE loss guarantees this, and why would the guarantee fail for a general neural network?

Level CNumPy implementationch20-C1

Implement predict, mse, and gradients

Implement predict(X, w, b), mse(yhat, y), and gradients(X, y, w, b) returning (dw, db) exactly per the formulas. Then, on a seeded random dataset, verify with assert that dw has shape (d,), that db is a scalar, and print ok.

Level CHand calculationch20-C2

One gradient-descent step by hand

Start from $w = 1$, $b = 0$ on the dataset $X = (1,2,3)^\top$, $\mathbf{y} = (2,4,6)$. The gradients there are $\nabla_{\mathbf{w}} L = -\frac{28}{3}$ and $\frac{\partial L}{\partial b} = -4$. With learning rate $\eta = 0.1$, compute the updated weight $w_{\text{new}} = w - \eta\,\nabla_{\mathbf{w}} L$.
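The gradient values quoted in this exercise can be reproduced numerically before applying the update. The sketch below assumes NumPy and stores $X$ as an (n, d) = (3, 1) array; the names dw, db, w_new, and b_new are illustrative.

```python
import numpy as np

X = np.array([[1.0], [2.0], [3.0]])   # shape (n, d) = (3, 1)
y = np.array([2.0, 4.0, 6.0])
w = np.array([1.0])
b = 0.0
eta = 0.1

n = X.shape[0]
residual = X @ w + b - y              # predictions minus targets
dw = (2 / n) * (X.T @ residual)       # weight gradient, shape (1,)
db = (2 / n) * residual.sum()         # bias gradient, a scalar

w_new = w - eta * dw
b_new = b - eta * db
print(dw, db, w_new, b_new)
```

Because both gradients are negative, the update increases $w$ and $b$, moving the predictions up toward the targets.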

Level CDerivationch20-C3

Derive the bias gradient

Starting from $L = \frac{1}{n}\sum_i r_i^2$ with $r_i = (X\mathbf{w} + b)_i - y_i$, derive $\frac{\partial L}{\partial b} = \frac{2}{n}\sum_i(\hat{y}_i - y_i)$ using the chain rule. State $\frac{\partial r_i}{\partial b}$ explicitly.

Level CNumPy implementationch20-C4

Finite-difference gradient check

You have an analytic gradients(X, y, w, b). Write a central-difference gradient check that perturbs each parameter by $\varepsilon = 10^{-5}$, estimates the gradient numerically as $\frac{L(\theta+\varepsilon) - L(\theta-\varepsilon)}{2\varepsilon}$, and asserts the max absolute difference from the analytic gradient is < 1e-5. Print ok.
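A gradient check along these lines might look like the sketch below. It assumes the MSE loss and gradient formulas used in the earlier ch20 exercises; the helper numeric_gradients, the seed, and the dataset sizes are illustrative choices rather than part of the exercise.

```python
import numpy as np

def predict(X, w, b):
    return X @ w + b

def mse(yhat, y):
    return np.mean((yhat - y) ** 2)

def gradients(X, y, w, b):
    """Analytic MSE gradients: (2/n) X^T r and (2/n) sum(r)."""
    n = X.shape[0]
    r = predict(X, w, b) - y
    return (2 / n) * (X.T @ r), (2 / n) * r.sum()

def numeric_gradients(X, y, w, b, eps=1e-5):
    """Central differences, perturbing one parameter at a time."""
    dw = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = eps
        loss_plus = mse(predict(X, w + step, b), y)
        loss_minus = mse(predict(X, w - step, b), y)
        dw[j] = (loss_plus - loss_minus) / (2 * eps)
    db = (mse(predict(X, w, b + eps), y) - mse(predict(X, w, b - eps), y)) / (2 * eps)
    return dw, db

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
w = rng.normal(size=3)
b = 0.5

dw, db = gradients(X, y, w, b)
dw_num, db_num = numeric_gradients(X, y, w, b)
max_diff = max(np.max(np.abs(dw - dw_num)), abs(db - db_num))
assert max_diff < 1e-5
print("ok")
```

The numerical version costs two loss evaluations per parameter, which is acceptable for a one-off check on a small model but not for training.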

Level DPaper-reading practicech20-D1

Normal equations vs. gradient descent

You must fit a least-squares model in two scenarios: (a) $n = 10^4$ examples, $d = 50$ features; (b) $n = 10^8$ examples streamed from disk, $d = 10^5$ sparse features. For each, argue which of the normal equations $\mathbf{w} = (X^\top X)^{-1}X^\top\mathbf{y}$ or gradient descent you would use, citing the cost that dominates. Then name one situation where the normal equations are not merely slow but impossible.

Level DDerivationch20-D2

Deriving the ridge-regression gradient

Ridge regression minimizes $L_{\text{ridge}} = \frac{1}{n}\sum_i(\hat{y}_i - y_i)^2 + \lambda\lVert\mathbf{w}\rVert^2$ with $\lambda > 0$. (i) Derive $\nabla_{\mathbf{w}} L_{\text{ridge}}$. (ii) Show that setting it to zero gives the modified normal equations $(X^\top X + n\lambda I)\mathbf{w} = X^\top\mathbf{y}$ (ignore the bias / assume centered data). (iii) Explain in one sentence why this fixes a singular $X^\top X$.
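The chain of steps for parts (i) and (ii) can be outlined as follows. This is a sketch under the exercise's own simplification that the bias is ignored, so $\hat{\mathbf{y}} = X\mathbf{w}$; it is an outline to compare against, not a full write-up.

```latex
\begin{aligned}
L_{\text{ridge}}
  &= \tfrac{1}{n}\lVert X\mathbf{w} - \mathbf{y}\rVert^2 + \lambda\lVert\mathbf{w}\rVert^2 \\
\nabla_{\mathbf{w}} L_{\text{ridge}}
  &= \tfrac{2}{n} X^\top (X\mathbf{w} - \mathbf{y}) + 2\lambda\mathbf{w} \\
\nabla_{\mathbf{w}} L_{\text{ridge}} = \mathbf{0}
  \;&\Longrightarrow\;
  X^\top X\mathbf{w} - X^\top\mathbf{y} + n\lambda\mathbf{w} = \mathbf{0}
  \qquad \text{(multiply by } \tfrac{n}{2}\text{)} \\
  \;&\Longrightarrow\;
  (X^\top X + n\lambda I)\,\mathbf{w} = X^\top\mathbf{y}
\end{aligned}
```

For part (iii), the relevant fact is that $X^\top X$ is positive semidefinite, so adding $n\lambda I$ raises every eigenvalue by $n\lambda > 0$.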

Exercise Center

Evaluate with precedence

The unary minus trap

Subtract two fractions

Smallest set containing a number

Infer the NumPy dtype

Floor division

Which quantity must be a float?

Integers cannot hold weights

Default dtype of zeros

Reading a subscripted sum

Fix the accuracy bug

Hand-evaluate, then check in NumPy

When 0.1 + 0.2 is not 0.3

Why dtype choice is a modeling decision

Solve a one-step linear equation

Variables on both sides

Solve a linear inequality

The sign-flip rule

Rearrange the line for the intercept

Formula from ax + b = 0

When does the relation flip?

Constraints as inequalities

Inequality becomes a boolean mask

Translate a word problem

Verify a hand solution in NumPy

Rearrange and derive: solve for the parameter

Enforce a probability constraint with a mask

A system of constraints defines a feasible region

Evaluate a logarithm

Product law of exponents

Root as a fractional power

A base-10 logarithm

exp and ln are inverses

Negative exponent

Exponential to log form

Spot the invalid log identity

Change of base, numerically

Why maximize log-likelihood?

Reading the log-likelihood equation

Implement a stable log-likelihood

Derive the quotient rule

Numerically stable log-sum-exp

Why subtract the max in log-sum-exp?

Cross-entropy as negative log-likelihood

Expand and evaluate a sum

Count the terms

Evaluate a double sum

Evaluate a product

argmax returns an index

Read a superscript index

Cardinality of a training set

Reading a conditional sum

argmin gives parameters, not the loss

Superscript: example or power?

Which index does softmax sum over?

Drop the constants: Big-O

Implement MSE from its sum

Softmax then argmax

A coupled double sum is a matmul

Pull a constant out of a sum

Reason about scaling from Big-O

Decode an unfamiliar loss

Evaluate a function

Compose at a point

Order matters

Find an inverse value

Domain of a reciprocal

Range of a square

Is it a function?

Codomain versus range

Which function is invertible?

Inputs of a model and a loss

Compose two functions symbolically

Derive an inverse

Composition on a grid in NumPy

Why deep networks are compositions

Sigmoid at zero

ReLU of a negative input

ReLU of a positive input