Exercise Center
320 exercises across the course.
Evaluate with precedence
Evaluate the expression by hand, applying operator precedence.
The unary minus trap
Evaluate . (This is the classic precedence trap — the exponent binds tighter than the leading minus.)
Subtract two fractions
Compute and give the result as a decimal.
Smallest set containing a number
What is the smallest of the sets that contains the number ?
Infer the NumPy dtype
In NumPy, what is the dtype of np.array([1, 2, 3]) on a typical 64-bit platform? Answer with the exact dtype name (e.g. int64 or float64).
Floor division
What does the floor-division expression 17 // 5 evaluate to in Python/NumPy?
Which quantity must be a float?
In a training loop, which of the following quantities should be stored as a float rather than an integer?
Integers cannot hold weights
Explain, in one or two sentences, why initializing a weight array with dtype=int and then applying a fractional gradient update leaves the weights unchanged.
Default dtype of zeros
What is the dtype of the array produced by np.zeros(3) (with no dtype argument)?
Reading a subscripted sum
In the expression , the index and the symbols come from different number worlds. Which set does the index range over, and which set do the values belong to?
Fix the accuracy bug
A metric function reports accuracy 0 when it should report 0.75. The buggy line is acc = correct // total with correct = 3, total = 4. Write a corrected accuracy(correct, total) that returns a float, verify it gives 0.75 for these inputs, and print ok.
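One possible fix, as a minimal sketch: true division (`/`) replaces floor division (`//`), so 3 correct out of 4 keeps its fractional part. The variable names `buggy` and `fixed` are illustrative only.

```python
def accuracy(correct, total):
    """Return the fraction of correct predictions as a float."""
    return correct / total  # true division; `//` would floor 3 / 4 down to 0


buggy = 3 // 4          # the original bug: integer floor division
fixed = accuracy(3, 4)  # true division keeps the fractional part

assert buggy == 0
assert fixed == 0.75
print("ok")
```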
Hand-evaluate, then check in NumPy
By hand, evaluate . Then confirm the value in NumPy, print the dtype you observe for the result, and print ok.
When 0.1 + 0.2 is not 0.3
Reals in $\mathbb{R}$ are exact, but float64 is not. Write NumPy code showing that 0.1 + 0.2 == 0.3 is False, then show that np.isclose(0.1 + 0.2, 0.3) is True, and print ok. In a comment, explain in one line why exact equality fails.
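A short sketch of how this check might look. The names `total`, `exact_equal`, and `close_enough` are illustrative; `np.isclose` is used with its default tolerances.

```python
import numpy as np

total = 0.1 + 0.2

# 0.1, 0.2 and 0.3 have no finite binary expansion, so each is rounded when
# stored, and the rounded sum does not land on the rounded 0.3.
exact_equal = (total == 0.3)
close_enough = bool(np.isclose(total, 0.3))

assert not exact_equal
assert close_enough
print("ok")
```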
Why dtype choice is a modeling decision
Modern deep learning increasingly trains in float16 / bfloat16 (16-bit) rather than float64. Give one concrete benefit of the lower-precision float, one concrete risk it introduces, and explain why an integer dtype is nonetheless still the right choice for token IDs and class labels. Tie each point back to the $\mathbb{Z}$-labels-versus-$\mathbb{R}$-measures distinction.
Solve a one-step linear equation
Solve for .
Variables on both sides
Solve for .
Solve a linear inequality
Solve and give the threshold value of (the boundary of the solution interval).
The sign-flip rule
Solve and give the boundary value of . (Remember what happens to the relation when you divide by a negative.)
Rearrange the line for the intercept
The line is . Solve for the intercept in terms of , , and .
Formula from ax + b = 0
For the general linear equation with , what is the unique solution ?
When does the relation flip?
Which single operation, applied to both sides of an inequality, reverses its direction (e.g. turns into )?
Constraints as inequalities
In machine learning, a learning rate must satisfy and a predicted probability must satisfy . In one or two sentences, explain why these are written as inequalities rather than equations, and what would go wrong if a value violated its constraint.
Inequality becomes a boolean mask
In NumPy, p is a 1-D array of shape . What does the expression p >= 0.5 evaluate to?
Translate a word problem
A batch has examples. Each epoch processes the whole batch once, and you want to run enough epochs that the model sees at least examples total. Write an inequality for the number of epochs in terms of , then solve it for .
Verify a hand solution in NumPy
You solved by hand and got . Write NumPy code that substitutes into both sides, asserts they agree with np.isclose, and prints ok.
Rearrange and derive: solve for the parameter
The standardization (z-score) formula is with . Derive the formula that recovers the original value from a given , showing each balanced move.
Enforce a probability constraint with a mask
Generate 8 values with a fixed seed via rng = np.random.default_rng(0) and raw = rng.standard_normal(8) (these can fall outside $[0, 1]$). Build a boolean mask of which entries already lie in $[0, 1]$, then use np.clip to force all entries into $[0, 1]$. Assert every clipped value satisfies the constraint and print ok.
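A minimal sketch of the mask-then-clip pattern, assuming the probability constraint is the closed interval [0, 1]. The names `inside` and `clipped` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
raw = rng.standard_normal(8)

# Boolean mask: True where the raw value is already a valid probability.
inside = (raw >= 0.0) & (raw <= 1.0)

# np.clip pushes values below 0 up to 0 and values above 1 down to 1.
clipped = np.clip(raw, 0.0, 1.0)

assert np.all((clipped >= 0.0) & (clipped <= 1.0))
# Entries that were already valid are left untouched by the clip.
assert np.array_equal(clipped[inside], raw[inside])
print("ok")
```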
A system of constraints defines a feasible region
A tiny training budget imposes two simultaneous constraints on the number of training steps : memory allows , and you need enough steps to converge, . (1) Describe the set of valid as an interval. (2) Now suppose someone also requires . Explain what happens to the feasible region and what that means practically. (3) Relate this to why an optimization problem can be infeasible.
Evaluate a logarithm
Evaluate . (Read it as: ' to what power gives ?')
Product law of exponents
Use to evaluate as a single integer.
Root as a fractional power
Write as a single power . Enter the exponent as a decimal.
A base-10 logarithm
Evaluate .
exp and ln are inverses
Evaluate .
Negative exponent
Evaluate as a decimal.
Exponential to log form
The statement is written in exponential form. Which is its correct logarithmic form?
Spot the invalid log identity
Which of the following is not a valid logarithm identity?
Change of base, numerically
Using change of base with and , evaluate . Give three decimals.
Why maximize log-likelihood?
Training maximizes the log-likelihood rather than the raw likelihood . Give two distinct reasons this substitution is safe (same optimum) and better.
Reading the log-likelihood equation
A model reports a total log-likelihood of over a dataset. Which statement is the best interpretation?
Implement a stable log-likelihood
Write log_likelihood(p) that returns $\sum_i \log p_i$ for a 1-D array of probabilities. Show that for tiny probabilities the naive product np.prod(p) underflows to 0.0 while your log-space sum stays finite, then print ok.
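One way this could be sketched. The test array (1000 probabilities of 1e-5 each) is a hypothetical choice, picked only because its product falls far below the smallest positive float64.

```python
import numpy as np


def log_likelihood(p):
    """Sum of log-probabilities, computed in log space."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(np.log(p)))


p = np.full(1000, 1e-5)     # hypothetical tiny probabilities
naive = float(np.prod(p))   # true value 1e-5000 is not representable
stable = log_likelihood(p)  # about 1000 * log(1e-5)

assert naive == 0.0
assert np.isfinite(stable)
print("ok")
```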
Derive the quotient rule
Derive the quotient rule starting from an exponent law, for .
Numerically stable log-sum-exp
Implement logsumexp(z) computing $\log \sum_i e^{z_i}$ with the max-subtraction trick, and show it agrees with the naive formula on safe inputs but does not overflow on large logits like z = [1000, 1001, 1002]. Print ok.
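A hedged sketch of the max-subtraction trick. The helper `naive_logsumexp` and the "safe" input values are illustrative assumptions; the overflow warning from the naive version is silenced on purpose with `np.errstate`.

```python
import numpy as np


def logsumexp(z):
    """log(sum(exp(z))) computed after shifting by max(z)."""
    z = np.asarray(z, dtype=float)
    m = np.max(z)
    # After the shift the largest exponent is exp(0) = 1, so nothing overflows.
    return float(m + np.log(np.sum(np.exp(z - m))))


def naive_logsumexp(z):
    z = np.asarray(z, dtype=float)
    return float(np.log(np.sum(np.exp(z))))


safe = np.array([0.5, -1.0, 2.0])  # hypothetical small logits
assert np.isclose(logsumexp(safe), naive_logsumexp(safe))

big = np.array([1000.0, 1001.0, 1002.0])
with np.errstate(over="ignore"):
    naive_big = naive_logsumexp(big)  # exp(1000) overflows to inf
stable_big = logsumexp(big)

assert np.isinf(naive_big)
assert np.isfinite(stable_big)
print("ok")
```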
Why subtract the max in log-sum-exp?
The log-sum-exp identity $\log \sum_i e^{z_i} = c + \log \sum_i e^{z_i - c}$ holds for any constant $c$. Prove the identity algebraically, then explain why the specific choice $c = \max_i z_i$ is the numerically safe one, addressing both overflow and underflow.
Cross-entropy as negative log-likelihood
For a classifier that outputs probability on the correct class of example , the average cross-entropy loss is . Explain why minimizing is the same as maximizing the likelihood , and interpret the per-example loss as an amount of 'surprise' — including what happens as and as .
Expand and evaluate a sum
Expand and compute .
Count the terms
How many terms does have?
Evaluate a double sum
For , compute .
Evaluate a product
Compute .
argmax returns an index
A classifier outputs probabilities over classes indexed . What is (use 0-based indexing)?
Read a superscript index
In the standard ML convention, what does denote?
Cardinality of a training set
A training set is written . What is ?
Reading a conditional sum
What does compute?
argmin gives parameters, not the loss
Training is written . What is ?
Superscript: example or power?
You read in a paper. Which statement is correct?
Which index does softmax sum over?
In , which is true about the indices and ?
Drop the constants: Big-O
An algorithm runs in operations. Its running time is:
Implement MSE from its sum
Translate into a function mse(yhat, y) using NumPy. Verify it on , (which should give ), then print ok.
Softmax then argmax
Implement softmax(z) returning , confirm the output sums to , and report the predicted class via np.argmax. Use and print ok.
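A possible sketch. The input `z = [1.0, 2.0, 3.0]` is a hypothetical stand-in, since the exercise's own input is not shown here, and subtracting the maximum before exponentiating is an added stability choice rather than something the prompt requires.

```python
import numpy as np


def softmax(z):
    """Softmax of a 1-D array of logits."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - np.max(z))  # shift by the max for numerical safety
    return e / np.sum(e)


z = np.array([1.0, 2.0, 3.0])      # hypothetical logits
probs = softmax(z)
predicted = int(np.argmax(probs))  # index of the most probable class

assert np.isclose(probs.sum(), 1.0)
assert predicted == 2
print("ok")
```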
A coupled double sum is a matmul
The layer score is . Implement it two ways — an explicit double loop and W @ x — for a random of shape and of length (fixed seed). Assert they agree and print match. In a comment, state the Big-O cost.
Pull a constant out of a sum
Prove the linearity fact for any scalar constant , and explain why this justifies writing MSE with the outside the sum.
Reason about scaling from Big-O
A paper reports that a dense layer costs $O(n d^2)$ (for $n$ tokens of dimension $d$) while self-attention costs $O(n^2 d)$. For a fixed $d$, if you double the sequence length $n$, by what factor does each cost grow? Then explain, using only Big-O reasoning, why attention becomes the bottleneck for long sequences even though both are 'polynomial'.
Decode an unfamiliar loss
Using only the notation from this chapter, decode the (binary) cross-entropy loss . Identify each symbol and its type, expand the summand for a single example with label , and state in one sentence what the loss rewards.
Evaluate a function
Let . Compute .
Compose at a point
Let and . Compute .
Order matters
For and , compute . (Compare with A2, where .)
Find an inverse value
The function has inverse . Compute .
Domain of a reciprocal
For over the real numbers, which single value of must be excluded from the domain?
Range of a square
For with domain all real numbers, what is the smallest value in the range?
Is it a function?
A relation is graphed in the plane. Which single test decides whether it defines as a function of ?
Codomain versus range
Explain in one or two sentences the difference between the codomain and the range of a function , using with as an example.
Which function is invertible?
Each function below has domain all of . Which one is invertible (one-to-one)?
Inputs of a model and a loss
A model is written and its loss . In one or two sentences, say which quantity is held fixed and which varies (a) at prediction time and (b) at training time, and why is written as a function of alone.
Compose two functions symbolically
Let and . Derive closed-form expressions for and , and confirm they are different functions.
Derive an inverse
The function is one-to-one on . Derive its inverse and verify .
Composition on a grid in NumPy
Implement and as Python functions. Using np.linspace(-3, 3, 7), evaluate and on the grid, confirm with np.allclose that they are not equal everywhere, and print ok.
Why deep networks are compositions
A network layer is a function , and an -layer network is the composition . (a) Explain why stacking only linear layers gains no expressive power over a single linear layer. (b) Explain what a nonlinear activation inserted between layers changes. (c) Connect the composition structure to why the chain rule is central to training.
Sigmoid at zero
Compute for the sigmoid .
ReLU of a negative input
Compute .
ReLU of a positive input
Compute .
Range of tanh
What is the range of ?
Leaky ReLU on a negative input
Using the course's Leaky ReLU with leak , so for and for , compute .
Maximum slope of the sigmoid
The sigmoid derivative is . What is its maximum value?
Which activation is zero-centered?
Which of these hidden-layer activations is zero-centered (outputs symmetric about $0$)?
Match the graph to the function
A plotted curve is $0$ for all negative inputs, then rises as a straight line of slope $1$ for positive inputs, with a sharp corner at the origin. Which function is it?
Range of softplus
What is the range of softplus ?
Saturation and vanishing gradients
In a deep network of sigmoid layers, explain why pushing many neurons into their saturated regions (large ) makes the early layers train slowly. Reference the chain rule and the size of .
Sigmoid output, ReLU hidden
A binary classifier commonly uses ReLU in its hidden layers but a sigmoid on the final output neuron. Give the reason for each choice in one sentence.
Implement ReLU and Leaky ReLU
Implement vectorized relu(x) and leaky_relu(x, alpha=0.1) for a 1-D NumPy array. Verify on x = np.array([-2.0, 0.0, 3.0]) that ReLU gives [0, 0, 3] and Leaky ReLU gives [-0.2, 0, 3], then print ok.
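One way the two activations might be written, as a sketch. `np.maximum` and `np.where` both operate element-wise, so no Python loop is needed.

```python
import numpy as np


def relu(x):
    """Element-wise max(0, x)."""
    return np.maximum(x, 0.0)


def leaky_relu(x, alpha=0.1):
    """x where x > 0, otherwise alpha * x."""
    return np.where(x > 0, x, alpha * x)


x = np.array([-2.0, 0.0, 3.0])

assert np.allclose(relu(x), [0.0, 0.0, 3.0])
assert np.allclose(leaky_relu(x), [-0.2, 0.0, 3.0])
print("ok")
```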
Numerically stable softplus
Implement softplus(x) that does not overflow for large inputs, and confirm softplus(1000.0) is finite (and approximately $1000$) while the naive np.log(1 + np.exp(1000.0)) is inf. Also check numerically that the derivative of softplus equals the sigmoid. Print ok.
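A hedged sketch that leans on `np.logaddexp`, using the identity log(1 + e^x) = log(e^0 + e^x). The grid `xs`, the step `h`, and the tolerance are illustrative assumptions; other stable formulations (for example max(x, 0) + log1p(exp(-|x|))) would serve equally well.

```python
import numpy as np


def softplus(x):
    """log(1 + exp(x)) without overflow, via log(exp(0) + exp(x))."""
    return np.logaddexp(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


big = softplus(1000.0)
with np.errstate(over="ignore"):
    naive = np.log(1 + np.exp(1000.0))  # exp(1000) overflows to inf

assert np.isfinite(big) and np.isclose(big, 1000.0)
assert np.isinf(naive)

# Central-difference estimate of d/dx softplus, compared with the sigmoid.
xs = np.linspace(-5.0, 5.0, 11)
h = 1e-5
numeric = (softplus(xs + h) - softplus(xs - h)) / (2 * h)
assert np.allclose(numeric, sigmoid(xs), atol=1e-6)
print("ok")
```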
Derive the sigmoid's derivative
Show that $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$ for $\sigma(x) = \frac{1}{1 + e^{-x}}$.
Simulate a vanishing gradient
Empirically show gradient vanishing: multiply together the sigmoid derivatives across a stack of layers whose pre-activations are all saturated (e.g. ). Compare the product for a 20-layer sigmoid stack against a 20-layer ReLU stack (per-layer derivative for ). Print both products, assert the sigmoid product is far smaller, and print ok.
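A small simulation sketch. The saturated pre-activation value `z = 5.0` is a hypothetical choice (the exercise's own example value is not shown here), and the ReLU stack is assumed to have positive pre-activations throughout so that each per-layer derivative is 1.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


n_layers = 20
z = 5.0  # hypothetical saturated pre-activation, shared by every layer

# Per-layer local derivatives that the chain rule multiplies together.
sigmoid_grads = np.full(n_layers, sigmoid(z) * (1.0 - sigmoid(z)))
relu_grads = np.ones(n_layers)  # ReLU derivative is 1 for z > 0

sigmoid_product = float(np.prod(sigmoid_grads))
relu_product = float(np.prod(relu_grads))

print(sigmoid_product, relu_product)
assert sigmoid_product < 1e-30 * relu_product
print("ok")
```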
Why did ReLU help deep networks?
Before ~2011, deep networks used sigmoid/tanh and were notoriously hard to train past a few layers; switching to ReLU was a turning point. Explain the main mechanism by which ReLU improved trainability of deep nets, name at least one additional practical benefit, and state one drawback ReLU introduced (with a named remedy).
Choosing a positivity-preserving output activation
You are designing a network whose scalar output must be strictly positive (e.g. it predicts the variance of a Gaussian, or a rate parameter). Compare using , , and softplus as the final activation. Which would you choose and why? Address the range, gradient behavior, and numerical stability of each.
Read a shape off a nested list
What is the .shape of np.array([[1, 2, 3], [4, 5, 6]])? Enter it as a tuple, e.g. (2, 3).
Rank of a 3-D array
An array has shape (2, 3, 4). What is its .ndim (its rank, the number of axes)?
Default integer dtype
On a 64-bit platform, what dtype does np.array([1, 2, 3]) get? Enter the dtype name, e.g. int64.
Broadcast a column and a row
What is the output shape of np.arange(3).reshape(3, 1) + np.arange(4).reshape(1, 4)? Enter it as a tuple, e.g. (3, 4).
Shape after a sum over axis 0
For A with shape (2, 3), what is the shape of A.sum(axis=0)? Enter it as a tuple, e.g. (3,).
Integer floor division
What is np.array([1, 2, 3]) // 2? Enter the three resulting values as a, b, c.
Index a 2-D array
Let A = np.arange(12).reshape(3, 4). What single value is A[1, 2]?
Broadcast a matrix column against a vector
What is the output shape of an operation between an array of shape (4, 1) and an array of shape (3,)? Enter it as a tuple, e.g. (4, 3).
Which pair fails to broadcast?
Which of the following pairs of shapes cannot be broadcast together?
The (n,) vs (n,1) trap
You have A of shape (5, 3) and a vector v of shape (3,). What does A - v do, and what shape is the result?
Dtype of a mixed sum
a = np.array([1, 2, 3]) (int64) and b = np.array([1.0, 2.0, 3.0]) (float64). What is the dtype of a + b?
Integer overflow
What is the value of (np.array([127], dtype=np.int8) + np.int8(1))[0]?
Per-feature standardization by broadcasting
Given a batch X of shape (N, D), implement standardize(X) that returns (X - mu) / sigma, where mu and sigma are the per-feature (per-column) mean and standard deviation. Use axis and broadcasting, with no Python loops. Verify the output has per-feature mean $\approx 0$ and std $\approx 1$, then print ok.
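A minimal sketch, assuming no feature column is constant (a zero standard deviation would need a guard that is not shown). The test batch of shape (100, 4) is a hypothetical choice.

```python
import numpy as np


def standardize(X):
    """Per-feature z-score of a batch X with shape (N, D)."""
    mu = X.mean(axis=0)      # shape (D,): one mean per column
    sigma = X.std(axis=0)    # shape (D,): one std per column
    return (X - mu) / sigma  # (N, D) against (D,) broadcasts over rows


rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=(100, 4))  # hypothetical batch
Z = standardize(X)

assert np.allclose(Z.mean(axis=0), 0.0)
assert np.allclose(Z.std(axis=0), 1.0)
print("ok")
```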
Row-normalize with keepdims
Given a non-negative array A of shape (N, D), implement row_normalize(A) so that each row sums to 1. Use keepdims=True so the row sums broadcast back onto A. Assert each row sums to 1, then print ok.
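One way this might look, as a sketch. The small offset added to the random test array is an assumption made only to rule out an all-zero row, which would divide by zero.

```python
import numpy as np


def row_normalize(A):
    """Scale each row of a non-negative (N, D) array to sum to 1."""
    row_sums = A.sum(axis=1, keepdims=True)  # shape (N, 1), not (N,)
    return A / row_sums                      # (N, D) / (N, 1) broadcasts


rng = np.random.default_rng(0)
A = rng.random((5, 3)) + 0.1  # hypothetical strictly positive input
P = row_normalize(A)

assert np.allclose(P.sum(axis=1), 1.0)
print("ok")
```

Without `keepdims=True` the row sums would have shape (N,), which lines up against the last axis of A rather than the rows, and would either fail to broadcast or silently divide by the wrong sums when N equals D.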
Predict-then-verify a broadcast shape
Write broadcast_shape(s1, s2) that returns the broadcast output shape (as a tuple) of two shapes, or raises ValueError if they are incompatible — implementing the rule by hand (do not call np.broadcast_shapes). Test that broadcast_shape((2, 1, 3), (4, 3)) == (2, 4, 3) and that (3, 4) with (3,) raises, then print ok.
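A pure-Python sketch of the rule: pad the shorter shape with leading 1s, then compare dimension pairs. The error message text is illustrative.

```python
def broadcast_shape(s1, s2):
    """Return the broadcast shape of s1 and s2, or raise ValueError."""
    n = max(len(s1), len(s2))
    # Left-pad the shorter shape with 1s so both have n dimensions.
    a = (1,) * (n - len(s1)) + tuple(s1)
    b = (1,) * (n - len(s2)) + tuple(s2)

    out = []
    for da, db in zip(a, b):
        if da == db or db == 1:
            out.append(da)
        elif da == 1:
            out.append(db)
        else:
            raise ValueError(f"shapes {s1} and {s2} are not broadcastable")
    return tuple(out)


assert broadcast_shape((2, 1, 3), (4, 3)) == (2, 4, 3)

try:
    broadcast_shape((3, 4), (3,))
    raised = False
except ValueError:
    raised = True

assert raised
print("ok")
```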
Boolean mask and fancy indexing
Given A = np.arange(12).reshape(3, 4), use a boolean mask to extract all entries greater than 5 into a 1-D array, and separately use fancy indexing to build a new array from rows [2, 0] in that order. Assert the mask result equals [6, 7, 8, 9, 10, 11] and the fancy result has shape (2, 4), then print ok.
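A short sketch of both indexing styles. The names `selected` and `picked` are illustrative.

```python
import numpy as np

A = np.arange(12).reshape(3, 4)

# Boolean mask: same shape as A, True where the entry exceeds 5.
mask = A > 5
selected = A[mask]  # flattens the chosen entries into a 1-D array

# Fancy indexing: a list of row indices builds a new array in that order.
picked = A[[2, 0]]

assert np.array_equal(selected, [6, 7, 8, 9, 10, 11])
assert picked.shape == (2, 4)
print("ok")
```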
Diagnose a silent broadcasting bug
An engineer wants element-wise squared errors between predictions pred and targets true, both length-1000 vectors, and writes err = pred[:, None] - true[None, :], then mse = (err ** 2).mean(). The code runs but the reported MSE is wrong and memory usage spikes. Explain what shape err actually has, why it runs without error, what the correct one-liner is, and what single assertion would have caught the bug immediately.
float32 vs float64 in a training pipeline
Deep-learning frameworks default model weights and activations to float32, while NumPy defaults to float64. Give two concrete reasons float32 is preferred for large-scale training, one concrete risk it introduces, and one place in a pipeline where you would deliberately switch back to float64.
Compute a dot product
Let and . Compute .
Vector addition
Compute for and . Enter the result as x, y.
Scalar multiplication
Compute for . Enter as x, y, z.
Length of a vector
Compute the Euclidean length for .
A linear combination
With and , compute . Enter as x, y.
Orthogonality check
Two vectors are orthogonal when their dot product is which value?
Sign of the dot product
Two nonzero vectors have a negative dot product. Which is true about the angle between them?
Why the weighted sum?
In a neuron , explain in one or two sentences what a large positive weight means about feature 's influence on , and what a weight near zero means.
Shape of a dot product
If (say, two word embeddings), what is the shape of ?
Commutativity of the dot product
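Show that the dot product is commutative: $\mathbf{a} \cdot \mathbf{b} = \mathbf{b} \cdot \mathbf{a}$ for all $\mathbf{a}, \mathbf{b} \in \mathbb{R}^n$.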
Implement cosine similarity
Implement cosine_similarity(a, b) for two 1-D NumPy arrays, returning $\frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\|\,\|\mathbf{b}\|}$. Verify it returns approximately 1.0 for two parallel vectors, then print ok.
Loop vs vectorized dot product
Write dot_loop(a, b) (an explicit Python loop) and dot_vec(a, b) (using @). Generate a, b of length 100000 with a fixed seed, confirm the results agree with np.isclose, and print match. In a comment, state which is faster and why.
Derive the projection formula
The projection of onto is the vector such that the error is orthogonal to . Derive .
Why cosine, not dot product, for similarity?
Retrieval systems and embedding papers usually rank by cosine similarity rather than the raw dot product. Give a concrete example of vectors where the raw dot product is misleading but cosine is not, explain what property cosine adds, then name one situation where the raw dot product is the right choice.
Compute an L2 norm
Compute the L2 (Euclidean) norm of .
Compute an L1 norm
Compute the L1 (Manhattan) norm of .
Compute an L∞ norm
Compute the L∞ (max) norm of .
Euclidean distance between two points
Compute the Euclidean distance between and .
Cosine of orthogonal vectors
Compute the cosine similarity of and .
Cosine of parallel vectors
Compute the cosine similarity of and .
Ordering of the three norms
For any vector , which chain of inequalities among its L1, L2, and L∞ norms always holds?
Which penalty produces sparsity?
You want a regression model that sets many weights to exactly zero for feature selection. Do you add an L1 or an L2 penalty to the loss, and what is the resulting method called?
Distance versus difference of norms
A colleague computes the 'distance' between and as and gets , concluding the points coincide. In one or two sentences, explain the error and give the correct Euclidean distance.
Why cosine for retrieval
A document embedding and the same document repeated twice, embedded as roughly , should be judged equally relevant to a query . Explain in one or two sentences why cosine similarity gives the same score for and but the raw dot product does not.
Norm versus squared norm in regularization
Ridge regression penalizes , not . Give the main reason the squared norm is preferred as the penalty term.
Implement the general Lp norm
Implement lp_norm(x, p) for a 1-D NumPy array, returning , and handle p = np.inf as the max norm. Verify against np.linalg.norm(x, ord=p) for on , then print ok.
Cosine similarity with a zero-vector guard
Implement cosine(a, b) returning $\frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\|\,\|\mathbf{b}\|}$, but return 0.0 if either vector is the zero vector (so it never emits nan). Clamp the result to $[-1, 1]$. Verify it gives 1.0 for parallel vectors, 0.0 for orthogonal ones, and 0.0 for a zero input, then print ok.
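A possible sketch of the guarded version. The test vectors are hypothetical; the clamp uses `np.clip` so that rounding cannot push the result slightly outside [-1, 1].

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity with a zero-vector guard and a clamp to [-1, 1]."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    if na == 0.0 or nb == 0.0:
        return 0.0  # convention chosen by the exercise; avoids 0 / 0 = nan
    return float(np.clip(np.dot(a, b) / (na * nb), -1.0, 1.0))


assert np.isclose(cosine([1.0, 2.0], [2.0, 4.0]), 1.0)  # parallel
assert cosine([1.0, 0.0], [0.0, 1.0]) == 0.0            # orthogonal
assert cosine([0.0, 0.0], [1.0, 2.0]) == 0.0            # zero input
print("ok")
```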
Derive cosine similarity from the dot product
Starting from the geometric form of the dot product, $\mathbf{a} \cdot \mathbf{b} = \|\mathbf{a}\|\,\|\mathbf{b}\| \cos\theta$, derive the cosine similarity formula and explain why the result must lie in $[-1, 1]$ and why it is scale-invariant.
Why L1 produces sparsity but L2 does not
Ridge (L2) and Lasso (L1) both shrink weights, yet only Lasso drives many of them to exactly zero. Using the geometry of the L1 and L2 unit balls in 2-D, explain why the L1 constraint favors solutions on the axes (sparse) while the L2 constraint does not. Then name one concrete situation where you would deliberately prefer L2 over L1.
Shape of a matrix product
has shape and has shape . What is the shape of ? Enter as (rows, cols).
Matrix–vector product by hand
Let and . Compute . Enter as y1, y2.
Indexing an entry
For , what is A[1][2] using 0-based indexing (as in NumPy: row first, column second)?
Shape after transpose
has shape . What is the shape of ? Enter as (rows, cols).
Matrix addition
Compute for and . Enter the entries row by row: a11, a12, a21, a22.
Trace of a matrix
Compute for .
Tracking shapes through a chain
With of shape , of shape , and of shape , what is the shape of ? Enter as (rows, cols).
Why AB is not BA
Let have shape and have shape . Which statement is correct?
Ax as a combination of columns
Using the column view, compute for and : form . Enter as y1, y2.
The identity does nothing
For of shape and the identity, what is ?
The batch layer shape
A dense layer has weights of shape (20 outputs, 10 inputs). A batch of data has shape (64 examples, each with 10 features). What is the shape of ? Enter as (rows, cols).
Implement matmul: loop vs @
Write matmul_loop(A, B) using an explicit triple loop over $i, j, k$ (the entry formula $C_{ij} = \sum_k A_{ik} B_{kj}$), and confirm it agrees with A @ B on random matrices with a fixed seed. Assert the output shape is (A.shape[0], B.shape[1]), then print ok.
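A sketch of the triple loop, where each output entry accumulates the products along the shared inner dimension. The matrix shapes (3, 4) and (4, 2) are hypothetical test sizes.

```python
import numpy as np


def matmul_loop(A, B):
    """Matrix product via an explicit loop over rows, columns, and the inner index."""
    m, n = A.shape
    n2, p = B.shape
    if n != n2:
        raise ValueError("inner dimensions must match")
    C = np.zeros((m, p))
    for i in range(m):          # row of A
        for j in range(p):      # column of B
            for k in range(n):  # shared inner index
                C[i, j] += A[i, k] * B[k, j]
    return C


rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))  # hypothetical shapes
B = rng.standard_normal((4, 2))
C = matmul_loop(A, B)

assert C.shape == (A.shape[0], B.shape[1])
assert np.allclose(C, A @ B)
print("ok")
```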
Matrix–vector product, both views
Implement matvec_rows(A, x) (stack of row dot products) and matvec_cols(A, x) (weighted sum of columns), and confirm both equal A @ x. Use a fixed seed with of shape and of length . Print ok.
Transpose reverses a product
Verify numerically that (and that the naive generally is not equal, and may even be a shape error). Use a fixed seed with of shape and of shape . Print ok.
A Gram matrix is symmetric
For any data matrix of shape , show numerically that the Gram matrix is square, has shape , and is symmetric (). Use a fixed seed with , , and print ok.
Attention, purely by shape
Self-attention computes , where each have shape (here = sequence length, = head dimension). Softmax is applied row-wise and does not change shape. Working purely from the inner/outer rule, (a) give the shape of the score matrix , (b) give the shape of the final output, and (c) explain in one sentence why the score matrix is regardless of .
Debug a stacked-layer shape chain
An engineer stacks two dense layers on a batch. The data is of shape ; layer 1 has weights of shape ; layer 2 has weights of shape . They write Z = X @ W1 @ W2 and get a shape error on the first product. Explain why it fails, write the corrected expression using transposes, and give the shape after each matmul.
Solve a 2×2 system by elimination
Solve the system
Enter the solution as x, y.
Another 2×2 solve
Solve
Enter as x, y.
Write a system in matrix form
The system is written as . What is the right-hand side vector ? Enter as b1, b2.
Count the free variables
A consistent system in unknowns has . How many free variables does its solution set have?
Reading an inconsistent row
After elimination, a row of the augmented matrix becomes . How many solutions does the system have?
Determinant and uniqueness
For a square system , a unique solution for every is guaranteed exactly when is which value?
Classify: parallel lines
The two equations of a 2×2 system plot as parallel but distinct lines. How many solutions does the system have?
Rank decides the outcome
A system has unknowns with . Which outcome holds?
Shapes in the normal equations
In least squares the design matrix is with examples and features. In the normal equations , what is the shape of the coefficient matrix ?
When is XᵀX singular?
Explain, in one or two sentences, why becomes singular when two feature columns of are exactly proportional (perfectly collinear), and what this means for the regression weights.
Solve a 3×3 system in NumPy
Use np.linalg.solve to solve
Verify the solution with np.allclose(A @ x, b), then print ok.
Elimination to RREF by hand
Solve
by Gaussian elimination on the augmented matrix, showing the forward elimination step and back-substitution. Give the final solution.
Detect a singular system in NumPy
Build the singular system . Attempt np.linalg.solve inside a try/except np.linalg.LinAlgError, report the rank of versus the number of unknowns, and print ok.
Why not just invert XᵀX?
Textbooks write the least-squares weights as , yet mature libraries never form that inverse — they call a solver like np.linalg.lstsq. Explain (a) why explicitly inverting is numerically risky, referencing the condition number, and (b) what regularization (ridge, ) does to solvability and conditioning.
Are these two vectors independent?
Are and linearly independent? Enter 1 for independent or 0 for dependent.
Read the rank off the columns
The matrix has how many linearly independent columns, i.e. what is ?
Where does e_2 land?
For , compute where . Enter as x, y.
Rank–nullity arithmetic
A matrix has 5 columns and . What is ?
Dimension of a span
What is the dimension of in ?
Definition of the null space
The null space of is the set of vectors satisfying which equation?
Column space vs. null space — which space?
Let be a matrix. The column space is a subspace of which space, and the null space is a subspace of which space?
Maximum possible rank
What is the largest value can take for a matrix ?
Independent but not orthogonal
True or false: 'If two vectors are linearly independent, they must be orthogonal.' Enter 1 for true or 0 for false, and be ready to justify.
Why low rank loses information
A linear layer has but . In one or two sentences, explain what this implies about the outputs the layer can produce and about information lost from the input.
When does Ax = b have a solution?
For a fixed matrix , the system has at least one solution exactly when:
Compute rank and test independence in NumPy
Build the matrix with columns , , and using np.column_stack. Print its rank with np.linalg.matrix_rank, decide whether the three columns are independent (via rank == n_cols), and print ok.
Find a basis for a column space
The matrix has columns , , . Identify a basis for and state .
Find a null-space vector by reasoning
For , find a nonzero vector with , and explain why the null space must be nonzero here.
Why low-rank adapters work (LoRA)
A LoRA adapter replaces a weight update with a product where , , and . (a) Prove . (b) Count the parameters saved for . (c) State the empirical hypothesis that makes this a good trade, and one situation where it would fail.
Rank collapse through stacked layers
Consider a purely linear network with each . (a) If , what can you say about ? (b) What does this imply about information flowing through the network, and (c) why do real networks insert nonlinearities between layers?
Verify an eigenvector
Let and . Compute ; it should equal for some scalar. Enter the eigenvalue .
Eigenvalues of a triangular matrix
Give the larger eigenvalue of the upper-triangular matrix .
Sum of eigenvalues = trace
Without solving for them individually, give the sum of the two eigenvalues of .
Explained variance ratio
A 2-D PCA yields covariance eigenvalues and . What fraction of the total variance is explained by the first principal component?
Eigenvectors of a symmetric matrix
A covariance matrix is symmetric. What is guaranteed about the eigenvectors belonging to its distinct eigenvalues?
What an eigenvalue means in PCA
In PCA, the eigenvalue of the covariance matrix (for principal component ) equals which quantity?
Geometric meaning of an eigenvector
Which statement best describes an eigenvector of a matrix geometrically?
Why center before PCA?
Explain in a sentence or two why the data must be centered (mean subtracted) before forming the covariance matrix for PCA. What can the first component pick up if you forget?
Eigenvector sign ambiguity
You run the same PCA twice. The first principal component returns as one time and the next. Which explanation is correct?
Shape of the covariance matrix
Your data matrix has shape — 1000 samples, 50 features. What is the shape of the covariance matrix ?
Zero covariance vs independence
The off-diagonal entry of a covariance matrix is . Explain what this does and does not guarantee about the two features, and give a concrete example where covariance is zero yet the features are perfectly dependent.
Eigenvalues from the characteristic equation
Use the characteristic equation to find both eigenvalues of . Enter them as larger, smaller.
Implement PCA projection from scratch
Write pca_project(X, k) that centers X (shape (n, d)), forms the covariance , eigen-decomposes it with np.linalg.eigh, sorts components by descending eigenvalue, and returns the projection of the centered data onto the top k components (shape (n, k)). Test on a correlated 2-D dataset with k=1, assert the projected variance equals the top eigenvalue, and print ok.
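A hedged sketch of the pipeline described above. It assumes the covariance uses the n - 1 (sample) normalization, which is why the projected variance is checked with `ddof=1`; the correlated test dataset is a hypothetical construction.

```python
import numpy as np


def pca_project(X, k):
    """Project centered data onto the top-k principal components."""
    Xc = X - X.mean(axis=0)               # center each feature
    n = Xc.shape[0]
    C = Xc.T @ Xc / (n - 1)               # assumed sample covariance, (d, d)
    eigvals, eigvecs = np.linalg.eigh(C)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]     # reorder to descending
    top = eigvecs[:, order[:k]]           # (d, k) leading directions
    return Xc @ top                       # (n, k) projection


rng = np.random.default_rng(0)
t = rng.standard_normal(500)
X = np.column_stack([t, 2.0 * t + 0.3 * rng.standard_normal(500)])

Z = pca_project(X, k=1)
top_eigenvalue = np.linalg.eigvalsh(np.cov(X, rowvar=False))[-1]

assert Z.shape == (500, 1)
assert np.isclose(Z.var(axis=0, ddof=1)[0], top_eigenvalue)
print("ok")
```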
Derive: top eigenvector maximizes variance
Show that the unit direction maximizing the projected variance subject to is an eigenvector of , and that the maximum value equals the largest eigenvalue.
Explained-variance ratio in NumPy
Given eigenvalues of a covariance matrix, write code that computes the cumulative explained-variance ratio and returns the smallest number of components needed to retain at least 90% of the variance. Test on eigenvalues [10.0, 4.0, 1.0, 0.5, 0.5], assert the answer is 3, and print ok.
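One way to sketch this. The function name `components_for` and its `threshold` parameter are hypothetical; dividing by the last cumulative total guarantees the final ratio is exactly 1.0.

```python
import numpy as np


def components_for(eigvals, threshold=0.90):
    """Smallest number of components whose cumulative variance ratio >= threshold."""
    vals = np.sort(np.asarray(eigvals, dtype=float))[::-1]  # descending
    totals = np.cumsum(vals)
    cumulative = totals / totals[-1]
    # argmax on a boolean array returns the first True position.
    return int(np.argmax(cumulative >= threshold)) + 1


needed = components_for([10.0, 4.0, 1.0, 0.5, 0.5])

assert needed == 3
print("ok")
```

For the test eigenvalues the cumulative ratios are 0.625, 0.875, 0.9375, 0.96875, and 1.0, so the third component is the first to reach 90%.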
When does PCA fail?
PCA finds the best linear subspace. Describe a concrete dataset whose true structure is 1-dimensional but which PCA cannot compress to one component without large error, explain geometrically why PCA fails, and name one family of methods designed to handle it.
Max variance is not always max usefulness
PCA keeps the directions of largest variance. Argue, with a concrete scenario, why the highest-variance direction can be the wrong thing to keep for a downstream classification task, and contrast PCA's objective with what a supervised method (e.g. LDA) optimizes instead.
Limit by direct substitution
Evaluate . (The function is a polynomial, so it is continuous everywhere.)
A 0/0 limit by factoring
Evaluate . (Substitution gives ; factor first.)
One-sided limits and existence
For a piecewise function, and , but . What is ?
Difference quotient of a linear function
For , the difference quotient simplifies to a constant. What is (the limit as )?
Classify the discontinuity
A step function has and . Which type of discontinuity is at ?
Continuity of ReLU at zero
Is $\mathrm{ReLU}(x) = \max(0, x)$ continuous at $x = 0$? Enter 1 for yes, 0 for no.
Limit vs. value
Which statement best captures why can exist even when is undefined?
Why ReLU is not differentiable at 0
Explain, in terms of one-sided limits of the difference quotient, why $\mathrm{ReLU}(x) = \max(0, x)$ is not differentiable at $x = 0$ even though it is continuous there. What do frameworks do at that point?
The indeterminate form in the derivative
At , the difference quotient has which form, and how is a finite derivative recovered from it?
Why the finite-difference step can't be too small
A colleague validates a gradient with and reasons: 'smaller is always closer to the true limit, so I'll use .' Explain why this makes the numerical estimate worse, not better.
Derive the derivative of a cubic
Using the limit definition , derive for . Show the cancellation explicitly.
Numerically approach a limit
Write numeric_derivative(f, a, h) returning the forward difference quotient . For at , print the estimate for , assert the estimate is within of the exact value , then print ok.
Detect a discontinuity numerically
For the step function if else , estimate the left and right limits at by evaluating at and for a shrinking sequence of . Print both estimates, assert they differ (confirming a jump discontinuity), and print ok.
Subgradients and where non-differentiability bites
Deep networks are trained with gradient descent, yet ReLU, max-pooling, and the L1 penalty are all non-differentiable at isolated points. Explain why training still works (invoke measure-zero and subgradients), then give one concrete situation where a non-differentiable point does cause real trouble and how practitioners mitigate it.
Differentiate a polynomial, evaluate at a point
Let . Using the power, sum, and constant-multiple rules, compute .
Power rule at a point
Let . Compute .
Derivative of the exponential
Let . Compute . (Recall .)
Derivative of the natural log
Let . Compute .
Constant multiple rule
Let . Compute .
Derivative of a constant
What is — the derivative of the constant function ?
Power rule with a fractional exponent
Let . Compute .
Apply the product rule
Let . Using the product rule, compute . (Use .)
Apply the quotient rule
Let . Using the quotient rule, compute .
Sensitivity and the descent direction
A loss has derivative . To decrease the loss, should you increase or decrease , and roughly how much does change if you nudge by ?
A second derivative
Let . Compute the second derivative , and state whether the curve is locally cupping upward or downward there.
Why central beats forward
The forward difference has error $O(h)$; the central difference has error $O(h^2)$. For a small $h$, which is more accurate, and by roughly what factor does the central-difference error shrink when you halve $h$?
Differentiate x³ from first principles
Using only the limit definition $f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$, derive the derivative of $f(x) = x^3$. Show the cancellation of $h$ before taking the limit.
Implement the central-difference derivative
Implement numerical_derivative(f, x, h=1e-5) using the central difference . Verify it against the analytic derivative for (which is ) and at a couple of points using np.isclose, then print ok.
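A minimal sketch of the central-difference estimator. The two test functions used here (`x**3` and `np.sin`) are stand-ins chosen for illustration and may differ from the ones the exercise intends:

```python
import numpy as np

def numerical_derivative(f, x, h=1e-5):
    """Central-difference estimate of f'(x): (f(x + h) - f(x - h)) / (2h)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

# Assumed test functions: d/dx x**3 = 3 x**2 and d/dx sin(x) = cos(x).
for x in (0.5, 2.0):
    assert np.isclose(numerical_derivative(lambda t: t**3, x), 3 * x**2)
    assert np.isclose(numerical_derivative(np.sin, x), np.cos(x))
print("ok")
```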
Derive the quotient rule from the product rule
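Assuming the product rule and the chain-rule fact $\frac{d}{dx}\left[\frac{1}{g(x)}\right] = -\frac{g'(x)}{g(x)^2}$, derive the quotient rule $\left(\frac{f}{g}\right)' = \frac{f'g - fg'}{g^2}$.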
Sweep h to find the error minimum
For at , compute the central-difference error across . Print each error, confirm with an assert that the minimum occurs at an interior (not the smallest one), then print ok.
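A sketch of such a sweep. The function, evaluation point, and range of step sizes below (`np.exp` at `x = 1.0`, `h` from `1e-1` down to `1e-13`) are assumptions made for illustration, since the exercise's own values are not shown here:

```python
import numpy as np

def central_difference(f, x, h):
    """Central-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

# Assumed setup: f = exp, whose derivative at x0 is exp(x0).
x0 = 1.0
exact = np.exp(x0)
hs = [10.0 ** (-k) for k in range(1, 14)]  # 1e-1, 1e-2, ..., 1e-13
errors = [abs(central_difference(np.exp, x0, h) - exact) for h in hs]

for h, err in zip(hs, errors):
    print(f"h = {h:.0e}   error = {err:.3e}")

# Truncation error dominates for large h, round-off for tiny h,
# so the smallest error should sit strictly inside the sweep.
best = int(np.argmin(errors))
assert 0 < best < len(hs) - 1
print("ok")
```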
Designing a gradient check
You want to gradient-check a hand-written backprop by comparing its analytic gradient against a numerical gradient . (a) Why do practitioners compare the relative error rather than the absolute error? (b) Why use the central difference rather than the forward difference here? (c) Name one failure mode where a correct analytic gradient still fails a naive gradient check, and how to avoid it.
Why h ≈ cube-root of machine epsilon
For the central difference, the total error is roughly , where the first term is truncation and the second is round-off ( is machine epsilon, ). Minimize over to explain why the optimal step is on the order of , and state what that makes the best achievable error.
Chain rule at a point
Let . Use the chain rule to compute at .
Forward pass of the graph
For decomposed as , , , run the forward pass with , , and report .
Local derivative of the square node
The square node computes . What is its local derivative evaluated at ?
Gradient with respect to the bias
For , the backward pass gives . Evaluate it at , , .
Gradient through an add node
In the graph , the gradient arriving at is . What gradient does the add node pass back to ?
Gradient through a multiply node
In the graph , the gradient arriving at is and . What gradient does the multiply node pass back to ?
Why local derivatives multiply
For with , why is a product of the two local rates rather than a sum?
Backward pass with new inputs
For run forward with , , , then use the backward pass to compute . (Recall .)
Diagnosing a vanishing gradient
A 20-layer network uses an activation whose local derivative is at most everywhere. Roughly what happens to the gradient reaching the first layer, and why?
Why cache the intermediates?
Backpropagation stores intermediate values (like in ) during the forward pass instead of recomputing them during the backward pass. In one or two sentences, explain why this matters for cost.
A variable used on two paths
In a graph, the variable feeds two different downstream nodes. When you run the backward pass, how do you combine the gradients arriving at from the two paths?
Implement backprop for (wx+b)²
Implement forward(w, x, b) returning plus a cache, and backward(cache) returning by multiplying local derivatives back through , , . Verify at that the gradients are , then print ok.
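One way this could look, assuming the graph is split into a multiply node, an add node, and a square node. The intermediate names `u` and `v` and the test point `(w, x, b) = (2, 3, 1)` are hypothetical choices, not the exercise's own:

```python
def forward(w, x, b):
    """Forward pass for L = (w*x + b)**2, caching what backward needs."""
    u = w * x      # multiply node
    v = u + b      # add node
    L = v ** 2     # square node
    cache = (w, x, v)
    return L, cache

def backward(cache):
    """Multiply local derivatives back through square, add, multiply."""
    w, x, v = cache
    dL_dv = 2.0 * v        # square node: dL/dv = 2v
    dL_du = dL_dv * 1.0    # add node passes the gradient through unchanged
    dL_db = dL_dv * 1.0
    dL_dw = dL_du * x      # multiply node: du/dw = x
    dL_dx = dL_du * w      # multiply node: du/dx = w
    return dL_dw, dL_dx, dL_db

# Assumed test point: v = 2*3 + 1 = 7, so L = 49 and dL/dv = 14.
L, cache = forward(2.0, 3.0, 1.0)
assert L == 49.0
assert backward(cache) == (42.0, 28.0, 14.0)
print("ok")
```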
Backprop a three-input product
Consider with graph , , . Derive , , and by the backward pass, then evaluate them at , , .
Finite-difference gradient check
Write a central finite-difference checker for . Use with to approximate , , at , assert they match the analytic gradients with np.allclose, then print ok.
Simulate a vanishing-gradient product
The gradient at the first of layers is a product of local derivatives. Numerically compute the product of factors each equal to , and separately each equal to , print both, and assert the first is below and the second is above . End by printing ok.
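A small sketch of the experiment. The depth (20 layers), the two factor values (0.25, the maximum of the sigmoid derivative, and 1.0, the ReLU derivative on an active unit), and the thresholds in the assertions are all assumed for illustration:

```python
import numpy as np

n_layers = 20  # assumed depth

# Product of per-layer local derivatives along one path through the network.
shrinking = np.prod(np.full(n_layers, 0.25))  # 0.25**20 = 2**-40
preserved = np.prod(np.full(n_layers, 1.0))   # stays exactly 1.0

print(shrinking, preserved)
assert shrinking < 1e-10   # assumed threshold
assert preserved > 0.5     # assumed threshold
print("ok")
```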
Why do gradients vanish, and how is it fixed?
Explain, using the chain rule, why deep networks with sigmoid activations suffer vanishing gradients. Then explain mechanistically how two of {ReLU activation, residual (skip) connections, batch/layer normalization} address it — pointing to what each does to the per-layer local derivative.
Reverse-mode vs forward-mode autodiff
Backpropagation is reverse-mode automatic differentiation. For a function $f : \mathbb{R}^n \to \mathbb{R}$ (many inputs, scalar loss; the neural-network case), explain why reverse mode computes all input gradients in roughly one backward sweep, whereas forward-mode autodiff would need about $n$ passes. What does this asymmetry cost, and when would forward mode actually be preferable?
Gradient of the worked example
For we found . Evaluate the gradient at the point . Enter as x, y.
A single partial derivative
Let . Compute and evaluate it at (its value does not depend on ).
Partial of a product with powers
Let . Compute and evaluate it at the point .
Gradient of a linear function
Let . Give the gradient . Enter as x, y, z.
The held variable is a constant, not zero
Let . Compute and evaluate it at the point .
What does the gradient point toward?
At a given point, the gradient points in the direction of what?
Gradient and contour lines
On a contour plot of , what is the angle between the gradient at a point and the contour line passing through that same point?
The sign in the descent step
Gradient descent updates parameters as . Why is there a minus sign?
Shape of the loss gradient
A model has parameters collected in , and is a scalar loss. What is the shape of ?
Reading a single loss partial
During training you compute for one parameter. In one or two sentences, say what this tells you about , and which way gradient descent will move it.
Spot the differentiation error
A student differentiates and writes , reasoning that the and terms 'have in them, so they're zero.' What did they get wrong, and what is the correct ?
Implement a numerical gradient
Write numerical_gradient(f, v, h=1e-5) that approximates at a point v using central differences, one coordinate at a time. Test it on at , confirm it matches the analytic answer with np.allclose, and print ok.
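A minimal sketch for a 1-D point `v`. The test function `f(x, y) = x**2 + 3xy` and the point `(1, 2)` are hypothetical stand-ins for the exercise's own values:

```python
import numpy as np

def numerical_gradient(f, v, h=1e-5):
    """Central-difference gradient of scalar f at v, one coordinate at a time."""
    v = np.asarray(v, dtype=float)
    grad = np.zeros_like(v)
    for i in range(v.size):
        step = np.zeros_like(v)
        step[i] = h  # perturb only coordinate i
        grad[i] = (f(v + step) - f(v - step)) / (2.0 * h)
    return grad

# Assumed test function: f(x, y) = x**2 + 3xy, with gradient (2x + 3y, 3x).
f = lambda p: p[0] ** 2 + 3.0 * p[0] * p[1]
point = np.array([1.0, 2.0])
analytic = np.array([2 * 1.0 + 3 * 2.0, 3 * 1.0])  # [8, 3]
assert np.allclose(numerical_gradient(f, point), analytic)
print("ok")
```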
Why steepest ascent
Assume the directional derivative in a unit direction $\mathbf{u}$ is $D_{\mathbf{u}} f = \nabla f \cdot \mathbf{u}$. Using the geometric form of the dot product, show that $f$ increases fastest when $\mathbf{u}$ points along $\nabla f$.
One gradient-descent step
Minimize by gradient descent, starting at with learning rate . Recall . Compute . Enter as x, y.
Numerical gradients vs. backprop
The central-difference numerical gradient is simple and always available, yet no one trains large networks with it. Explain the cost of the numerical approach for a model with parameters, why backpropagation is dramatically cheaper, and one situation where the numerical gradient is still the right tool.
How big should the step be?
Gradient descent moves . The gradient gives the best direction, but not how far to go. Describe qualitatively what goes wrong when the learning rate is far too large, and what goes wrong when it is far too small. Why can't we just read the ideal step size off the gradient itself?
Find the critical point
Find the critical point of by solving .
One gradient-descent step by hand
Minimize with gradient descent. Starting from with learning rate , compute the value of after one update .
Classify with the second derivative
At a critical point a function satisfies and . What kind of point is it?
The update factor for a quadratic
For , one gradient-descent step is . Compute the multiplicative factor for .
Name the learning rate
In the gradient-descent update , which symbol is the learning rate?
Evaluate a gradient
For the loss , compute the gradient at .
Diagnose a diverging run
You train a model and the loss increases every step until it prints NaN. Of the following, which single change is the most likely fix?
What convexity guarantees
The loss is convex (a single bowl). Which statement is true about running gradient descent with a small enough learning rate?
Why $f' = 0$ is not enough
Gradient descent slows to a near-stop because the gradient is almost zero, yet the loss is still high. Which explanation is consistent with this?
The too-small learning rate
A colleague sets 'to be safe' and reports that after a full day of training the loss has barely moved, though it is slowly decreasing. In one or two sentences, explain what regime this is and the practical trade-off of a very small .
Why the minus sign?
The update is — note the minus sign. Why do we step along rather than ?
Implement 1-D gradient descent
Write gradient_descent(grad, x0, eta, steps) that runs gradient descent in 1-D and returns the final . Use it to minimize (whose gradient is ) starting from with for 200 steps, assert the result is within of , then print ok.
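A sketch of the loop. The objective `f(x) = (x - 3)**2`, the start `x0 = 0`, the learning rate `0.1`, and the tolerance `1e-6` are assumed values chosen for illustration; only the 200 steps come from the exercise text:

```python
def gradient_descent(grad, x0, eta, steps):
    """Run 1-D gradient descent x <- x - eta * grad(x) and return the final x."""
    x = float(x0)
    for _ in range(steps):
        x = x - eta * grad(x)
    return x

# Assumed objective: f(x) = (x - 3)**2, gradient 2(x - 3), minimizer x = 3.
x_final = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0, eta=0.1, steps=200)
assert abs(x_final - 3.0) < 1e-6
print("ok")
```

With these values the distance to the minimizer shrinks by a factor of 0.8 per step, so 200 steps are far more than needed.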
Derive the convergence condition
For , gradient descent is . Derive the exact range of learning rates for which , and identify the value of that converges fastest.
Minimizer of a general quadratic
Using the first- and second-derivative tests, derive the location of the minimum of $f(x) = ax^2 + bx + c$ with $a > 0$, and confirm it is a minimum.
Experiment: convergence vs divergence
Minimize from . Run gradient descent for 20 steps with and again with . Print each final , assert that the first converges (near ) while the second diverges (magnitude above ), then print ok.
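A sketch of the experiment under assumed values: objective `f(x) = x**2`, start `x0 = 1.0`, and learning rates `0.1` and `1.1`. For this objective each step multiplies `x` by `1 - 2*eta`, so the two rates give factors of 0.8 and -1.2:

```python
def gradient_descent(grad, x0, eta, steps):
    """Run 1-D gradient descent and return the final x."""
    x = float(x0)
    for _ in range(steps):
        x -= eta * grad(x)
    return x

grad = lambda x: 2.0 * x  # gradient of the assumed objective f(x) = x**2

x_small = gradient_descent(grad, 1.0, 0.1, 20)  # |1 - 2*eta| = 0.8 < 1
x_large = gradient_descent(grad, 1.0, 1.1, 20)  # |1 - 2*eta| = 1.2 > 1

print(x_small, x_large)
assert abs(x_small) < 0.05   # converging toward the minimizer 0
assert abs(x_large) > 10.0   # growing in magnitude each step
print("ok")
```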
Non-convex, yet it works
Classical optimization theory guarantees gradient descent reaches the global minimum only for convex losses. Deep-network losses are highly non-convex, so we should expect descent to get trapped in bad local minima — yet in practice large models train well. Give the honest, current explanation for why, and state one concrete consequence for how practitioners think about training (e.g. what they do or do not worry about).
Why saddles, not minima, dominate in high dimensions
A critical point ($\nabla f = 0$) is a local minimum only if the curvature is positive in every direction. Using this fact, argue heuristically why, as the number of parameters grows, a random critical point is far more likely to be a saddle point than a local minimum. Then explain why this makes escaping such points a matter of finding a downhill direction rather than being truly stuck.
Shape of a matrix product
A has shape (3, 4) and B has shape (4, 2). What is the shape of A @ B? Enter it as a tuple, e.g. (r, c).
Dot product by hand
Compute the dot product a @ b for a = np.array([1.0, 2.0, 3.0]) and b = np.array([4.0, 5.0, 6.0]).
Mean squared error by hand
For targets y = np.array([3.0, 5.0]) and predictions yhat = np.array([1.0, 4.0]), compute the mean squared error $\frac{1}{N}\sum_{i=1}^{N} (\hat{y}_i - y_i)^2$.
Shape of a per-feature mean
X has shape (100, 4) — 100 examples, 4 features. What is the shape of X.mean(axis=0)? Enter as a tuple, e.g. (k,).
The other axis
Same X of shape (100, 4). What is the shape of X.mean(axis=1)? Enter as a tuple.
Which expression is the dot product?
For two 1-D arrays a and b of equal length, which expression computes their dot product as a single scalar?
Shape of a distance matrix
X holds points as rows, shape (N, D). What is the shape of the pairwise Euclidean distance matrix ?
Debug a bias-shaped layer
In a batched layer, X is (N, D), W is (H, D), and an engineer writes b = np.zeros((H, 1)), then Y = X @ W.T + b. The result is not the intended (N, H) output with one bias per hidden unit. What does NumPy do with (N, H) + (H, 1) when N ≠ H, what does it do when N = H, and what is the one-character-idea fix?
Shape of the difference block
X has shape (6, 3). What is the shape of X[:, None, :] - X[None, :, :] (the broadcast used to build a distance matrix)? Enter as a tuple, e.g. (a, b, c).
Why guard the standard deviation?
Standardization computes Z = (X - mu) / sd with sd = X.std(axis=0). Explain in one or two sentences what happens if a feature column is constant, and how the guard sd = np.where(sd > 0, sd, 1.0) prevents it without distorting the data.
Loop vs vectorized: what actually gets faster?
An engineer replaces a Python for-loop dot product with a @ b and sees a large speedup. Which statement best explains why?
Implement per-column standardization
Write standardize(X) for a batch X of shape (N, D) that returns (X - mu) / sd using per-column statistics (axis=0), guarding against zero standard deviation. Verify the output has zero mean and unit std per column, assert the output shape equals the input shape, then print ok.
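A minimal sketch of per-column standardization with the zero-variance guard; the seeded test data is an arbitrary choice for illustration:

```python
import numpy as np

def standardize(X):
    """Standardize each column of X (shape (N, D)) to zero mean and unit std."""
    mu = X.mean(axis=0)             # per-column mean, shape (D,)
    sd = X.std(axis=0)              # per-column std, shape (D,)
    sd = np.where(sd > 0, sd, 1.0)  # constant columns: divide by 1 instead of 0
    return (X - mu) / sd

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 4))
Z = standardize(X)

assert Z.shape == X.shape
assert np.allclose(Z.mean(axis=0), 0.0)
assert np.allclose(Z.std(axis=0), 1.0)
print("ok")
```

A constant column comes out as all zeros rather than NaN, because its centered values are already zero and the guard replaces the zero divisor with 1.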
Batched linear layer: loop vs vectorized
Implement layer_loop(X, W, b) with explicit loops and layer_vec(X, W, b) as X @ W.T + b, for X of shape (N, D), W of shape (H, D), and b of shape (H,). Use a fixed seed, assert both outputs are (N, H) and agree with np.allclose, then print ok.
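One possible sketch of the comparison; the sizes `N, D, H = 5, 3, 4` and the seed are arbitrary choices for illustration:

```python
import numpy as np

def layer_loop(X, W, b):
    """Linear layer with explicit loops: Y[n, h] = sum_d X[n, d] * W[h, d] + b[h]."""
    N, D = X.shape
    H = W.shape[0]
    Y = np.zeros((N, H))
    for n in range(N):
        for h in range(H):
            total = b[h]
            for d in range(D):
                total += X[n, d] * W[h, d]
            Y[n, h] = total
    return Y

def layer_vec(X, W, b):
    """Same layer, vectorized: (N, D) @ (D, H) + (H,) broadcasts to (N, H)."""
    return X @ W.T + b

rng = np.random.default_rng(0)
N, D, H = 5, 3, 4
X = rng.normal(size=(N, D))
W = rng.normal(size=(H, D))
b = rng.normal(size=H)

Y_loop, Y_vec = layer_loop(X, W, b), layer_vec(X, W, b)
assert Y_loop.shape == (N, H) and Y_vec.shape == (N, H)
assert np.allclose(Y_loop, Y_vec)
print("ok")
```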
Debug a matmul shape error
An engineer has inputs X of shape (N, D) and weights W of shape (H, D) and writes Y = X @ W. It raises ValueError: matmul: ... (N,D) and (H,D). Explain precisely why matmul rejects these shapes, give the corrected expression, and state the resulting shape.
Vectorized Euclidean distance matrix
Implement dist_matrix(X) for X of shape (N, D) returning the (N, N) matrix of pairwise Euclidean distances, using broadcasting (no Python loop over pairs). Verify the diagonal is zero and the matrix is symmetric, assert the shape, then print ok.
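A minimal broadcasting sketch; the test shape `(6, 3)` and seed are arbitrary. Note that it materializes the full `(N, N, D)` difference block, so it is only reasonable for modest `N`:

```python
import numpy as np

def dist_matrix(X):
    """Pairwise Euclidean distances between the rows of X, shape (N, N)."""
    diff = X[:, None, :] - X[None, :, :]       # (N, 1, D) - (1, N, D) -> (N, N, D)
    return np.sqrt((diff ** 2).sum(axis=-1))   # sum over features -> (N, N)

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
Dm = dist_matrix(X)

assert Dm.shape == (6, 6)
assert np.allclose(np.diag(Dm), 0.0)  # each point is at distance 0 from itself
assert np.allclose(Dm, Dm.T)          # distance is symmetric
print("ok")
```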
When vectorization costs too much memory
The broadcast distance matrix materializes an (N, N, D) difference block. For and in float64, estimate that block's memory, explain why the fully vectorized one-liner becomes impractical, and describe a strategy that keeps most of the vectorization speed without allocating the full block.
Choosing the step size for a central difference
A central difference has truncation error , which suggests taking as small as possible. Yet in float64 a tiny makes the estimate worse. Explain the two competing error sources, why the total error is minimized at an intermediate (roughly near machine epsilon ), and what this implies for verifying analytic gradients in ML.
Predict for one example
A linear model has weights and bias . For the feature vector , compute the prediction .
Compute the MSE
Predictions are and targets are . Compute the mean squared error .
The residual vector
With and , compute the residual vector . Enter as a, b, c.
Bias gradient by hand
For examples the residuals are . Compute the bias gradient .
Weight gradient by hand (one feature)
A single-feature dataset has (so , ) and residuals . Compute the weight gradient .
When is the MSE zero?
The mean squared error equals exactly $0$ in which situation?
Shape of the weight gradient
The feature matrix has shape ( examples, features). What is the shape of ?
Why the transpose?
In the weight gradient we multiply the length- residual by , not by (where is ). Which multiplication produces a length- result?
What a large learning rate does
During gradient descent you notice the loss increasing each epoch, eventually reaching inf. In one or two sentences, explain the most likely cause and the first fix to try.
Reading the sign of the bias gradient
At some point during training $\partial L / \partial b$ is positive. What does this say about the current predictions, and which way will the update move $b$?
Why GD and the closed form agree
For linear regression, well-tuned gradient descent converges to the same weights as the normal equations $X^\top X\, w = X^\top y$. What property of the MSE loss guarantees this, and why would the guarantee fail for a general neural network?
Implement predict, mse, and gradients
Implement predict(X, w, b), mse(yhat, y), and gradients(X, y, w, b) returning (dw, db) exactly per the formulas. Then, on a seeded random dataset, verify with assert that dw has shape (d,), that db is a scalar, and print ok.
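A sketch under a stated assumption: the loss is taken to be the plain mean of squared residuals, $L = \frac{1}{N}\sum_i (\hat{y}_i - y_i)^2$, which gives `dw = (2/N) X.T @ r` and `db = (2/N) * sum(r)` with `r = yhat - y`. If the course defines the loss with a different constant (for example `1/(2N)`), the factors change accordingly:

```python
import numpy as np

def predict(X, w, b):
    """Linear prediction: X is (N, d), w is (d,), b is a scalar."""
    return X @ w + b

def mse(yhat, y):
    """Mean of squared residuals (assumed loss convention)."""
    return np.mean((yhat - y) ** 2)

def gradients(X, y, w, b):
    """Gradients of the assumed MSE with respect to w and b."""
    N = X.shape[0]
    r = predict(X, w, b) - y       # residual vector, shape (N,)
    dw = (2.0 / N) * (X.T @ r)     # shape (d,)
    db = (2.0 / N) * r.sum()       # scalar
    return dw, db

rng = np.random.default_rng(0)
N, d = 50, 3
X = rng.normal(size=(N, d))
y = rng.normal(size=N)
w = rng.normal(size=d)
b = 0.5

dw, db = gradients(X, y, w, b)
assert dw.shape == (d,)
assert np.ndim(db) == 0
print("ok")
```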
One gradient-descent step by hand
Start from , on the dataset , . The gradients there are and . With learning rate , compute the updated weight .
Derive the bias gradient
Starting from with , derive using the chain rule. State explicitly.
Finite-difference gradient check
You have an analytic gradients(X, y, w, b). Write a central-difference gradient check that perturbs each parameter by , estimates the gradient numerically as , and asserts the max absolute difference from the analytic gradient is < 1e-5. Print ok.
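A self-contained sketch of such a check. It assumes the loss is the plain mean of squared residuals and uses a perturbation size of `1e-6`; both are assumptions, since the exercise's own formula and step are not shown here. The helper names `loss` and `numeric_gradients` are hypothetical:

```python
import numpy as np

def loss(X, y, w, b):
    """Assumed loss: mean of squared residuals."""
    return np.mean((X @ w + b - y) ** 2)

def gradients(X, y, w, b):
    """Analytic gradients of the assumed loss."""
    N = X.shape[0]
    r = X @ w + b - y
    return (2.0 / N) * (X.T @ r), (2.0 / N) * r.sum()

def numeric_gradients(X, y, w, b, eps=1e-6):
    """Central differences, perturbing one parameter at a time."""
    dw = np.zeros_like(w)
    for j in range(w.size):
        e = np.zeros_like(w)
        e[j] = eps
        dw[j] = (loss(X, y, w + e, b) - loss(X, y, w - e, b)) / (2.0 * eps)
    db = (loss(X, y, w, b + eps) - loss(X, y, w, b - eps)) / (2.0 * eps)
    return dw, db

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = rng.normal(size=40)
w = rng.normal(size=3)
b = 0.5

dw, db = gradients(X, y, w, b)
ndw, ndb = numeric_gradients(X, y, w, b)
max_diff = max(np.max(np.abs(dw - ndw)), abs(db - ndb))
assert max_diff < 1e-5
print("ok")
```

Because the loss is quadratic in the parameters, the central difference has no truncation error here, so the remaining discrepancy is round-off only.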
Normal equations vs. gradient descent
You must fit a least-squares model in two scenarios: (a) examples, features; (b) examples streamed from disk, sparse features. For each, argue which of the normal equations or gradient descent you would use, citing the cost that dominates. Then name one situation where the normal equations are not merely slow but impossible.
Deriving the ridge-regression gradient
Ridge regression minimizes with . (i) Derive . (ii) Show that setting it to zero gives the modified normal equations (ignore the bias / assume centered data). (iii) Explain in one sentence why this fixes a singular .