Capstone Project
From Equation to Implementation: Reproducing Linear Regression
Assemble everything — vectors, matrices, derivatives, gradients, and NumPy — into a working learning algorithm you fully understand.
Why this project
Every idea in this course — vectors, matrices, dot products, derivatives, partial derivatives, gradients, and the mechanics of NumPy — exists so that you can build a model that learns from data. Linear regression is the simplest model where all of these pieces come together, and it is the honest foundation underneath neural networks, logistic regression, and most of classical machine learning. If you can derive it on paper and reproduce it in code from scratch, you understand the machinery, not just the API call.
In this capstone you will implement linear regression end to end with NumPy only — no scikit-learn, no PyTorch, no TensorFlow. You will derive the loss and its gradients by hand, translate those equations into vectorized code, verify your gradients numerically, train with gradient descent, and study how the learning rate and the data itself shape the outcome.
The model
Given examples, each with features, we predict a real number per example with an affine model:
Written per example, .
The loss
We measure error with mean squared error (MSE):
Training means choosing and to make small.
Every variable and its shape
- — feature matrix, shape . Row is example ; column is feature .
- — target vector, shape . The true value for each example.
- — weight vector, shape . One weight per feature.
- — bias (intercept), a scalar. Shifts the whole prediction.
- — prediction vector, shape , equal to .
- — residual vector, shape .
- — the loss, a single scalar.
- (written ) — gradient w.r.t. weights, shape .
- (written ) — gradient w.r.t. bias, a scalar.
- — the learning rate, a positive scalar step size.
Keeping shapes straight is the single most important habit here: contracts the axis, leaving ; broadcasting keeps it ; and contracts the axis, giving back a gradient.
The gradients you will derive
Differentiating and using the chain rule with gives the two results that this whole project rests on:
Gradient descent then repeatedly steps downhill:
The plan
You will build this up one verifiable step at a time:
- Implement prediction — code and confirm it returns shape .
- Derive MSE — write the loss and implement it as a scalar.
- Derive gradients — differentiate by hand to get and .
- Implement gradient descent — one update step for and .
- Write the training loop — repeat the step for many epochs, recording the loss.
- Gradient check — compare your analytic gradients to a central finite-difference estimate; the max absolute difference must be below .
- Plot predictions — scatter vs (or fit line on 1-D data).
- Plot the loss curve — per epoch should fall smoothly.
- Experiment with learning rates — watch too-small crawl, good converge, and too-large diverge.
- Add multiple features — confirm everything generalizes from to with no code changes.
- Analyze failure cases — high noise, bad scaling, collinear features, exploding learning rates.
- Write the report — a short research-style write-up with concrete numbers.
Compare your gradient-descent solution against the closed-form normal equations to prove your optimizer converges to the right place.
Learning objectives
- Derive the MSE loss and its gradients with respect to w and b by hand, tracking the shape of every quantity.
- Translate those equations into correct, vectorized NumPy using X @ w + b and X.T @ (yhat - y).
- Verify analytic gradients against a central finite-difference estimate to within 1e-5.
- Implement batch gradient descent and a training loop that records a decreasing loss history.
- Study how the learning rate controls convergence versus divergence, and how data scaling and noise affect the fit.
- Cross-check the gradient-descent solution against the closed-form normal equations and report results clearly.
Milestones
0/15 doneDataset generator
A reproducible synthetic dataset so your results are checkable. Run it to see the data you'll fit.
Starter files
Fill in the functions marked TODO. Each has a docstring stating the exact shapes.
"""Starter file for the linear-regression capstone.
Fill in every function marked TODO. Shapes are stated in each docstring.
Conventions (n examples, d features):
X : (n, d) w : (d,) b : scalar y, yhat : (n,)
"""
import numpy as np
def predict(X, w, b):
"""Return yhat = X @ w + b.
X : (n, d) w : (d,) b : scalar -> yhat : (n,)
"""
# TODO: implement the linear model yhat = X @ w + b
raise NotImplementedError
def mse(yhat, y):
"""Mean squared error (1/n) * sum((yhat - y) ** 2).
yhat : (n,) y : (n,) -> scalar
"""
# TODO: implement the mean squared error
raise NotImplementedError
def gradients(X, y, w, b):
"""Analytic gradients of MSE.
X : (n, d) y : (n,) w : (d,) b : scalar
Returns (dw, db) with:
dw = (2/n) * X.T @ (yhat - y) shape (d,)
db = (2/n) * sum(yhat - y) scalar
"""
# TODO: compute residual yhat - y, then dw and db
raise NotImplementedError
def train(X, y, lr=0.1, epochs=1000):
"""Full-batch gradient descent.
X : (n, d) y : (n,)
Returns (w, b, loss_history) with loss_history length epochs + 1.
Update rule: w <- w - lr * dw ; b <- b - lr * db
"""
# TODO: initialise w = zeros(d), b = 0.0, then loop epochs times
raise NotImplementedErrorValidation tests
Your implementation is correct when these pass — including a finite-difference gradient check.
Expected outputs
When your implementation is correct you should observe:
- A decreasing loss curve. Starting from , the initial MSE is
large (tens, on the default dataset). With a good learning rate it drops quickly for the first few dozen epochs and then flattens toward a small floor set by the noise — on the order of for noise standard deviation .
- Recovered parameters close to the truth. The learned should match the
true weights and the learned the true bias to two or three decimals. The remaining gap is noise, not error.
- A passing gradient check. The maximum absolute difference between your
analytic gradient and the central finite-difference estimate should be roughly to , and always below the threshold.
- High on clean data. With small noise the coefficient of determination
should sit above , i.e. the model explains essentially all the variance.
- Agreement with the closed form. Gradient descent, run long enough with a
sensible learning rate, should land within about of the normal-equations solution for every weight and the bias.
Every runnable script in this project ends by printing .
Rubric
Correct derivation — the report shows and derived with the chain rule, with shapes stated.
Working vectorized code — predict, mse, gradients, and the training loop use NumPy array operations (no Python loops over examples) and return the correct shapes.
Passing gradient check — a central finite-difference test confirms analytic gradients agree to within .
Convergence evidence — a loss curve and recovered parameters demonstrate training reduces MSE and recovers the true on clean data.
Experiments — learning-rate sweep, multiple-feature run, and at least three failure cases are documented with concrete numbers and explanations.
Clear report — Abstract, Setup, Method, Results, Discussion, and Conclusion are present, numeric, and honest about limitations.
Common mistakes
Forgetting the factor of in the gradients (it comes from differentiating the square) — the gradient check will fail if you drop or double it.
Averaging with the wrong or summing instead of averaging, so the gradient scale is off and convergence behaves strangely.
Transposing wrongly: the weight gradient must be , shape , not which has the wrong shape.
Mutating w in place (for example w -= lr dw) so that history entries or the gradient check alias the same array; prefer w = w - lr dw or explicit copies.
Using a learning rate that is too large, causing the loss to increase or overflow to NaN, and concluding the math is wrong when it is really the step size.
Not scaling features: when columns of have very different magnitudes, one learning rate cannot suit all weights and convergence stalls.
Extensions
Swap the L2 (MSE) loss for an L1 (mean absolute error) loss, derive its subgradient, and compare robustness to outliers.
Add feature scaling (standardize each column to zero mean and unit variance) and measure how it speeds up convergence.
Sweep the noise standard deviation and plot how final MSE and degrade as noise grows.
Inject a few gross outliers and quantify how much they distort the L2 fit versus an L1 fit.
Grow the feature count and confirm the vectorized code is unchanged; compare gradient-descent time to the closed form.
Compare against the closed-form normal equations and discuss numerical stability, ill-conditioning, and when the closed form is preferable to gradient descent.
Reference implementation
Try the project yourself first. When you're ready to compare, reveal the complete, gradient-checked reference solution.