Final Project · ML Math Foundations

Why this project

Every idea in this course — vectors, matrices, dot products, derivatives, partial derivatives, gradients, and the mechanics of NumPy — exists so that you can build a model that learns from data. Linear regression is the simplest model where all of these pieces come together, and it is the honest foundation underneath neural networks, logistic regression, and most of classical machine learning. If you can derive it on paper and reproduce it in code from scratch, you understand the machinery, not just the API call.

In this capstone you will implement linear regression end to end with NumPy only — no scikit-learn, no PyTorch, no TensorFlow. You will derive the loss and its gradients by hand, translate those equations into vectorized code, verify your gradients numerically, train with gradient descent, and study how the learning rate and the data itself shape the outcome.

The model

Given $n$ examples, each with $d$ features, we predict a real number per example with an affine model:

\hat{y} = X w + b

Written per example, $\hat{y}_i = \sum_{j=1}^{d} X_{ij} w_j + b$.
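As a minimal sketch of the affine model, the prediction is a single matrix-vector product plus a broadcast scalar. The function name `predict` and the tiny hand-checkable data are illustrative assumptions, not part of the course starter files.

```python
import numpy as np

def predict(X, w, b):
    """Affine model: X has shape (n, d), w has shape (d,), b is a scalar."""
    return X @ w + b  # result has shape (n,)

# Hypothetical data: n = 2 examples, d = 3 features.
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 4.0]])
w = np.array([2.0, -3.0, 0.5])
b = 4.0

y_hat = predict(X, w, b)
print(y_hat)  # [7. 3.]
```

Each entry can be verified by hand from the per-example sum: $1 \cdot 2 + 0 \cdot (-3) + 2 \cdot 0.5 + 4 = 7$ and $0 \cdot 2 + 1 \cdot (-3) + 4 \cdot 0.5 + 4 = 3$.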

The loss

We measure error with mean squared error (MSE):

L(w, b) = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2

Training means choosing $w$ and $b$ to make $L$ small.
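The MSE formula translates almost symbol for symbol into NumPy. The sketch below assumes a hypothetical helper named `mse` and uses three made-up targets so that the result can be checked by hand.

```python
import numpy as np

def mse(y_hat, y):
    """Mean squared error between predictions and targets, both shape (n,)."""
    r = y_hat - y                   # residuals, shape (n,)
    return float(np.mean(r ** 2))   # a single scalar

# Hypothetical values: only the last prediction is off, by 2.
y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.0, 2.0, 5.0])

loss = mse(y_hat, y)  # (0 + 0 + 4) / 3
```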

Every variable and its shape

$X$ — feature matrix, shape $(n, d)$ . Row $i$ is example $i$ ; column $j$ is feature $j$ .
$y$ — target vector, shape $(n,)$ . The true value for each example.
$w$ — weight vector, shape $(d,)$ . One weight per feature.
$b$ — bias (intercept), a scalar. Shifts the whole prediction.
$\hat{y}$ — prediction vector, shape $(n,)$ , equal to $X w + b$ .
$r = \hat{y} - y$ — residual vector, shape $(n,)$ .
$L$ — the loss, a single scalar.
$\frac{\partial L}{\partial w}$ (written $dw$ ) — gradient w.r.t. weights, shape $(d,)$ .
$\frac{\partial L}{\partial b}$ (written $db$ ) — gradient w.r.t. bias, a scalar.
$\eta$ — the learning rate, a positive scalar step size.

Keeping shapes straight is the single most important habit here: $X w$ contracts the $d$ axis, leaving $(n,)$; broadcasting $+ b$ keeps it $(n,)$; and $X^{\top} r$ contracts the $n$ axis, giving back a $(d,)$ gradient.
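One way to build the shape habit is to check each intermediate array explicitly. The sketch below uses arbitrary random data (an assumption for illustration) and also shows what goes wrong when a target is stored as a column of shape $(n, 1)$ instead of $(n,)$: NumPy broadcasting silently produces an $(n, n)$ array.

```python
import numpy as np

n, d = 5, 3
rng = np.random.default_rng(0)   # arbitrary seed, illustration only
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = np.zeros(d)
b = 0.0

y_hat = X @ w + b     # contracts the d axis, then broadcasts b
r = y_hat - y         # element-wise difference
g = X.T @ r           # contracts the n axis

shapes = (y_hat.shape, r.shape, g.shape)

# Shape mismatch to watch for: (n,) minus (n, 1) broadcasts to (n, n).
bad = y_hat - y.reshape(n, 1)
```

Asserting the expected shapes at the top of each function is a cheap way to catch the `bad` case before it distorts the loss.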

The gradients you will derive

Differentiating $L = \frac{1}{n} \sum_i (\hat{y}_i - y_i)^2$ and using the chain rule with $\hat{y}_i = \sum_j X_{ij} w_j + b$ gives the two results that this whole project rests on:

\frac{\partial L}{\partial w} = \frac{2}{n} X^{\top} (\hat{y} - y) \quad \text{shape } (d,)

\frac{\partial L}{\partial b} = \frac{2}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) \quad \text{scalar}
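These two formulas can be sketched in vectorized form as follows. The name `gradients` and the two-example dataset are assumptions chosen so that the values are easy to confirm on paper.

```python
import numpy as np

def gradients(X, y, w, b):
    """Analytic MSE gradients: dw has shape (d,), db is a scalar."""
    n = X.shape[0]
    r = X @ w + b - y              # residuals y_hat - y, shape (n,)
    dw = (2.0 / n) * (X.T @ r)     # (2/n) X^T (y_hat - y)
    db = (2.0 / n) * np.sum(r)     # (2/n) sum_i (y_hat_i - y_i)
    return dw, db

# Hypothetical data: n = 2, d = 1, evaluated at w = 0, b = 0.
X = np.array([[1.0], [2.0]])
y = np.array([1.0, 2.0])

dw, db = gradients(X, y, np.zeros(1), 0.0)
```

At $w = 0$, $b = 0$ the residuals are $(-1, -2)$, so $dw = \frac{2}{2}(1 \cdot (-1) + 2 \cdot (-2)) = -5$ and $db = \frac{2}{2}(-1 - 2) = -3$. Both are negative, so a descent step increases $w$ and $b$, as expected for targets that lie above the current predictions.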

Gradient descent then repeatedly steps downhill:

w \leftarrow w - \eta \, \frac{\partial L}{\partial w}, \qquad b \leftarrow b - \eta \, \frac{\partial L}{\partial b}
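Repeating the update rule gives a complete training loop. The sketch below is one possible arrangement, not the reference implementation: the function name `train`, the default `lr=0.1` and `epochs=200`, and the synthetic dataset (standard-normal features, seed 0) are assumptions. The true weights $[2.0, -3.0, 0.5]$, bias $4.0$, and noise standard deviation $0.1$ match the values quoted elsewhere in this project.

```python
import numpy as np

def train(X, y, lr=0.1, epochs=200):
    """Batch gradient descent on MSE, starting from w = 0, b = 0."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    losses = []
    for _ in range(epochs):
        r = X @ w + b - y                    # residuals, shape (n,)
        losses.append(float(np.mean(r ** 2)))
        w = w - lr * (2.0 / n) * (X.T @ r)   # w <- w - eta * dL/dw
        b = b - lr * (2.0 / n) * np.sum(r)   # b <- b - eta * dL/db
    return w, b, losses

# Assumed synthetic data standing in for the course dataset generator.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -3.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=200)

w, b, losses = train(X, y)
```

Recording the loss before each update means `losses[0]` is the loss at the zero initialization, which makes the loss curve easy to interpret.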

The plan

You will build this up one verifiable step at a time:

Implement prediction — code $\hat{y} = X w + b$ and confirm it returns shape $(n,)$ .
Derive MSE — write the loss and implement it as a scalar.
Derive gradients — differentiate $L$ by hand to get $dw$ and $db$ .
Implement gradient descent — one update step for $w$ and $b$ .
Write the training loop — repeat the step for many epochs, recording the loss.
Gradient check — compare your analytic gradients to a central finite-difference estimate; the max absolute difference must be below $10^{-5}$ .
Plot predictions — scatter $y$ vs $\hat{y}$ (or fit line on 1-D data).
Plot the loss curve — $L$ per epoch should fall smoothly.
Experiment with learning rates — watch too-small crawl, good converge, and too-large diverge.
Add multiple features — confirm everything generalizes from $d = 1$ to $d > 1$ with no code changes.
Analyze failure cases — high noise, bad scaling, collinear features, exploding learning rates.
Write the report — a short research-style write-up with concrete numbers.
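The gradient-check step in the plan above can be sketched as follows. The helper names (`loss`, `analytic_grads`, `numeric_grads`), the step size `eps=1e-6`, and the random test point are assumptions for illustration. Each parameter is perturbed by $\pm\varepsilon$ and the slope $(L(\theta + \varepsilon) - L(\theta - \varepsilon)) / 2\varepsilon$ is compared with the analytic value.

```python
import numpy as np

def loss(X, y, w, b):
    return np.mean((X @ w + b - y) ** 2)

def analytic_grads(X, y, w, b):
    n = X.shape[0]
    r = X @ w + b - y
    return (2.0 / n) * (X.T @ r), (2.0 / n) * np.sum(r)

def numeric_grads(X, y, w, b, eps=1e-6):
    """Central finite differences, one parameter at a time."""
    dw = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = eps
        dw[j] = (loss(X, y, w + step, b) - loss(X, y, w - step, b)) / (2 * eps)
    db = (loss(X, y, w, b + eps) - loss(X, y, w, b - eps)) / (2 * eps)
    return dw, db

# Arbitrary test point; any non-trivial w and b will do.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
w = rng.normal(size=3)
b = 0.5

dw_a, db_a = analytic_grads(X, y, w, b)
dw_n, db_n = numeric_grads(X, y, w, b)
max_diff = max(np.max(np.abs(dw_a - dw_n)), abs(db_a - db_n))
```

Because MSE is quadratic in $w$ and $b$, the central difference has no truncation error here, so any discrepancy comes from floating-point round-off alone.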

Compare your gradient-descent solution against the closed-form normal-equations solution to confirm that your optimizer converges to the right place.
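A minimal sketch of that comparison is shown here, under stated assumptions: the synthetic data, the learning rate, and the epoch count are illustrative, and the closed-form solve uses `np.linalg.lstsq`, which minimizes the same least-squares objective as the normal equations without explicitly forming $(A^{\top} A)^{-1}$. The bias is handled by appending a column of ones to $X$.

```python
import numpy as np

# Assumed synthetic data with the true parameters used in this project.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -3.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=200)

# Closed form: augment X with a ones column so the last coefficient is b.
A = np.column_stack([X, np.ones(len(X))])
theta, *_ = np.linalg.lstsq(A, y, rcond=None)
w_closed, b_closed = theta[:-1], theta[-1]

# Gradient descent from w = 0, b = 0 (assumed lr = 0.1, 500 epochs).
n, d = X.shape
w, b = np.zeros(d), 0.0
for _ in range(500):
    r = X @ w + b - y
    w = w - 0.1 * (2.0 / n) * (X.T @ r)
    b = b - 0.1 * (2.0 / n) * np.sum(r)

# Largest disagreement over all weights and the bias.
gap = max(np.max(np.abs(w - w_closed)), abs(b - b_closed))
```

Both methods minimize the same convex loss, so any remaining `gap` reflects how far gradient descent still is from convergence rather than a difference in the target.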

When your implementation is correct you should observe:

A decreasing loss curve. Starting from $w = 0$, $b = 0$, the initial MSE is large (tens, on the default dataset). With a good learning rate it drops quickly for the first few dozen epochs and then flattens toward a small floor set by the noise variance: on the order of $10^{-2}$ for noise standard deviation $0.1$, since $0.1^2 = 10^{-2}$.
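To see how the learning rate shapes this curve, one can rerun the same descent with several step sizes and compare the final loss. The sketch below is illustrative only: the helper `final_loss`, the three rates, the 50-epoch budget, and the synthetic data are all assumptions, and the right rates for the course dataset may differ.

```python
import numpy as np

def final_loss(X, y, lr, epochs=50):
    """Run gradient descent from zeros and return the final MSE."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        r = X @ w + b - y
        w = w - lr * (2.0 / n) * (X.T @ r)
        b = b - lr * (2.0 / n) * np.sum(r)
    return float(np.mean((X @ w + b - y) ** 2))

# Assumed synthetic data with the true parameters used in this project.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -3.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=200)

initial = float(np.mean(y ** 2))  # MSE at w = 0, b = 0
results = {lr: final_loss(X, y, lr) for lr in (0.001, 0.1, 1.5)}

for lr, value in results.items():
    print(f"lr={lr}: final MSE = {value:.4g} (initial {initial:.4g})")
```

On this data the smallest rate barely moves the loss in 50 epochs, the middle rate reaches the noise floor, and the largest rate ends with a loss far above where it started.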

Recovered parameters close to the truth. The learned $w$ should match the true weights $[2.0, -3.0, 0.5]$ and the learned $b$ the true bias $4.0$ to two or three decimals. The remaining gap is noise, not error.

A passing gradient check. The maximum absolute difference between your analytic gradient and the central finite-difference estimate should be roughly $10^{-8}$ to $10^{-10}$, and always below the $10^{-5}$ threshold.

High $R^2$ on clean data. With small noise the coefficient of determination $R^2 = 1 - \mathrm{SS_{res}} / \mathrm{SS_{tot}}$ should sit above $0.99$, i.e. the model explains essentially all the variance.
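The definition of $R^2$ can be sketched directly from its two sums of squares. The function name `r_squared` and the four sample targets are assumptions. The two calls check the boundary cases: perfect predictions give $1$, and always predicting the mean of $y$ gives $0$.

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_hat) ** 2)        # unexplained variation
    ss_tot = np.sum((y - np.mean(y)) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Hypothetical targets.
y = np.array([1.0, 2.0, 3.0, 4.0])

perfect = r_squared(y, y)                           # predictions equal targets
baseline = r_squared(y, np.full_like(y, y.mean()))  # always predict the mean
```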

Agreement with the closed form. Gradient descent, run long enough with a sensible learning rate, should land within about $10^{-2}$ of the normal-equations solution for every weight and the bias.

Every runnable script in this project ends by printing "ok".

From Equation to Implementation: Reproducing Linear Regression
