Linear Regression from Scratch

Everything so far, assembled into one model

For nineteen chapters we have built pieces. A vector is a list of numbers; a matrix stacks vectors into a table; a matrix–vector product applies a linear map; a dot product is a weighted sum. Then calculus gave us the derivative as a rate of change, the chain rule for composing rates, the gradient as the vector of partials, and gradient descent as the loop that walks downhill on a loss surface. Each was interesting on its own. This chapter is where they click together into the first real learning algorithm: linear regression, trained from scratch.

Nothing new is required — that is the point. If you can multiply $X\mathbf{w}$ , measure error with a squared distance, differentiate that error, and step against the gradient, you can train a model. And the shape of what we build here — a prediction function, a scalar loss, its gradient, and a descent loop — is the exact template behind every supervised model you will ever train, up to and including deep networks. Master this one and the rest are variations on the theme.

Intuition: a straight line that fits a cloud of points

Imagine a scatter of points, each an input $x$ paired with an output $y$ : square footage vs. price, dose vs. response, hours studied vs. score. Linear regression draws the straight line (in higher dimensions, the flat hyperplane) that passes as close as possible to all of them at once. "As close as possible" needs a definition, and we will pick the mean squared error — the average of the squared vertical gaps between the line and the points.

Two knobs control the line: its slope(s) $\mathbf{w}$ (how steeply the prediction rises per unit of each feature) and its intercept $b$ (where it crosses when all features are zero). Training means turning those two knobs until the average squared gap is as small as it can be. Gradient descent is how we turn them: the gradient tells us which way each knob must move to shrink the error, and we take small steps until the error stops dropping.

Interactive Lab: Linear-Regression Laboratory

Drag the points, or the line, and watch the error change. Notice that the best-fit line is a balance: pushing it to hug one point better pulls it away from others, and the minimum-error line is the compromise where no small nudge helps. That balance point is exactly what the math below computes in closed form.

The formal model and loss

We have $n$ training examples, each with $d$ features. Stack them: row $i$ of the matrix $X$ is the feature vector of example $i$ .

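Each prediction $\hat{y}_i$ is just a neuron from Chapter 7: a dot product of the weights with one row of $X$, plus a bias, $\hat{y}_i = \sum_{k=1}^{d} X_{ik} w_k + b$. The model stacks $n$ of them, one per example, and computes them all at once with a single matrix–vector product: $\hat{\mathbf{y}} = X\mathbf{w} + b$, where the scalar $b$ is added to every entry.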

We measure the fit with the mean squared error, $L = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2$. The squaring does two jobs: it makes every error term non-negative (so over- and under-shooting both count as error and cannot cancel), and it punishes large misses far more than small ones. It is also smooth, which is what lets us differentiate it.

Symbol	Meaning	Type	Shape	Role
$X$	Feature matrix, one example per row	matrix	n×d	fixed
$\mathbf{w}$	Weight vector (one weight per feature)	vector	d	learned
$b$	Bias / intercept (scalar)	scalar	1	learned
$\mathbf{y}$	True targets	vector	n	fixed
$\hat{\mathbf{y}}$	Predictions, $X\mathbf{w}+b$	vector	n	computed
$\hat{\mathbf{y}} - \mathbf{y}$	Residual vector	vector	n	computed
$L$	Mean squared error (loss)	scalar	1	computed
$\nabla_{\mathbf{w}} L$	Gradient of $L$ w.r.t. weights	vector	d	computed
$\partial L / \partial b$	Gradient of $L$ w.r.t. bias	scalar	1	computed
$n$	Number of training examples	integer	1	fixed
$d$	Number of features	integer	1	fixed
$\eta$	Learning rate (step size)	scalar	1	fixed

The shape column is not decoration; it is your first line of defense against bugs. Every equation below must respect it: a length-$d$ gradient updates a length-$d$ weight vector, and a scalar gradient updates the scalar bias. When code breaks, a shape mismatch is one of the first things to check.
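To make the shape bookkeeping concrete, here is a minimal sketch that asserts each shape from the table and shows what happens when the transpose is forgotten. The sizes `n = 4`, `d = 2` and the random values are arbitrary choices for illustration, not part of the chapter's data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 2                               # arbitrary illustrative sizes
X = rng.standard_normal((n, d))           # feature matrix, shape (n, d)
w = rng.standard_normal(d)                # weights, shape (d,)
b = 0.5                                   # scalar bias
y = rng.standard_normal(n)                # targets, shape (n,)

y_hat = X @ w + b                         # predictions, shape (n,)
resid = y_hat - y                         # residuals, shape (n,)
grad_w = (2.0 / n) * (X.T @ resid)        # weight gradient, shape (d,)

assert y_hat.shape == (n,)
assert resid.shape == (n,)
assert grad_w.shape == w.shape == (d,)

# Forgetting the transpose contracts the wrong axis: (n, d) @ (n,) is invalid
# whenever n differs from d, so NumPy refuses to compute it.
try:
    X @ resid
    transpose_bug_detected = False
except ValueError:
    transpose_bug_detected = True
print(transpose_bug_detected)
```

Asserting shapes right where each array is created tends to surface a wrong axis at the line that caused it, rather than several steps later.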

A tiny numerical example

Before differentiating anything, let us make the model concrete with numbers we can check by hand. Take three examples, one feature each ( $n = 3$ , $d = 1$ ):

X = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}, \qquad \mathbf{y} = \begin{bmatrix} 2 \\ 4 \\ 6 \end{bmatrix}, \qquad \mathbf{w} = \begin{bmatrix} 1 \end{bmatrix}, \qquad b = 0.

The predictions are $\hat{\mathbf{y}} = X\mathbf{w} + b = (1, 2, 3)$ , so the residuals are $\hat{\mathbf{y}} - \mathbf{y} = (1{-}2,\ 2{-}4,\ 3{-}6) = (-1, -2, -3)$ . The loss is $L = \tfrac{1}{3}\bigl[(-1)^2 + (-2)^2 + (-3)^2\bigr] = \tfrac{1}{3}(1 + 4 + 9) = \tfrac{14}{3} \approx 4.667.$

The true line here is obviously $y = 2x$ , so our current slope $w = 1$ is too shallow. The gradient we derive next should therefore be negative in $w$ , telling descent to increase the slope. Hold onto the numbers $(-1, -2, -3)$ ; we will finish this example the moment we have the formula.
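The hand arithmetic above can be reproduced in a few lines of NumPy. This is only a sanity check on the three-point dataset; the variable names are illustrative choices.

```python
import numpy as np

X = np.array([[1.0], [2.0], [3.0]])   # (n, d) = (3, 1)
y = np.array([2.0, 4.0, 6.0])         # (3,)
w = np.array([1.0])                   # (1,)
b = 0.0

y_hat = X @ w + b                     # predictions: 1, 2, 3
resid = y_hat - y                     # residuals: -1, -2, -3
loss = np.mean(resid ** 2)            # mean squared error: 14/3

print(y_hat, resid, loss)
```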

Deriving the gradients

To train with gradient descent we need $\nabla_{\mathbf{w}} L$ (length $d$ ) and $\partial L / \partial b$ (a scalar). Both come straight from the chain rule of Chapter 15. The trick is to name the residual and differentiate through it.

The MSE gradient, step by step

Write the residual of example $i$ as $r_i = \hat{y}_i - y_i$ , so that $L = \frac{1}{n}\sum_{i=1}^{n} r_i^2.$

Step 1 — differentiate the loss w.r.t. a residual. Treating each $r_i$ as a variable, the outer function is a sum of squares: $\frac{\partial L}{\partial r_i} = \frac{1}{n}\cdot 2 r_i = \frac{2}{n} r_i.$

Step 2 — differentiate a residual w.r.t. a single weight. Since $r_i = \bigl(\sum_{k} X_{ik} w_k + b\bigr) - y_i$ , only the $k=j$ term depends on $w_j$ , so $\frac{\partial r_i}{\partial w_j} = X_{ij}.$

Step 3 — chain them and sum over examples. The weight $w_j$ influences $L$ through every residual, so we sum the chain-rule products over $i$ :

\frac{\partial L}{\partial w_j} = \sum_{i=1}^{n} \frac{\partial L}{\partial r_i}\,\frac{\partial r_i}{\partial w_j} = \sum_{i=1}^{n} \frac{2}{n} r_i\, X_{ij} = \frac{2}{n}\sum_{i=1}^{n} X_{ij}\, r_i.

Step 4 — recognize the sum as a matrix product. The sum $\sum_i X_{ij} r_i$ is exactly the $j$ -th entry of $X^\top \mathbf{r}$ (row $j$ of $X^\top$ dotted with $\mathbf{r}$ ). Collecting all $d$ components into one vector, $\boxed{\;\nabla_{\mathbf{w}} L = \frac{2}{n}\,X^\top(\hat{\mathbf{y}} - \mathbf{y})\;}$ which has shape $(d \times n)(n) = d$ — one number per weight, exactly as required.

Step 5 — the bias. The bias enters every residual with $\partial r_i / \partial b = 1$ , so

\frac{\partial L}{\partial b} = \sum_{i=1}^{n} \frac{\partial L}{\partial r_i}\,\frac{\partial r_i}{\partial b} = \sum_{i=1}^{n} \frac{2}{n} r_i \cdot 1 = \boxed{\;\frac{2}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)\;}

a single scalar — the average residual, times two. Intuitively: if predictions are too high on average, this is positive and descent lowers $b$ ; too low, it is negative and descent raises $b$ .
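One way to convince yourself that Step 4 is only a change of notation is to compute the gradient both ways. The sketch below uses a random matrix and a random stand-in residual vector (both are illustrative assumptions) and compares the explicit double sum of Step 3 with the matrix product of Step 4.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 3                           # arbitrary illustrative sizes
X = rng.standard_normal((n, d))
r = rng.standard_normal(n)            # stand-in residual vector

# Step 3: one partial derivative at a time, as an explicit sum over examples.
grad_loop = np.zeros(d)
for j in range(d):
    for i in range(n):
        grad_loop[j] += (2.0 / n) * X[i, j] * r[i]

# Step 4: the same numbers as a single matrix-vector product.
grad_matrix = (2.0 / n) * (X.T @ r)

# Step 5: the bias gradient is twice the mean residual.
grad_b = (2.0 / n) * np.sum(r)

assert np.allclose(grad_loop, grad_matrix)
assert np.isclose(grad_b, 2.0 * np.mean(r))
```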

Now finish the tiny example. With residuals $(-1, -2, -3)$ and $X = (1, 2, 3)$ :

\nabla_{\mathbf{w}} L = \frac{2}{3}\bigl(1{\cdot}({-}1) + 2{\cdot}({-}2) + 3{\cdot}({-}3)\bigr) = \frac{2}{3}({-}14) = -\frac{28}{3} \approx -9.33,

$\frac{\partial L}{\partial b} = \frac{2}{3}\bigl(({-}1) + ({-}2) + ({-}3)\bigr) = \frac{2}{3}({-}6) = -4.$ Both are negative, as predicted: one gradient-descent step with rate $\eta$ pushes $w \leftarrow 1 - \eta(-9.33)$ and $b \leftarrow 0 - \eta(-4)$ , steepening the line toward the true $y = 2x$ . The math agrees with the intuition.
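The same gradients, and the effect of a single update, can be checked numerically. The step size `eta = 0.05` below is an assumed value chosen only for this check; the text leaves $\eta$ unspecified.

```python
import numpy as np

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
w, b = np.array([1.0]), 0.0
n = X.shape[0]
eta = 0.05                                 # assumed step size for this check

resid = X @ w + b - y
grad_w = (2.0 / n) * (X.T @ resid)         # should equal [-28/3]
grad_b = (2.0 / n) * np.sum(resid)         # should equal -4
loss_before = np.mean(resid ** 2)

# One gradient-descent step on both parameters.
w_new = w - eta * grad_w
b_new = b - eta * grad_b
loss_after = np.mean((X @ w_new + b_new - y) ** 2)

print(grad_w, grad_b, w_new, b_new, loss_before, loss_after)
```

With this step size the slope moves from 1 toward 2 and the loss drops, which is the behavior the sign argument predicted.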

ML use case: this is the template for all supervised training

Look at what we assembled and notice how little of it is specific to a line:

1. A model $\hat{\mathbf{y}} = f_\theta(X)$ with parameters $\theta = (\mathbf{w}, b)$.
2. A scalar loss $L(\theta)$ measuring how wrong the predictions are.
3. The gradient $\nabla_\theta L$, obtained by the chain rule.
4. A descent loop: $\theta \leftarrow \theta - \eta\,\nabla_\theta L$, repeat.

Swap the model for a deep network, swap MSE for cross-entropy, and hand step 3 to an autodiff engine instead of pen and paper, and you have the core of how GPT-scale models are trained. The loop keeps the same shape (in practice with mini-batches and a more elaborate update rule than plain descent); only the pieces plugged into it grow. Linear regression is the smallest example where every part is visible and checkable by hand, which is why it is the right place to build the mental model once, correctly.
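To see how little of the loop depends on the model, here is a hedged sketch that separates the generic descent loop from the model-specific part. The names `train` and `make_linear_mse` and the dictionary-of-parameters layout are hypothetical choices for illustration, not an API from any library.

```python
import numpy as np

def train(params, loss_and_grads, lr, steps):
    """Generic descent loop: nothing in here knows the model is linear."""
    history = []
    for _ in range(steps):
        loss, grads = loss_and_grads(params)
        history.append(loss)
        # theta <- theta - eta * grad, applied to every parameter
        params = {k: params[k] - lr * grads[k] for k in params}
    return params, history

def make_linear_mse(X, y):
    """Model, loss, and hand-derived gradient for linear regression with MSE."""
    n = X.shape[0]
    def loss_and_grads(params):
        resid = X @ params["w"] + params["b"] - y
        loss = np.mean(resid ** 2)
        grads = {"w": (2.0 / n) * (X.T @ resid),
                 "b": (2.0 / n) * np.sum(resid)}
        return loss, grads
    return loss_and_grads

# The three-point dataset from the tiny example; the true line is y = 2x.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
params, history = train({"w": np.array([1.0]), "b": 0.0},
                        make_linear_mse(X, y), lr=0.1, steps=2000)
print(params, history[0], history[-1])
```

Replacing `make_linear_mse` with a function that returns the loss and gradients of a different model would leave `train` untouched, which is the point of the template.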

Interactive Lab: Gradient-Descent Visualizer

In the gradient-descent lab, watch the loss curve fall as parameters slide toward the minimum. Try a learning rate that is too large and see the loss diverge — a failure mode we will warn about below and that you will meet again in every model you train.
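If the lab is not available, the same divergence can be reproduced offline. The sketch below runs batch gradient descent on the three-point dataset with two step sizes; the values `0.1` and `0.5` are assumptions chosen for this particular dataset, and the helper name `run_gd` is hypothetical.

```python
import numpy as np

def run_gd(lr, steps=20):
    """Batch gradient descent on the three-point dataset; returns the loss per step."""
    X = np.array([[1.0], [2.0], [3.0]])
    y = np.array([2.0, 4.0, 6.0])
    w, b = np.array([1.0]), 0.0
    n = X.shape[0]
    losses = []
    for _ in range(steps):
        resid = X @ w + b - y
        losses.append(float(np.mean(resid ** 2)))
        w = w - lr * (2.0 / n) * (X.T @ resid)
        b = b - lr * (2.0 / n) * np.sum(resid)
    return losses

small = run_gd(lr=0.1)    # assumed "reasonable" rate for this dataset
large = run_gd(lr=0.5)    # assumed "too large" rate for this dataset
print(small[0], small[-1])   # loss shrinks
print(large[0], large[-1])   # loss grows by many orders of magnitude
```

The threshold between the two regimes depends on the data, so a rate that works here may still be too large on another dataset.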

NumPy: the whole thing from scratch

Here is linear regression end to end: predict, mse, and gradients exactly as derived; a finite-difference gradient check that must agree with the analytic gradient to better than 1e-5; a batch gradient-descent loop that records the loss; and the closed-form solution for comparison. Everything is seeded and shape-asserted. Run it:

linreg_from_scratch.py

import numpy as np

rng = np.random.default_rng(0)

# ---- synthetic data: y = X w_true + b_true + small noise ----
n, d = 200, 3
X = rng.standard_normal((n, d))                 # (n, d)
w_true = np.array([2.0, -1.0, 0.5])             # (d,)
b_true = 0.7
y = X @ w_true + b_true + 0.1 * rng.standard_normal(n)   # (n,)
assert X.shape == (n, d) and y.shape == (n,)

def predict(X, w, b):
  # X:(n,d)  w:(d,)  b:scalar  ->  yhat:(n,)
  return X @ w + b

def mse(yhat, y):
  return np.mean((yhat - y) ** 2)             # scalar

def gradients(X, y, w, b):
  m = y.shape[0]
  resid = predict(X, w, b) - y                # (n,)  = yhat - y
  dw = (2.0 / m) * (X.T @ resid)              # (d,)  = (2/n) X^T (yhat - y)
  db = (2.0 / m) * np.sum(resid)              # scalar= (2/n) sum(yhat - y)
  return dw, db

# ---- gradient check: analytic vs central finite differences ----
w0 = rng.standard_normal(d)
b0 = 0.3
dw, db = gradients(X, y, w0, b0)
eps = 1e-5
num_dw = np.zeros(d)
for j in range(d):
  wp = w0.copy(); wp[j] += eps
  wm = w0.copy(); wm[j] -= eps
  num_dw[j] = (mse(predict(X, wp, b0), y) - mse(predict(X, wm, b0), y)) / (2 * eps)
num_db = (mse(predict(X, w0, b0 + eps), y) - mse(predict(X, w0, b0 - eps), y)) / (2 * eps)
max_diff = max(np.max(np.abs(num_dw - dw)), abs(num_db - db))
assert max_diff < 1e-5, ("gradient check failed", max_diff)

# ---- batch gradient descent ----
w = np.zeros(d); b = 0.0
lr, epochs = 0.1, 500
loss_history = []
for _ in range(epochs):
  loss_history.append(mse(predict(X, w, b), y))
  gw, gb = gradients(X, y, w, b)
  w = w - lr * gw                             # (d,) update
  b = b - lr * gb                             # scalar update
loss_history.append(mse(predict(X, w, b), y))
assert loss_history[-1] < loss_history[0]      # loss went down

# ---- closed form: theta = (Z^T Z)^-1 Z^T y, Z has a bias column ----
Z = np.hstack([np.ones((n, 1)), X])            # (n, d+1)
theta = np.linalg.solve(Z.T @ Z, Z.T @ y)      # (d+1,)  solve, do not invert
b_cf, w_cf = theta[0], theta[1:]

# gradient descent should reach the same optimum as the closed form
assert np.allclose(w, w_cf, atol=1e-2)
assert np.isclose(b, b_cf, atol=1e-2)

print("ok")

Three things are worth pausing on. First, gradients is a direct transcription of the boxed formulas: X.T @ resid is $X^\top(\hat{\mathbf{y}} - \mathbf{y})$, and the 2.0 / m is the factor the derivation forced. Second, the gradient check is how you gain confidence in any hand-derived gradient: perturb each parameter, measure the loss change numerically, and demand it match the analytic value. Third, GD and the closed form land on the same answer because the MSE of a linear model is convex: every local minimum is a global one, and when the columns of $X$ (together with the bias column) are linearly independent, as they are here, that minimum is unique, so both roads lead to it.

The four bugs that bite everyone here

Transpose direction. The weight gradient is $X^\top(\hat{\mathbf{y}} - \mathbf{y})$, not $X(\hat{\mathbf{y}} - \mathbf{y})$. With $X$ shaped $n \times d$ you must contract the $n$ axis, so you need $X^\top$ (shape $d \times n$) times the length-$n$ residual, giving length $d$. Writing X @ resid raises a shape error when $n \neq d$; worse, when $n = d$ it runs without complaint and returns the wrong vector. Always check the result is length $d$, and rely on the gradient check to catch the square case.
Dropping the $2/n$ . Forgetting the factor does not change where the minimum is, but it rescales every gradient, so your effective learning rate is off by a constant. Your gradient check will catch it instantly — the analytic and numeric gradients will differ by exactly that factor.
Learning rate. Too small and training crawls; too large and the loss increases and diverges to inf/nan. If your loss curve climbs, halve $\eta$ before touching anything else.
Singular $X^\top X$. The closed form needs $X^\top X$ to be invertible. If features are perfectly collinear (one is a copy or a linear combination of others), $X^\top X$ is singular, and np.linalg.solve either raises LinAlgError or, when rounding hides the exact singularity, returns huge and unreliable coefficients. Gradient descent still runs, but the solution is no longer unique, which is a signal to remove the redundant feature or add regularization.
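The last bug is easy to reproduce. The sketch below builds a dataset whose second feature is an exact scaled copy of the first (an artificial setup assumed for illustration), detects the redundancy with `np.linalg.matrix_rank`, and shows that two different parameter vectors give identical predictions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.standard_normal(n)
X = np.column_stack([x1, 2.0 * x1])            # second feature is a scaled copy
y = 3.0 * x1 + 0.7 + 0.1 * rng.standard_normal(n)

Z = np.hstack([np.ones((n, 1)), X])            # bias column plus features, (n, 3)
rank = np.linalg.matrix_rank(Z)                # 2, not 3: one column is redundant
print("rank of Z:", rank, "columns:", Z.shape[1])

# SVD-based least squares still returns an answer: the minimum-norm solution
# among the infinitely many that fit equally well.
theta, _, lstsq_rank, _ = np.linalg.lstsq(Z, y, rcond=None)
fit_a = Z @ theta

# Shifting weight between the two collinear features leaves predictions unchanged.
theta_b = theta + np.array([0.0, 2.0, -1.0])   # adds 2*x1 - 1*(2*x1) = 0
fit_b = Z @ theta_b
print(np.allclose(fit_a, fit_b))
```

Checking the rank of the design matrix before calling a solver is a cheap guard; `np.linalg.lstsq` is used here only as one possible fallback, not as the chapter's method.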

When to use which: gradient descent vs. the normal equations

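The normal equations $\mathbf{w} = (X^\top X)^{-1} X^\top \mathbf{y}$ (with a column of ones added to $X$ if you want the bias included, as in the code above) give the exact optimum in one shot with no learning rate to tune, provided $X^\top X$ is invertible and small enough to factor. Forming $X^\top X$ costs roughly $O(nd^2)$ and solving the resulting $d \times d$ system costs roughly $O(d^3)$, and the straightforward implementation holds the whole dataset in memory, so this route is ideal when $d$ is modest (say, up to a few thousand features) and the data fits comfortably.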

Gradient descent is the workhorse everywhere else: it scales to enormous $n$ and $d$, works in mini-batches when data will not fit in memory, still runs when $X^\top X$ is near-singular, and, crucially, carries over unchanged once the model is nonlinear and no closed form exists. That is why neural networks are trained with gradient-based methods. Linear regression is the rare case where you can hold the exact answer in one hand and the iterative approximation in the other and watch them agree.
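The mini-batch variant mentioned above can be sketched in a few lines. This is an illustrative version under stated assumptions: the synthetic data reuses the `w_true` and `b_true` values from the listing earlier in the chapter, while the batch size, learning rate, and epoch count are untuned guesses.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 3
X = rng.standard_normal((n, d))
w_true, b_true = np.array([2.0, -1.0, 0.5]), 0.7
y = X @ w_true + b_true + 0.1 * rng.standard_normal(n)

w, b = np.zeros(d), 0.0
lr, epochs, batch_size = 0.05, 30, 32          # assumed settings, not tuned

for _ in range(epochs):
    order = rng.permutation(n)                 # reshuffle once per epoch
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]                # only this slice is needed per step
        resid = Xb @ w + b - yb
        w = w - lr * (2.0 / len(idx)) * (Xb.T @ resid)
        b = b - lr * (2.0 / len(idx)) * np.sum(resid)

# Closed form on the same data, with a bias column prepended, for comparison.
Z = np.hstack([np.ones((n, 1)), X])
theta = np.linalg.solve(Z.T @ Z, Z.T @ y)
print(w, b)
print(theta[1:], theta[0])
```

Because each step sees only a noisy sample of the gradient, the mini-batch result hovers near the closed-form optimum rather than matching it to many digits; shrinking the learning rate over time is one common way to tighten it.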

Research Paper Equation Practice

Mean squared error — full depth

The regression loss, revisited now that you can derive its gradient. It, or a close cousin, defines what almost every regression model is optimizing.

L = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2

Work through these steps:

Identify every symbol.
State the type of every object (scalar, vector, matrix, index, set, function).
State the dimensions / shapes.
Rewrite the equation in plain English.
Expand it for a tiny concrete example.
Identify the assumptions.
Convert it to pseudocode.
Implement it in NumPy.
Explain its machine-learning purpose.

Research Paper Equation Practice

The weight gradient

The gradient that drives every training step for a linear model. Written with the prediction expanded as X w + b so the residual is explicit.

\nabla_{\mathbf{w}} L = \frac{2}{n}\,X^\top\left(X\mathbf{w} + b - \mathbf{y}\right)

Work through these steps:

Identify every symbol.
State the type of every object (scalar, vector, matrix, index, set, function).
State the dimensions / shapes.
Rewrite the equation in plain English.
Expand it for a tiny concrete example.
Identify the assumptions.
Convert it to pseudocode.
Implement it in NumPy.
Explain its machine-learning purpose.

Summary

The linear model is $\hat{\mathbf{y}} = X\mathbf{w} + b$ , with $X$ of shape $n \times d$ , $\mathbf{w}$ of length $d$ , $b$ a scalar broadcast over the length- $n$ prediction — every prediction is a neuron's dot product plus a bias.
The MSE loss $L = \frac{1}{n}\sum_i(\hat{y}_i - y_i)^2$ is a single scalar that is smooth, convex, and zero only on a perfect fit.
The chain rule gives the gradients $\nabla_{\mathbf{w}} L = \frac{2}{n}X^\top(\hat{\mathbf{y}} - \mathbf{y})$ (length $d$ ) and $\frac{\partial L}{\partial b} = \frac{2}{n}\sum_i(\hat{y}_i - y_i)$ (scalar). Verify any gradient with a finite-difference check to < 1e-5.
Batch gradient descent repeats $\theta \leftarrow \theta - \eta\nabla_\theta L$ ; this model-loss-gradient-loop template is how all supervised models train.
The normal equations $\mathbf{w} = (X^\top X)^{-1}X^\top\mathbf{y}$ give the exact optimum when $d$ is small and $X^\top X$ is invertible; gradient descent wins at scale and is the standard choice for nonlinear models, where no closed form exists.

Active recall

Answer from memory before checking the lesson:

Write $\hat{\mathbf{y}}$ in terms of $X$ , $\mathbf{w}$ , $b$ , and give the shape of each object and of $\hat{\mathbf{y}}$ .
State the MSE loss as a sum, and say what shape it has.
Derive $\partial L / \partial w_j$ from $L = \frac{1}{n}\sum_i r_i^2$ using the chain rule. Why does the answer contain $X^\top$ rather than $X$ ?
What shape is $\nabla_{\mathbf{w}} L$ , and how do you confirm your code computes it correctly without pen and paper?
Give one situation where you would prefer the normal equations and one where you must use gradient descent.

You now have every ingredient of the Final Project: a model, a loss, hand-derived gradients, a training loop, and a gradient check. The project asks you to assemble exactly these pieces on real data — this chapter is your reference implementation.

Exercises

Level A: Recall & basic calculation

Level A · Hand calculation · ch20-A1

Predict for one example

A linear model has weights $\mathbf{w} = (1, -1)$ and bias $b = 0.5$ . For the feature vector $\mathbf{x} = (2, 3)$ , compute the prediction $\hat{y} = \mathbf{w}^\top\mathbf{x} + b$ .

Level A · Hand calculation · ch20-A2

Compute the MSE

Predictions are $\hat{\mathbf{y}} = (3, 5, 4)$ and targets are $\mathbf{y} = (2, 5, 6)$ . Compute the mean squared error $L = \frac{1}{n}\sum_i (\hat{y}_i - y_i)^2$ .

Level A · Hand calculation · ch20-A3

The residual vector

With $\hat{\mathbf{y}} = (2, 4, 6)$ and $\mathbf{y} = (1, 5, 5)$ , compute the residual vector $\hat{\mathbf{y}} - \mathbf{y}$ . Enter as a, b, c.

Level A · Hand calculation · ch20-A4

Bias gradient by hand

For $n = 3$ examples the residuals are $\hat{\mathbf{y}} - \mathbf{y} = (-1, -2, -3)$ . Compute the bias gradient $\frac{\partial L}{\partial b} = \frac{2}{n}\sum_i (\hat{y}_i - y_i)$ .

Level A · Hand calculation · ch20-A5

Weight gradient by hand (one feature)

A single-feature dataset has $X = (1, 2, 3)^\top$ (so $d = 1$ , $n = 3$ ) and residuals $\hat{\mathbf{y}} - \mathbf{y} = (-1, -2, -3)$ . Compute the weight gradient $\nabla_{\mathbf{w}} L = \frac{2}{n} X^\top(\hat{\mathbf{y}} - \mathbf{y})$ .

Level A · Equation interpretation · ch20-A6

When is the MSE zero?

The mean squared error $L = \frac{1}{n}\sum_i(\hat{y}_i - y_i)^2$ equals exactly $0$ in which situation?

Level B: Conceptual understanding

Level B · Shape reasoning · ch20-B1

Shape of the weight gradient

The feature matrix $X$ has shape $100 \times 5$ ( $n = 100$ examples, $d = 5$ features). What is the shape of $\nabla_{\mathbf{w}} L = \frac{2}{n} X^\top(\hat{\mathbf{y}} - \mathbf{y})$ ?

Level B · Shape reasoning · ch20-B2

Why the transpose?

In the weight gradient we multiply the length- $n$ residual by $X^\top$ , not by $X$ (where $X$ is $n \times d$ ). Which multiplication produces a length- $d$ result?

Level B · ML application · ch20-B3

What a large learning rate does

During gradient descent you notice the loss increasing each epoch, eventually reaching inf. In one or two sentences, explain the most likely cause and the first fix to try.

Level B · Equation interpretation · ch20-B4

Reading the sign of the bias gradient

At some point during training $\frac{\partial L}{\partial b} = \frac{2}{n}\sum_i(\hat{y}_i - y_i)$ is positive. What does this say about the current predictions, and which way will the update $b \leftarrow b - \eta\,\frac{\partial L}{\partial b}$ move $b$ ?

Level B · ML application · ch20-B5

Why GD and the closed form agree

For linear regression, well-tuned gradient descent converges to the same weights as the normal equations $\mathbf{w} = (X^\top X)^{-1}X^\top\mathbf{y}$ . What property of the MSE loss guarantees this, and why would the guarantee fail for a general neural network?

Level C: Derivation & implementation

Level C · NumPy implementation · ch20-C1

Implement predict, mse, and gradients

Implement predict(X, w, b), mse(yhat, y), and gradients(X, y, w, b) returning (dw, db) exactly per the formulas. Then, on a seeded random dataset, verify with assert that dw has shape (d,), that db is a scalar, and print ok.

Level C · Hand calculation · ch20-C2

One gradient-descent step by hand

Start from $w = 1$ , $b = 0$ on the dataset $X = (1,2,3)^\top$ , $\mathbf{y} = (2,4,6)$ . The gradients there are $\nabla_{\mathbf{w}} L = -\frac{28}{3}$ and $\frac{\partial L}{\partial b} = -4$ . With learning rate $\eta = 0.1$ , compute the updated weight $w_{\text{new}} = w - \eta\,\nabla_{\mathbf{w}} L$ .

Level C · Derivation · ch20-C3

Derive the bias gradient

Starting from $L = \frac{1}{n}\sum_i r_i^2$ with $r_i = (X\mathbf{w} + b)_i - y_i$ , derive $\frac{\partial L}{\partial b} = \frac{2}{n}\sum_i(\hat{y}_i - y_i)$ using the chain rule. State $\frac{\partial r_i}{\partial b}$ explicitly.

Level C · NumPy implementation · ch20-C4

Finite-difference gradient check

You have an analytic gradients(X, y, w, b). Write a central-difference gradient check that perturbs each parameter by $\varepsilon = 10^{-5}$ , estimates the gradient numerically as $\frac{L(\theta+\varepsilon) - L(\theta-\varepsilon)}{2\varepsilon}$ , and asserts the max absolute difference from the analytic gradient is < 1e-5. Print ok.

Level D: Research-thinking challenge

Level D · Paper-reading practice · ch20-D1

Normal equations vs. gradient descent

You must fit a least-squares model in two scenarios: (a) $n = 10^4$ examples, $d = 50$ features; (b) $n = 10^8$ examples streamed from disk, $d = 10^5$ sparse features. For each, argue which of the normal equations $\mathbf{w} = (X^\top X)^{-1}X^\top\mathbf{y}$ or gradient descent you would use, citing the cost that dominates. Then name one situation where the normal equations are not merely slow but impossible.

Level D · Derivation · ch20-D2

Deriving the ridge-regression gradient

Ridge regression minimizes $L_{\text{ridge}} = \frac{1}{n}\sum_i(\hat{y}_i - y_i)^2 + \lambda\lVert\mathbf{w}\rVert^2$ with $\lambda > 0$ . (i) Derive $\nabla_{\mathbf{w}} L_{\text{ridge}}$ . (ii) Show that setting it to zero gives the modified normal equations $(X^\top X + n\lambda I)\mathbf{w} = X^\top\mathbf{y}$ (ignore the bias / assume centered data). (iii) Explain in one sentence why this fixes a singular $X^\top X$ .