Linear Regression from Scratch
Everything so far, assembled into a learning algorithm
Prerequisites
Learning objectives
- Define the linear model and MSE loss with explicit shapes
- Derive the gradients of MSE with respect to w and b
- Implement and run batch gradient descent
- Verify gradients with finite differences and compare to closed form
Everything so far, assembled into one model
For nineteen chapters we have built pieces. A vector is a list of numbers; a matrix stacks vectors into a table; a matrix–vector product applies a linear map; a dot product is a weighted sum. Then calculus gave us the derivative as a rate of change, the chain rule for composing rates, the gradient as the vector of partials, and gradient descent as the loop that walks downhill on a loss surface. Each was interesting on its own. This chapter is where they click together into the first real learning algorithm: linear regression, trained from scratch.
Nothing new is required — that is the point. If you can multiply , measure error with a squared distance, differentiate that error, and step against the gradient, you can train a model. And the shape of what we build here — a prediction function, a scalar loss, its gradient, and a descent loop — is the exact template behind every supervised model you will ever train, up to and including deep networks. Master this one and the rest are variations on the theme.
Intuition: a straight line that fits a cloud of points
Imagine a scatter of points, each an input paired with an output : square footage vs. price, dose vs. response, hours studied vs. score. Linear regression draws the straight line (in higher dimensions, the flat hyperplane) that passes as close as possible to all of them at once. "As close as possible" needs a definition, and we will pick the mean squared error — the average of the squared vertical gaps between the line and the points.
Two knobs control the line: its slope(s) (how steeply the prediction rises per unit of each feature) and its intercept (where it crosses when all features are zero). Training means turning those two knobs until the average squared gap is as small as it can be. Gradient descent is how we turn them: the gradient tells us which way each knob must move to shrink the error, and we take small steps until the error stops dropping.
Drag the points, or the line, and watch the error change. Notice that the best-fit line is a balance: pushing it to hug one point better pulls it away from others, and the minimum-error line is the compromise where no small nudge helps. That balance point is exactly what the math below computes in closed form.
The formal model and loss
We have training examples, each with features. Stack them: row of the matrix is the feature vector of example .
Each prediction is just a neuron from Chapter 7: a dot product of the weights with one row of , plus a bias. The model stacks of them, one per example, and computes them all at once with a single matrix–vector product.
The squaring does two jobs: it makes every error positive (so over- and under-shooting both count as error and cannot cancel), and it punishes large misses far more than small ones. It is also smooth, which is what lets us differentiate it.
| Symbol | Meaning | Type | Shape | Role |
|---|---|---|---|---|
| Feature matrix, one example per row | matrix | n×d | fixed | |
| Weight vector (one weight per feature) | vector | d | learned | |
| Bias / intercept (scalar) | scalar | 1 | learned | |
| True targets | vector | n | fixed | |
| Predictions, X\mathbf{w}+b | vector | n | computed | |
| Residual vector | vector | n | computed | |
| Mean squared error (loss) | scalar | 1 | computed | |
| Gradient of L w.r.t. weights | vector | d | computed | |
| Gradient of L w.r.t. bias | scalar | 1 | computed | |
| Number of training examples | integer | 1 | fixed | |
| Number of features | integer | 1 | fixed | |
| Learning rate (step size) | scalar | 1 | fixed |
The shape column is not decoration — it is your first line of defense against
bugs. Every equation below must respect it: a length- gradient updates a
length- weight vector, and a scalar gradient updates the scalar bias. When code
breaks, a shape mismatch is almost always the reason.
A tiny numerical example
Before differentiating anything, let us make the model concrete with numbers we can check by hand. Take three examples, one feature each (, ):
The predictions are , so the residuals are . The loss is
The true line here is obviously , so our current slope is too shallow. The gradient we derive next should therefore be negative in , telling descent to increase the slope. Hold onto the numbers ; we will finish this example the moment we have the formula.
Deriving the gradients
To train with gradient descent we need (length ) and (a scalar). Both come straight from the chain rule of Chapter 15. The trick is to name the residual and differentiate through it.
Now finish the tiny example. With residuals and :
Both are negative, as predicted: one gradient-descent step with rate pushes and , steepening the line toward the true . The math agrees with the intuition.
ML use case: this is the template for all supervised training
Look at what we assembled and notice how little of it is specific to a line:
- A model with parameters .
- A scalar loss measuring how wrong the predictions are.
- The gradient , obtained by the chain rule.
- A descent loop: , repeat.
Swap the model for a deep network, swap MSE for cross-entropy, and hand step 3 to an autodiff engine instead of pen and paper — and you have exactly how GPT-scale models are trained. The loop is identical; only the pieces plugged into it grow. Linear regression is the smallest example where every part is visible and checkable by hand, which is why it is the right place to build the mental model once, correctly.
In the gradient-descent lab, watch the loss curve fall as parameters slide toward the minimum. Try a learning rate that is too large and see the loss diverge — a failure mode we will warn about below and that you will meet again in every model you train.
NumPy: the whole thing from scratch
Here is linear regression end to end: predict, mse, and gradients exactly as
derived; a finite-difference gradient check that must agree with the analytic
gradient to better than 1e-5; a batch gradient-descent loop that records the
loss; and the closed-form solution for comparison. Everything is seeded and
shape-asserted. Run it:
Three things are worth pausing on. First, gradients is a direct transcription of
the boxed formulas — X.T @ resid is ,
and the 2.0 / m is the factor the derivation forced. Second, the gradient check
is how you gain confidence in any hand-derived gradient: perturb each parameter,
measure the loss change numerically, and demand it match the analytic value.
Third, GD and the closed form land on the same answer because MSE is convex —
it has a single global minimum, and both roads lead there.
When to use which: gradient descent vs. the normal equations
The normal equations give the exact optimum in one shot with no learning rate to tune — provided is invertible and small enough to factor. That inversion costs roughly and needs the whole dataset in memory, so it is ideal when is modest (say, up to a few thousand features) and the data fits comfortably.
Gradient descent is the workhorse everywhere else: it scales to enormous and , works in mini-batches when data will not fit in memory, tolerates near-singular , and — crucially — is the only option once the model is nonlinear and no closed form exists. That is why every neural network is trained by descent. Linear regression is the rare case where you can hold the exact answer in one hand and the iterative approximation in the other and watch them agree.
Mean squared error — full depth
The regression loss, revisited now that you can derive its gradient. It, or a close cousin, defines what almost every regression model is optimizing.
Work through these steps:
- Identify every symbol.
- State the type of every object (scalar, vector, matrix, index, set, function).
- State the dimensions / shapes.
- Rewrite the equation in plain English.
- Expand it for a tiny concrete example.
- Identify the assumptions.
- Convert it to pseudocode.
- Implement it in NumPy.
- Explain its machine-learning purpose.
The weight gradient
The gradient that drives every training step for a linear model. Written with the prediction expanded as X w + b so the residual is explicit.
Work through these steps:
- Identify every symbol.
- State the type of every object (scalar, vector, matrix, index, set, function).
- State the dimensions / shapes.
- Rewrite the equation in plain English.
- Expand it for a tiny concrete example.
- Identify the assumptions.
- Convert it to pseudocode.
- Implement it in NumPy.
- Explain its machine-learning purpose.
Summary
- The linear model is , with of shape , of length , a scalar broadcast over the length- prediction — every prediction is a neuron's dot product plus a bias.
- The MSE loss is a single scalar that is smooth, convex, and zero only on a perfect fit.
- The chain rule gives the gradients
(length
) and
(scalar). Verify any gradient with a finite-difference check to
< 1e-5. - Batch gradient descent repeats ; this model-loss-gradient-loop template is how all supervised models train.
- The normal equations give the exact optimum when is small and is invertible; gradient descent wins at scale and is mandatory for nonlinear models.
Active recall
Answer from memory before checking the lesson:
- Write in terms of , , , and give the shape of each object and of .
- State the MSE loss as a sum, and say what shape it has.
- Derive from using the chain rule. Why does the answer contain rather than ?
- What shape is , and how do you confirm your code computes it correctly without pen and paper?
- Give one situation where you would prefer the normal equations and one where you must use gradient descent.
You now have every ingredient of the Final Project: a model, a loss, hand-derived gradients, a training loop, and a gradient check. The project asks you to assemble exactly these pieces on real data — this chapter is your reference implementation.
Exercises
Level ARecall & basic calculation
Predict for one example
A linear model has weights and bias . For the feature vector , compute the prediction .
Compute the MSE
Predictions are and targets are . Compute the mean squared error .
The residual vector
With and , compute the residual vector . Enter as a, b, c.
Bias gradient by hand
For examples the residuals are . Compute the bias gradient .
Weight gradient by hand (one feature)
A single-feature dataset has (so , ) and residuals . Compute the weight gradient .
When is the MSE zero?
The mean squared error equals exactly in which situation?
Level BConceptual understanding
Shape of the weight gradient
The feature matrix has shape ( examples, features). What is the shape of ?
Why the transpose?
In the weight gradient we multiply the length- residual by , not by (where is ). Which multiplication produces a length- result?
What a large learning rate does
During gradient descent you notice the loss increasing each epoch, eventually reaching inf. In one or two sentences, explain the most likely cause and the first fix to try.
Reading the sign of the bias gradient
At some point during training is positive. What does this say about the current predictions, and which way will the update move ?
Why GD and the closed form agree
For linear regression, well-tuned gradient descent converges to the same weights as the normal equations . What property of the MSE loss guarantees this, and why would the guarantee fail for a general neural network?
Level CDerivation & implementation
Implement predict, mse, and gradients
Implement predict(X, w, b), mse(yhat, y), and gradients(X, y, w, b) returning (dw, db) exactly per the formulas. Then, on a seeded random dataset, verify with assert that dw has shape (d,), that db is a scalar, and print ok.
One gradient-descent step by hand
Start from , on the dataset , . The gradients there are and . With learning rate , compute the updated weight .
Derive the bias gradient
Starting from with , derive using the chain rule. State explicitly.
Finite-difference gradient check
You have an analytic gradients(X, y, w, b). Write a central-difference gradient check that perturbs each parameter by , estimates the gradient numerically as , and asserts the max absolute difference from the analytic gradient is < 1e-5. Print ok.
Level DResearch-thinking challenge
Normal equations vs. gradient descent
You must fit a least-squares model in two scenarios: (a) examples, features; (b) examples streamed from disk, sparse features. For each, argue which of the normal equations or gradient descent you would use, citing the cost that dominates. Then name one situation where the normal equations are not merely slow but impossible.
Deriving the ridge-regression gradient
Ridge regression minimizes with . (i) Derive . (ii) Show that setting it to zero gives the modified normal equations (ignore the bias / assume centered data). (iii) Explain in one sentence why this fixes a singular .