Part 5 · NumPy LaboratoryChapter 2095 min

Linear Regression from Scratch

Everything so far, assembled into a learning algorithm

Learning objectives

  • Define the linear model and MSE loss with explicit shapes
  • Derive the gradients of MSE with respect to w and b
  • Implement and run batch gradient descent
  • Verify gradients with finite differences and compare to closed form

Everything so far, assembled into one model

For nineteen chapters we have built pieces. A vector is a list of numbers; a matrix stacks vectors into a table; a matrix–vector product applies a linear map; a dot product is a weighted sum. Then calculus gave us the derivative as a rate of change, the chain rule for composing rates, the gradient as the vector of partials, and gradient descent as the loop that walks downhill on a loss surface. Each was interesting on its own. This chapter is where they click together into the first real learning algorithm: linear regression, trained from scratch.

Nothing new is required — that is the point. If you can multiply XwX\mathbf{w}, measure error with a squared distance, differentiate that error, and step against the gradient, you can train a model. And the shape of what we build here — a prediction function, a scalar loss, its gradient, and a descent loop — is the exact template behind every supervised model you will ever train, up to and including deep networks. Master this one and the rest are variations on the theme.

Intuition: a straight line that fits a cloud of points

Imagine a scatter of points, each an input xx paired with an output yy: square footage vs. price, dose vs. response, hours studied vs. score. Linear regression draws the straight line (in higher dimensions, the flat hyperplane) that passes as close as possible to all of them at once. "As close as possible" needs a definition, and we will pick the mean squared error — the average of the squared vertical gaps between the line and the points.

Two knobs control the line: its slope(s) w\mathbf{w} (how steeply the prediction rises per unit of each feature) and its intercept bb (where it crosses when all features are zero). Training means turning those two knobs until the average squared gap is as small as it can be. Gradient descent is how we turn them: the gradient tells us which way each knob must move to shrink the error, and we take small steps until the error stops dropping.

Interactive LabLinear-Regression Laboratory
Loading interactive lab…

Drag the points, or the line, and watch the error change. Notice that the best-fit line is a balance: pushing it to hug one point better pulls it away from others, and the minimum-error line is the compromise where no small nudge helps. That balance point is exactly what the math below computes in closed form.

The formal model and loss

We have nn training examples, each with dd features. Stack them: row ii of the matrix XX is the feature vector of example ii.

Each prediction y^i\hat{y}_i is just a neuron from Chapter 7: a dot product of the weights with one row of XX, plus a bias. The model stacks nn of them, one per example, and computes them all at once with a single matrix–vector product.

The squaring does two jobs: it makes every error positive (so over- and under-shooting both count as error and cannot cancel), and it punishes large misses far more than small ones. It is also smooth, which is what lets us differentiate it.

The shape column is not decoration — it is your first line of defense against bugs. Every equation below must respect it: a length-dd gradient updates a length-dd weight vector, and a scalar gradient updates the scalar bias. When code breaks, a shape mismatch is almost always the reason.

A tiny numerical example

Before differentiating anything, let us make the model concrete with numbers we can check by hand. Take three examples, one feature each (n=3n = 3, d=1d = 1):

X=[123],y=[246],w=[1],b=0.X = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}, \qquad \mathbf{y} = \begin{bmatrix} 2 \\ 4 \\ 6 \end{bmatrix}, \qquad \mathbf{w} = \begin{bmatrix} 1 \end{bmatrix}, \qquad b = 0.

The predictions are y^=Xw+b=(1,2,3)\hat{\mathbf{y}} = X\mathbf{w} + b = (1, 2, 3), so the residuals are y^y=(12, 24, 36)=(1,2,3)\hat{\mathbf{y}} - \mathbf{y} = (1{-}2,\ 2{-}4,\ 3{-}6) = (-1, -2, -3). The loss is L=13[(1)2+(2)2+(3)2]=13(1+4+9)=1434.667.L = \tfrac{1}{3}\bigl[(-1)^2 + (-2)^2 + (-3)^2\bigr] = \tfrac{1}{3}(1 + 4 + 9) = \tfrac{14}{3} \approx 4.667.

The true line here is obviously y=2xy = 2x, so our current slope w=1w = 1 is too shallow. The gradient we derive next should therefore be negative in ww, telling descent to increase the slope. Hold onto the numbers (1,2,3)(-1, -2, -3); we will finish this example the moment we have the formula.

Deriving the gradients

To train with gradient descent we need wL\nabla_{\mathbf{w}} L (length dd) and L/b\partial L / \partial b (a scalar). Both come straight from the chain rule of Chapter 15. The trick is to name the residual and differentiate through it.

Now finish the tiny example. With residuals (1,2,3)(-1, -2, -3) and X=(1,2,3)X = (1, 2, 3):

wL=23(1(1)+2(2)+3(3))=23(14)=2839.33,\nabla_{\mathbf{w}} L = \frac{2}{3}\bigl(1{\cdot}({-}1) + 2{\cdot}({-}2) + 3{\cdot}({-}3)\bigr) = \frac{2}{3}({-}14) = -\frac{28}{3} \approx -9.33,

Lb=23((1)+(2)+(3))=23(6)=4.\frac{\partial L}{\partial b} = \frac{2}{3}\bigl(({-}1) + ({-}2) + ({-}3)\bigr) = \frac{2}{3}({-}6) = -4. Both are negative, as predicted: one gradient-descent step with rate η\eta pushes w1η(9.33)w \leftarrow 1 - \eta(-9.33) and b0η(4)b \leftarrow 0 - \eta(-4), steepening the line toward the true y=2xy = 2x. The math agrees with the intuition.

ML use case: this is the template for all supervised training

Look at what we assembled and notice how little of it is specific to a line:

  1. A model y^=fθ(X)\hat{\mathbf{y}} = f_\theta(X) with parameters θ=(w,b)\theta = (\mathbf{w}, b).
  2. A scalar loss L(θ)L(\theta) measuring how wrong the predictions are.
  3. The gradient θL\nabla_\theta L, obtained by the chain rule.
  4. A descent loop: θθηθL\theta \leftarrow \theta - \eta\,\nabla_\theta L, repeat.

Swap the model for a deep network, swap MSE for cross-entropy, and hand step 3 to an autodiff engine instead of pen and paper — and you have exactly how GPT-scale models are trained. The loop is identical; only the pieces plugged into it grow. Linear regression is the smallest example where every part is visible and checkable by hand, which is why it is the right place to build the mental model once, correctly.

Interactive LabGradient-Descent Visualizer
Loading interactive lab…

In the gradient-descent lab, watch the loss curve fall as parameters slide toward the minimum. Try a learning rate that is too large and see the loss diverge — a failure mode we will warn about below and that you will meet again in every model you train.

NumPy: the whole thing from scratch

Here is linear regression end to end: predict, mse, and gradients exactly as derived; a finite-difference gradient check that must agree with the analytic gradient to better than 1e-5; a batch gradient-descent loop that records the loss; and the closed-form solution for comparison. Everything is seeded and shape-asserted. Run it:

linreg_from_scratch.py

Three things are worth pausing on. First, gradients is a direct transcription of the boxed formulas — X.T @ resid is X(y^y)X^\top(\hat{\mathbf{y}} - \mathbf{y}), and the 2.0 / m is the factor the derivation forced. Second, the gradient check is how you gain confidence in any hand-derived gradient: perturb each parameter, measure the loss change numerically, and demand it match the analytic value. Third, GD and the closed form land on the same answer because MSE is convex — it has a single global minimum, and both roads lead there.

When to use which: gradient descent vs. the normal equations

The normal equations w=(XX)1Xy\mathbf{w} = (X^\top X)^{-1} X^\top \mathbf{y} give the exact optimum in one shot with no learning rate to tune — provided XXX^\top X is invertible and small enough to factor. That inversion costs roughly O(d3)O(d^3) and needs the whole dataset in memory, so it is ideal when dd is modest (say, up to a few thousand features) and the data fits comfortably.

Gradient descent is the workhorse everywhere else: it scales to enormous nn and dd, works in mini-batches when data will not fit in memory, tolerates near-singular XXX^\top X, and — crucially — is the only option once the model is nonlinear and no closed form exists. That is why every neural network is trained by descent. Linear regression is the rare case where you can hold the exact answer in one hand and the iterative approximation in the other and watch them agree.

Summary

  • The linear model is y^=Xw+b\hat{\mathbf{y}} = X\mathbf{w} + b, with XX of shape n×dn \times d, w\mathbf{w} of length dd, bb a scalar broadcast over the length-nn prediction — every prediction is a neuron's dot product plus a bias.
  • The MSE loss L=1ni(y^iyi)2L = \frac{1}{n}\sum_i(\hat{y}_i - y_i)^2 is a single scalar that is smooth, convex, and zero only on a perfect fit.
  • The chain rule gives the gradients wL=2nX(y^y)\nabla_{\mathbf{w}} L = \frac{2}{n}X^\top(\hat{\mathbf{y}} - \mathbf{y}) (length dd) and Lb=2ni(y^iyi)\frac{\partial L}{\partial b} = \frac{2}{n}\sum_i(\hat{y}_i - y_i) (scalar). Verify any gradient with a finite-difference check to < 1e-5.
  • Batch gradient descent repeats θθηθL\theta \leftarrow \theta - \eta\nabla_\theta L; this model-loss-gradient-loop template is how all supervised models train.
  • The normal equations w=(XX)1Xy\mathbf{w} = (X^\top X)^{-1}X^\top\mathbf{y} give the exact optimum when dd is small and XXX^\top X is invertible; gradient descent wins at scale and is mandatory for nonlinear models.

Active recall

Answer from memory before checking the lesson:

  1. Write y^\hat{\mathbf{y}} in terms of XX, w\mathbf{w}, bb, and give the shape of each object and of y^\hat{\mathbf{y}}.
  2. State the MSE loss as a sum, and say what shape it has.
  3. Derive L/wj\partial L / \partial w_j from L=1niri2L = \frac{1}{n}\sum_i r_i^2 using the chain rule. Why does the answer contain XX^\top rather than XX?
  4. What shape is wL\nabla_{\mathbf{w}} L, and how do you confirm your code computes it correctly without pen and paper?
  5. Give one situation where you would prefer the normal equations and one where you must use gradient descent.

You now have every ingredient of the Final Project: a model, a loss, hand-derived gradients, a training loop, and a gradient check. The project asks you to assemble exactly these pieces on real data — this chapter is your reference implementation.

Exercises

Level ARecall & basic calculation

Level BConceptual understanding

Level CDerivation & implementation

Level DResearch-thinking challenge