Part 4 · Calculus · Chapter 17 · 75 min

Introductory Partial Derivatives

Slopes in many directions and the gradient vector

Prerequisites

Learning objectives

  • Compute partial derivatives by holding variables fixed
  • Assemble partials into a gradient vector ∇f
  • Read a contour plot and the gradient's geometric meaning
  • Reason about the gradient of a loss with respect to parameters

Why a single derivative is not enough

Back in the one-variable world, the derivative $f'(x)$ answered one clean question: if I nudge the input, how does the output respond? That number is the engine of optimization: it tells you which way is downhill. But every model you will ever train has more than one knob. A tiny logistic-regression model might have a dozen parameters; a large transformer has billions. The loss $L$ is a function not of one number but of a whole vector of parameters $\boldsymbol\theta$.

So we need a derivative that works when there are many inputs at once. The idea is almost embarrassingly simple: vary one input at a time, hold the rest still. That single move — differentiate with respect to one variable while freezing the others — is the partial derivative, and collecting all of them into one vector gives the gradient, the object that training actually computes and follows. This chapter is the bridge from "the slope of a curve" to "the direction a billion-parameter model should step."

Intuition: slope along one axis, and a compass pointing uphill

Picture a surface $z = f(x, y)$, a landscape of hills and valleys sitting above the $xy$-plane. You are standing at one point on it. There are now two independent ways to walk:

  • Walk east (increase $x$, keep $y$ fixed). The steepness you feel is the partial derivative $\partial f / \partial x$: the ordinary one-variable slope of the curve you get by slicing the surface along the $x$-direction.
  • Walk north (increase $y$, keep $x$ fixed). That steepness is $\partial f / \partial y$.

Each partial is just a familiar single-variable derivative in disguise: pretend every other variable is a constant and differentiate as usual. Holding $y$ fixed turns $f(x, y)$ into a function of $x$ alone, and you already know how to differentiate that.
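
The "freeze the others" recipe can be sketched in a few lines of Python. This is an illustrative sketch only: the function $f(x, y) = x^2 + xy + y^2$ is the one quoted in this chapter's summary, while the helper names `slice_along_x`, `slice_along_y`, and `slope`, the point $(1, 2)$, and the step `h` are assumptions chosen for the example.

```python
def f(x, y):
    """Example surface from this chapter: f(x, y) = x^2 + x*y + y^2."""
    return x**2 + x * y + y**2


def slice_along_x(y_fixed):
    """Freeze y; what remains is an ordinary function of x alone."""
    return lambda x: f(x, y_fixed)


def slice_along_y(x_fixed):
    """Freeze x; what remains is an ordinary function of y alone."""
    return lambda y: f(x_fixed, y)


def slope(g, t, h=1e-5):
    """Central-difference estimate of the one-variable derivative g'(t)."""
    return (g(t + h) - g(t - h)) / (2 * h)


x0, y0 = 1.0, 2.0  # illustrative point (an assumption of this sketch)
df_dx = slope(slice_along_x(y0), x0)  # slope felt when walking east
df_dy = slope(slice_along_y(x0), y0)  # slope felt when walking north

print(f"df/dx approx {df_dx:.4f}, df/dy approx {df_dy:.4f}")
```

Each slice is a plain one-variable function, so nothing beyond single-variable calculus is needed to find its slope.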

Now the payoff. You have two slopes, east and north. Bundle them into a vector $\nabla f = (\partial f/\partial x,\ \partial f/\partial y)$. Remarkably, this one vector points in the direction of steepest ascent: the single compass bearing that climbs the hill fastest from where you stand. Its length says how steep that fastest climb is. Turn it around, $-\nabla f$, and you have the direction of steepest descent: the way training walks to reduce the loss.
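
The steepest-ascent claim can be checked by brute force: try many compass bearings, measure the slope along each, and compare the winner with the gradient. The sketch below assumes the chapter's example function $f(x, y) = x^2 + xy + y^2$ and its gradient $(2x + y,\ x + 2y)$; the helper `directional_slope`, the point $(1, 2)$, and the 0.1-degree grid of bearings are assumptions made for illustration.

```python
import math


def f(x, y):
    return x**2 + x * y + y**2


def grad_f(x, y):
    """Analytic gradient of f: (2x + y, x + 2y)."""
    return (2 * x + y, x + 2 * y)


def directional_slope(x, y, angle, h=1e-5):
    """Central-difference slope of f at (x, y) along the unit vector at `angle`."""
    ux, uy = math.cos(angle), math.sin(angle)
    return (f(x + h * ux, y + h * uy) - f(x - h * ux, y - h * uy)) / (2 * h)


x0, y0 = 1.0, 2.0
# Try 3600 compass bearings (one every 0.1 degree) and keep the steepest.
angles = [2 * math.pi * k / 3600 for k in range(3600)]
best_angle = max(angles, key=lambda a: directional_slope(x0, y0, a))
best_slope = directional_slope(x0, y0, best_angle)

gx, gy = grad_f(x0, y0)
grad_angle = math.atan2(gy, gx)  # bearing of the gradient vector
grad_norm = math.hypot(gx, gy)   # length of the gradient vector

print("steepest bearing found (deg):", math.degrees(best_angle))
print("gradient bearing (deg):      ", math.degrees(grad_angle))
print("steepest slope found:        ", best_slope)
print("gradient length:             ", grad_norm)
```

Up to the resolution of the grid, the best bearing should coincide with the gradient's bearing and the best slope with the gradient's length.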

Interactive Lab: Gradient-Descent Visualizer

Drag the starting point and press step. Each move goes in the direction of $-\nabla L$, straight downhill according to the local gradient. Watch how each step leaves its starting point at right angles to the contour line through it, never along it. We will see below why that perpendicularity is not a coincidence.

Formal definitions

The notation $\nabla$ ("nabla" or "del") is standard. Read $\nabla f$ as "grad $f$." Because the gradient has one component per input, $\nabla L$ for a model with $p$ parameters is a $p$-dimensional vector: one number per parameter, telling you how the loss responds to nudging that parameter alone.
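
For reference, the standard limit-based definitions behind this notation can be written as follows. This is the usual textbook formulation, given here as a sketch: $\mathbf{e}_i$ denotes the $i$-th standard basis vector and $h$ the size of the nudge, as in the numerical-gradient section of this chapter, and $f$ is assumed to be a function of $n$ real inputs for which the limit exists.

```latex
\frac{\partial f}{\partial x_i}(\mathbf{x})
  = \lim_{h \to 0} \frac{f(\mathbf{x} + h\,\mathbf{e}_i) - f(\mathbf{x})}{h},
\qquad
\nabla f(\mathbf{x})
  = \left( \frac{\partial f}{\partial x_1}(\mathbf{x}),\ \ldots,\ \frac{\partial f}{\partial x_n}(\mathbf{x}) \right).
```

Adding $h\,\mathbf{e}_i$ changes only the $i$-th coordinate of $\mathbf{x}$, which is exactly the "hold the others fixed" rule written in symbols.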

Worked example: both partials and the gradient
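
One way to carry out the computation, using the function $f(x, y) = x^2 + xy + y^2$ and the point $(1, 2)$ quoted in this chapter's summary. The step-by-step layout is a sketch; only the final results are stated in the summary.

```latex
\begin{aligned}
\frac{\partial f}{\partial x}
  &= \frac{\partial}{\partial x}\left(x^2 + xy + y^2\right) = 2x + y
  && \text{($y$ is a constant, so $y^2$ contributes $0$)} \\
\frac{\partial f}{\partial y}
  &= \frac{\partial}{\partial y}\left(x^2 + xy + y^2\right) = x + 2y
  && \text{($x$ is a constant, so $x^2$ contributes $0$)} \\
\nabla f(x, y) &= (2x + y,\ x + 2y) \\
\nabla f(1, 2) &= (2 \cdot 1 + 2,\ 1 + 2 \cdot 2) = (4,\ 5)
\end{aligned}
```

The cross term $xy$ is the only one that shows up in both partials: it behaves like "constant times $x$" in the first line and "constant times $y$" in the second.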

Why the gradient points the steepest way uphill

Why should the humble list of axis-slopes happen to encode the single best direction to climb? Here is the intuition, no heavy machinery required.
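
The argument can be compressed into one line, sketched here under the assumption that $f$ is differentiable at the point in question. The symbols $\mathbf{u}$ (a unit direction), $D_{\mathbf{u}} f$ (the slope of $f$ when walking along $\mathbf{u}$), and $\varphi$ (the angle between $\mathbf{u}$ and $\nabla f$) are introduced for this sketch only.

```latex
D_{\mathbf{u}} f
  = \nabla f \cdot \mathbf{u}
  = \lVert \nabla f \rVert \, \lVert \mathbf{u} \rVert \cos\varphi
  = \lVert \nabla f \rVert \cos\varphi,
\qquad \lVert \mathbf{u} \rVert = 1.
```

Because $\cos\varphi \le 1$, with equality only at $\varphi = 0$, the slope is largest when $\mathbf{u}$ points along $\nabla f$, and that largest slope equals $\lVert \nabla f \rVert$. At $\varphi = 90^\circ$ the slope is zero, which is the direction in which $f$ does not change to first order, that is, the direction of the contour line. At $\varphi = 180^\circ$ the slope is $-\lVert \nabla f \rVert$, the steepest descent.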

Reading a contour plot is now a superpower. Where contour lines (drawn at equally spaced levels) bunch tightly, the function is changing fast, so $\lVert \nabla f \rVert$ is large. Where they spread out, the gradient is small: near a flat minimum the contours are far apart and $\nabla f \to \mathbf{0}$. The gradient arrow at any point stabs straight across the contours, toward higher ground.
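
A quick numerical check of the "across, not along" picture: take one small step along the gradient and one small step at right angles to it, and compare how much $f$ changes. The sketch assumes the chapter's example function $f(x, y) = x^2 + xy + y^2$; the point $(1, 2)$, the step length `step`, and the variable names are assumptions made for illustration.

```python
import math


def f(x, y):
    return x**2 + x * y + y**2


def grad_f(x, y):
    """Analytic gradient of f: (2x + y, x + 2y)."""
    return (2 * x + y, x + 2 * y)


x0, y0 = 1.0, 2.0
gx, gy = grad_f(x0, y0)
norm = math.hypot(gx, gy)

across = (gx / norm, gy / norm)    # unit vector along the gradient
tangent = (-gy / norm, gx / norm)  # the same vector rotated by 90 degrees

step = 1e-3
change_across = f(x0 + step * across[0], y0 + step * across[1]) - f(x0, y0)
change_along = f(x0 + step * tangent[0], y0 + step * tangent[1]) - f(x0, y0)

print("change in f stepping along the gradient:  ", change_across)
print("change in f stepping along the contour:   ", change_along)
print("predicted first-order change (step*norm): ", step * norm)
```

The step along the gradient changes $f$ by roughly `step * norm`, while the perpendicular step changes it only at second order in `step`, which is what staying on a contour line means locally.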

ML use case: the gradient of a loss over parameters

This is the reason the whole chapter exists. Training a model means choosing parameters $\boldsymbol\theta$ to make a scalar loss $L(\boldsymbol\theta)$ small. The loss is a function of possibly billions of parameters, and we want the direction that reduces it fastest. That direction is exactly the negative gradient.

Each component $\partial L / \partial \theta_j$ answers a local question: "if I nudge this one parameter and leave the other $p-1$ alone, does the loss go up or down, and how fast?" Stack all $p$ answers and you have $\nabla L$, a vector in the same space as the parameters. Gradient descent then updates every parameter by stepping a small amount opposite the gradient:

$$\boldsymbol\theta \leftarrow \boldsymbol\theta - \eta\,\nabla L(\boldsymbol\theta) \tag{17.3}$$

where $\eta > 0$ is the learning rate (the step size). The minus sign is the entire point: the gradient points uphill toward larger loss, so we subtract it to go downhill. Repeat this step millions of times and the parameters slide into a valley of low loss. That is training, stripped to its skeleton.
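
The update rule can be sketched on a toy problem. The code below is a minimal illustration, not a real training loop: it reuses the chapter's example function as a stand-in "loss" over two parameters (its minimum is at the origin), and the names `loss`, `grad_loss`, and `gradient_descent`, the learning rate 0.1, the starting point, and the step count are all assumptions chosen for the example.

```python
def loss(theta):
    """Toy loss over two parameters: L = t1^2 + t1*t2 + t2^2 (minimum at the origin)."""
    t1, t2 = theta
    return t1**2 + t1 * t2 + t2**2


def grad_loss(theta):
    """Analytic gradient of the toy loss: (2*t1 + t2, t1 + 2*t2)."""
    t1, t2 = theta
    return (2 * t1 + t2, t1 + 2 * t2)


def gradient_descent(theta, eta=0.1, steps=100):
    """Repeat theta <- theta - eta * grad L(theta) for a fixed number of steps."""
    for _ in range(steps):
        g = grad_loss(theta)
        # Every parameter moves opposite its own partial derivative.
        theta = tuple(t - eta * g_j for t, g_j in zip(theta, g))
    return theta


theta_start = (1.0, 2.0)
theta_final = gradient_descent(theta_start)

print("start:", theta_start, "loss:", loss(theta_start))
print("final:", theta_final, "loss:", loss(theta_final))
```

Flipping the minus sign to a plus in the update line turns the same loop into gradient ascent, and on this toy loss the parameters then move away from the minimum instead of toward it.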

NumPy: numerical gradient vs. analytic

You can always check a hand-derived gradient by approximating each partial numerically. Nudge one coordinate by a tiny $h$, see how $f$ changes, and divide: that is the definition, done arithmetically. The central difference, $\frac{\partial f}{\partial x_i} \approx \frac{f(\mathbf{x} + h\,\mathbf{e}_i) - f(\mathbf{x} - h\,\mathbf{e}_i)}{2h}$, is far more accurate than a one-sided nudge: its error shrinks like $h^2$ rather than $h$. We do it partial by partial, one coordinate at a time, holding the others exactly fixed, and compare to the formula $\nabla f = (2x + y,\ x + 2y)$. Run it:

numerical_gradient.py
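
A minimal sketch of what such a script might look like, assuming NumPy is installed. The function and its analytic gradient come from this chapter; the evaluation point $(1, 2)$, the step `h = 1e-5`, and the names `numerical_gradient` and `analytic_gradient` are assumptions made for illustration.

```python
import numpy as np


def f(x):
    """f(x, y) = x^2 + x*y + y^2, with the inputs packed in a length-2 array."""
    return x[0] ** 2 + x[0] * x[1] + x[1] ** 2


def analytic_gradient(x):
    """Hand-derived gradient: (2x + y, x + 2y)."""
    return np.array([2 * x[0] + x[1], x[0] + 2 * x[1]])


def numerical_gradient(func, x, h=1e-5):
    """Central-difference gradient, one coordinate at a time."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        e_i = np.zeros_like(x)
        e_i[i] = 1.0  # i-th standard basis vector: only coordinate i moves
        grad[i] = (func(x + h * e_i) - func(x - h * e_i)) / (2 * h)
    return grad


point = np.array([1.0, 2.0])
num = numerical_gradient(f, point)
ana = analytic_gradient(point)

print("numerical:", num)
print("analytic: ", ana)
print("max abs difference:", np.max(np.abs(num - ana)))
```

The loop costs two evaluations of `func` per coordinate, which is why this approach is reserved for checking gradients rather than computing them during training.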

The loop is the definition of the gradient made literal: it walks the axes, perturbing one coordinate while the rest stay put, and records each slope. The central difference agrees with our algebra to six decimals. In real training you would never compute gradients this way — it costs two function evaluations per parameter, hopeless for billions — but it is the gold-standard gradient check for verifying that a hand-written analytic gradient (or a backprop implementation) is correct.

Common mistakes

Summary

  • A partial derivative $\partial f / \partial x_i$ is the ordinary single-variable derivative taken while holding every other variable fixed (treating them as constants). The $\partial$ symbol flags the frozen companions.
  • The gradient $\nabla f = (\partial f/\partial x_1, \ldots, \partial f/\partial x_n)$ collects all partials into one vector, the same dimension as the input.
  • The gradient points in the direction of steepest ascent; its length is the rate of climb; and it is perpendicular to the contours of $f$. Negate it for steepest descent.
  • For $f(x,y) = x^2 + xy + y^2$, $\nabla f = (2x + y,\ x + 2y)$; at $(1,2)$ this is $(4, 5)$.
  • Training computes $\nabla L(\boldsymbol\theta)$, the loss's partial with respect to every parameter (this is what backprop produces), and gradient descent steps opposite it: $\boldsymbol\theta \leftarrow \boldsymbol\theta - \eta\,\nabla L$.
  • A numerical gradient (central differences, one coordinate at a time) checks an analytic gradient but is far too slow to train with.

Active recall

Answer from memory before checking the lesson:

  1. In words, how do you compute $\partial f / \partial x$ for a function of $x$ and $y$? What do you do with $y$?
  2. Compute both partials of $f(x, y) = x^2 + xy + y^2$ and give $\nabla f$ at the point $(0, 3)$.
  3. What shape is $\nabla L$ for a model with $p$ parameters, and what does its $j$-th component mean?
  4. Why does the gradient-descent update in eq. 17.3 have a minus sign? What goes wrong if you use a plus?
  5. Geometrically, what is the angle between the gradient and the contour line passing through the same point?

Exercises

Level A: Recall & basic calculation

Level B: Conceptual understanding

Level C: Derivation & implementation

Level D: Research-thinking challenge