Introductory Partial Derivatives
Slopes in many directions and the gradient vector
Prerequisites
Learning objectives
- Compute partial derivatives by holding variables fixed
- Assemble partials into a gradient vector ∇f
- Read a contour plot and the gradient's geometric meaning
- Reason about the gradient of a loss with respect to parameters
Why a single derivative is not enough
Back in the one-variable world, the derivative answered one clean question: if I nudge the input, how does the output respond? That number is the engine of optimization — it tells you which way is downhill. But every model you will ever train has more than one knob. A tiny logistic-regression model has a dozen parameters; a transformer has billions. The loss is a function not of one number but of a whole vector of parameters .
So we need a derivative that works when there are many inputs at once. The idea is almost embarrassingly simple: vary one input at a time, hold the rest still. That single move — differentiate with respect to one variable while freezing the others — is the partial derivative, and collecting all of them into one vector gives the gradient, the object that training actually computes and follows. This chapter is the bridge from "the slope of a curve" to "the direction a billion-parameter model should step."
Intuition: slope along one axis, and a compass pointing uphill
Picture a surface — a landscape of hills and valleys sitting above the -plane. You are standing at one point on it. There are now two independent ways to walk:
- Walk east (increase , keep fixed). The steepness you feel is the partial derivative — the ordinary one-variable slope of the curve you get by slicing the surface along the -direction.
- Walk north (increase , keep fixed). That steepness is .
Each partial is just a familiar single-variable derivative in disguise: pretend every other variable is a constant and differentiate as usual. Holding fixed turns into a function of alone, and you already know how to differentiate that.
Now the payoff. You have two slopes, east and north. Bundle them into a vector . Remarkably, this one vector points in the direction of steepest ascent — the single compass bearing that climbs the hill fastest from where you stand. Its length says how steep that fastest climb is. Turn it around, , and you have the direction of steepest descent: the way training walks to reduce the loss.
Drag the starting point and press step. Each move goes in the direction of — straight downhill according to the local gradient. Watch how the path always cuts across the contour lines at right angles, never along them. We will see below why that perpendicularity is not a coincidence.
Formal definitions
| Symbol | Meaning | Type | Shape | Role |
|---|---|---|---|---|
| A scalar-valued function of several variables | function | ℝⁿ→ℝ | fixed | |
| Partial derivative: slope along axis i, others held fixed | function | ℝⁿ→ℝ | derived | |
| Gradient: vector of all partials; points uphill | vector | n | derived | |
| Gradient evaluated at a point a (a concrete vector) | vector | n | variable | |
| The parameter vector of a model | vector | p | variable | |
| A scalar loss to be minimized | function | ℝᵖ→ℝ | fixed |
The notation ("nabla" or "del") is standard. Read as "grad ." Because the gradient has one component per input, for a model with parameters is a -dimensional vector — one number per parameter, telling you how the loss responds to nudging that parameter alone.
Worked example: both partials and the gradient
Why the gradient points the steepest way uphill
Why should the humble list of axis-slopes happen to encode the single best direction to climb? Here is the intuition, no heavy machinery required.
Reading a contour plot is now a superpower. Where contour lines bunch tightly, the function is changing fast, so is large. Where they spread out, the gradient is small — near a flat minimum the contours are far apart and . The gradient arrow at any point stabs straight across the contours, toward higher ground.
ML use case: the gradient of a loss over parameters
This is the reason the whole chapter exists. Training a model means choosing parameters to make a scalar loss small. The loss is a function of possibly billions of parameters, and we want the direction that reduces it fastest. That direction is exactly the negative gradient.
Each component answers a local question: "if I nudge this one parameter and leave the other alone, does the loss go up or down, and how fast?" Stack all answers and you have , a vector in the same space as the parameters. Gradient descent then updates every parameter by stepping a small amount opposite the gradient:
where is the learning rate (the step size). The minus sign is the entire point: the gradient points uphill toward larger loss, so we subtract it to go downhill. Repeat this step millions of times and the parameters slide into a valley of low loss. That is training, stripped to its skeleton.
NumPy: numerical gradient vs. analytic
You can always check a hand-derived gradient by approximating each partial numerically. Nudge one coordinate by a tiny , see how changes, divide — that is the definition, done arithmetically. Using a central difference, , is far more accurate than a one-sided nudge. We do it partial by partial, one coordinate at a time — exactly holding the others fixed — and compare to the formula . Run it:
The loop is the definition of the gradient made literal: it walks the axes, perturbing one coordinate while the rest stay put, and records each slope. The central difference agrees with our algebra to six decimals. In real training you would never compute gradients this way — it costs two function evaluations per parameter, hopeless for billions — but it is the gold-standard gradient check for verifying that a hand-written analytic gradient (or a backprop implementation) is correct.
Common mistakes
Summary
- A partial derivative is the ordinary single-variable derivative taken while holding every other variable fixed (treating them as constants). The symbol flags the frozen companions.
- The gradient collects all partials into one vector, the same dimension as the input.
- The gradient points in the direction of steepest ascent; its length is the rate of climb; and it is perpendicular to the contours of . Negate it for steepest descent.
- For , ; at this is .
- Training computes — the loss's partial with respect to every parameter (this is what backprop produces) — and gradient descent steps opposite it: .
- A numerical gradient (central differences, one coordinate at a time) checks an analytic gradient but is far too slow to train with.
Active recall
Answer from memory before checking the lesson:
- In words, how do you compute for a function of and ? What do you do with ?
- Compute both partials of and give at the point .
- What shape is for a model with parameters, and what does its -th component mean?
- Why does the gradient-descent update in eq. 17.3 have a minus sign? What goes wrong if you use a plus?
- Geometrically, what is the angle between the gradient and the contour line passing through the same point?
Exercises
Level ARecall & basic calculation
Gradient of the worked example
For we found . Evaluate the gradient at the point . Enter as x, y.
A single partial derivative
Let . Compute and evaluate it at (its value does not depend on ).
Partial of a product with powers
Let . Compute and evaluate it at the point .
Gradient of a linear function
Let . Give the gradient . Enter as x, y, z.
The held variable is a constant, not zero
Let . Compute and evaluate it at the point .
What does the gradient point toward?
At a given point, the gradient points in the direction of what?
Level BConceptual understanding
Gradient and contour lines
On a contour plot of , what is the angle between the gradient at a point and the contour line passing through that same point?
The sign in the descent step
Gradient descent updates parameters as . Why is there a minus sign?
Shape of the loss gradient
A model has parameters collected in , and is a scalar loss. What is the shape of ?
Reading a single loss partial
During training you compute for one parameter. In one or two sentences, say what this tells you about , and which way gradient descent will move it.
Spot the differentiation error
A student differentiates and writes , reasoning that the and terms 'have in them, so they're zero.' What did they get wrong, and what is the correct ?
Level CDerivation & implementation
Implement a numerical gradient
Write numerical_gradient(f, v, h=1e-5) that approximates at a point v using central differences, one coordinate at a time. Test it on at , confirm it matches the analytic answer with np.allclose, and print ok.
Why steepest ascent
Assume the directional derivative in a unit direction is . Using the geometric form of the dot product, show that increases fastest when points along .
One gradient-descent step
Minimize by gradient descent, starting at with learning rate . Recall . Compute . Enter as x, y.
Level DResearch-thinking challenge
Numerical gradients vs. backprop
The central-difference numerical gradient is simple and always available, yet no one trains large networks with it. Explain the cost of the numerical approach for a model with parameters, why backpropagation is dramatically cheaper, and one situation where the numerical gradient is still the right tool.
How big should the step be?
Gradient descent moves . The gradient gives the best direction, but not how far to go. Describe qualitatively what goes wrong when the learning rate is far too large, and what goes wrong when it is far too small. Why can't we just read the ideal step size off the gradient itself?