Derivatives
Sensitivity, slope, and the rules that compute it
Learning objectives
- Define the derivative from first principles as a limit
- Apply power, sum, product, and quotient rules
- Differentiate exponentials and logarithms
- Approximate derivatives numerically and bound the error
Why a single number tells you where to move
Training a model is a search. You have a loss — one number that says how wrong the model currently is — and millions of parameters you are allowed to nudge. The whole enterprise rests on one question, asked over and over: if I change this one parameter a little, does the loss go up or down, and by how much? The answer to that question, for one parameter at a time, is a derivative.
The derivative is the most important object in this course. Gradient descent is just "compute derivatives, step downhill." Backpropagation is just "compute a great many derivatives efficiently." Before any of that, we need the derivative itself, seen — like the vector before it — several ways at once:
- Geometrically, as the slope of the tangent line to a curve at a point.
- Physically, as an instantaneous rate of change.
- For ML, as sensitivity: how much the output moves per unit of input.
Those are three descriptions of the same limit. This chapter builds it from first principles, states the handful of rules you will reuse forever, and then computes derivatives numerically — the trick behind gradient checking.
Intuition: slope, and then sensitivity
Take a smooth curve and a point on it. Zoom in. Keep zooming. A smooth curve, magnified enough, becomes indistinguishable from a straight line: its tangent. The slope of that line is the derivative at the point. A steep tangent means the output is changing quickly; a flat tangent ($f'(x) = 0$) means the output is, for an instant, not changing at all, which is the signature of a peak, a valley, or a plateau.
Slope is the geometric picture. The one that will matter most to you is sensitivity. Read $f'(x)$ as an exchange rate: a one-unit change in the input buys approximately $f'(x)$ units of change in the output. If $f'(x_0) = 5$, then near $x_0$ the function amplifies wiggles by a factor of five: push $x$ up by a tiny $\delta$ and $f$ rises by about $5\delta$. If $f'(x_0)$ is small and negative, the output barely moves, and moves the opposite way.
Picture a point sliding along the curve with its tangent line attached. The tangent pivots as the point moves: where the curve climbs steeply the slope is large, where it flattens the slope passes through zero, and on downhill stretches it goes negative. Now add the secant line through that point and a second, nearby point on the curve: it snaps onto the tangent as you shrink the gap. That shrinking gap is the entire definition, which we now write down.
The derivative from first principles
The tangent is the limit of secant lines. Pick a point $x$ and a nearby point $x + h$. The straight line through $(x, f(x))$ and $(x + h, f(x + h))$ has slope
$$\frac{f(x + h) - f(x)}{h}.$$
This is the difference quotient: the average rate of change over a step of size $h$. Now let the step shrink. If the slope settles on a single finite value as $h \to 0$, that value is the derivative:
$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}.$$
The limit is doing real work: it is the difference between an average rate over a finite step and an instantaneous rate at a single point. Not every function has one everywhere: $|x|$ has no single tangent slope at $x = 0$ (the kink), and ReLU inherits exactly that non-differentiable corner. Where the limit exists, though, it is unique, and it is what we compute.
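To make the kink concrete, here is a minimal sketch that evaluates the difference quotient of the absolute value at zero, once stepping right and once stepping left. The helper name `difference_quotient` and the chosen step sizes are illustrative assumptions.

```python
def difference_quotient(f, x, h):
    """Slope of the secant line through (x, f(x)) and (x + h, f(x + h))."""
    return (f(x + h) - f(x)) / h


# Approach the kink of |x| at x = 0 from both sides.
for h in (0.1, 0.001, 1e-6):
    right = difference_quotient(abs, 0.0, h)    # step to the right of the kink
    left = difference_quotient(abs, 0.0, -h)    # step to the left of the kink
    print(f"h = {h:g}: right slope = {right:+.1f}, left slope = {left:+.1f}")
```

The two one-sided slopes stay at +1 and -1 however small the step, so the quotient never settles on a single value and the limit does not exist at the kink.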
Doing it once by hand: the slope of $x^2$
Nothing beats grinding the limit out once. Let $f(x) = x^2$. Form the difference quotient and simplify before taking the limit. The whole game is cancelling the $h$ in the denominator so that setting $h = 0$ is safe.
$$\frac{(x + h)^2 - x^2}{h} = \frac{x^2 + 2xh + h^2 - x^2}{h} = \frac{h(2x + h)}{h} = 2x + h \;\longrightarrow\; 2x \quad \text{as } h \to 0.$$
The pattern in that cancellation (expand, subtract, factor out $h$, then let $h$ vanish) is how every rule below is proved. You will almost never do it by hand again, because the rules cache the results. But the machinery underneath is always this limit.
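For $f(x) = x^2$ the difference quotient simplifies algebraically to $2x + h$. The sketch below checks that numerically at $x = 3$; the function and variable names are illustrative, not taken from the lesson.

```python
def difference_quotient(f, x, h):
    """Average rate of change of f over the step from x to x + h."""
    return (f(x + h) - f(x)) / h


def square(x):
    return x * x


x = 3.0
for h in (1.0, 0.1, 0.01, 0.001):
    slope = difference_quotient(square, x, h)
    # After cancelling h, the quotient for x^2 is 2x + h (up to rounding).
    print(f"h = {h:<6} secant slope = {slope:.6f}   2x + h = {2 * x + h:.6f}")
```

As the step shrinks, the leftover $h$ fades and the secant slope approaches $2x = 6$.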
The power rule, and the rule table
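Run the same argument on $f(x) = x^n$. The binomial expansion gives $(x + h)^n = x^n + n x^{n-1} h + \binom{n}{2} x^{n-2} h^2 + \cdots + h^n$. Subtract $x^n$, divide by $h$:
$$\frac{(x + h)^n - x^n}{h} = n x^{n-1} + \binom{n}{2} x^{n-2} h + \cdots + h^{n-1}.$$
Every surviving term except the first still has an $h$ in it, so it dies in the limit, leaving the power rule:
$$\frac{d}{dx} x^n = n x^{n-1}.$$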
It holds for every real exponent $n$, not just positive integers: $\sqrt{x} = x^{1/2}$ differentiates to $\tfrac{1}{2} x^{-1/2}$, and $1/x = x^{-1}$ to $-x^{-2}$. Two structural rules let you take derivatives term by term:
- Constant multiple: $(c f)' = c f'$. A constant factor multiplying the function passes straight through to the derivative.
- Sum: $(f + g)' = f' + g'$. Differentiate each piece and add.
Together they mean a polynomial differentiates one term at a time. For example, $\frac{d}{dx}\left(3x^2 + 2x + 7\right) = 6x + 2$: the power rule on each term, the constant contributing $0$ because a flat line has zero slope.
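Term-by-term differentiation is mechanical enough to write down directly. The following is a minimal sketch that assumes a polynomial is stored as a list of coefficients, lowest power first; that representation and the names `differentiate_poly` and `evaluate_poly` are choices made for this illustration.

```python
def differentiate_poly(coeffs):
    """coeffs[k] is the coefficient of x**k; returns the derivative's coefficients."""
    # Power rule per term: d/dx (c * x**k) = k * c * x**(k - 1).
    # The k = 0 entry is dropped because a constant has zero slope.
    return [k * c for k, c in enumerate(coeffs)][1:]


def evaluate_poly(coeffs, x):
    return sum(c * x**k for k, c in enumerate(coeffs))


# Hypothetical example: p(x) = 7 + 2x + 3x^2, so p'(x) = 2 + 6x.
p = [7, 2, 3]
dp = differentiate_poly(p)
print(dp)                      # [2, 6]
print(evaluate_poly(dp, 1.0))  # 8.0
```

The sum rule is what lets the list comprehension treat each term independently, and the constant-multiple rule is what lets each coefficient ride along unchanged.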
The rules you will actually reuse
Four more rules cover almost everything you will differentiate in this course. The two that trip people up are the product and quotient rules, because they are not what naive guessing suggests: in general $(fg)' \neq f'g'$.
| Symbol | Meaning | Type | Shape | Role |
|---|---|---|---|---|
| $(x^n)' = n x^{n-1}$ | Power rule (any real n) | rule | — | core |
| $(f + g)' = f' + g'$ | Sum rule: differentiate termwise | rule | — | core |
| $(c f)' = c f'$ | Constant multiple passes through | rule | — | core |
| $(fg)' = f'g + fg'$ | Product rule | rule | — | core |
| $(f/g)' = \frac{f'g - fg'}{g^2}$ | Quotient rule (g ≠ 0) | rule | — | core |
| $(e^x)' = e^x$ | The exponential is its own derivative | rule | — | core |
| $(\ln x)' = \frac{1}{x}$ | Natural log, for x > 0 | rule | — | core |
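The product rule deserves its shape. Picture $f \cdot g$ as the area of a rectangle with sides $f$ and $g$. Over a small step both factors change, and the area grows on two sides: $f$'s growth scaled by the current $g$, plus $g$'s growth scaled by the current $f$. That is exactly $f'g + fg'$. Example: for $y = x^2 e^x$, take $f = x^2$ ($f' = 2x$) and $g = e^x$ ($g' = e^x$), giving $y' = 2x e^x + x^2 e^x = (x^2 + 2x) e^x$.
The quotient rule follows from the product rule applied to $f \cdot \frac{1}{g}$. Example: $\left(\frac{x}{x + 1}\right)' = \frac{1 \cdot (x + 1) - x \cdot 1}{(x + 1)^2} = \frac{1}{(x + 1)^2}$.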
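One way to convince yourself of both rules is to compare them against a numerical slope. The sketch below assumes the illustrative factors $f(x) = x^2$ and $g(x) = e^x$ and uses a symmetric difference quotient as an independent reference; the evaluation point and tolerance are arbitrary choices.

```python
import math


def central_difference(func, x, h=1e-5):
    """Numerical slope, used here only as an independent reference."""
    return (func(x + h) - func(x - h)) / (2 * h)


# Illustrative factors (assumptions, not from the text): f(x) = x^2, g(x) = e^x.
def f(x):
    return x**2


def df(x):
    return 2 * x


g = math.exp    # g(x) = e^x
dg = math.exp   # and g'(x) = e^x as well

x = 1.3
product_rule = df(x) * g(x) + f(x) * dg(x)
quotient_rule = (df(x) * g(x) - f(x) * dg(x)) / g(x) ** 2
naive_guess = df(x) * dg(x)   # the tempting but wrong "f' times g'"

numeric_product = central_difference(lambda t: f(t) * g(t), x)
numeric_quotient = central_difference(lambda t: f(t) / g(t), x)

print("product rule matches: ", math.isclose(product_rule, numeric_product, rel_tol=1e-7))
print("quotient rule matches:", math.isclose(quotient_rule, numeric_quotient, rel_tol=1e-7))
print("naive f'g' matches:   ", math.isclose(naive_guess, numeric_product, rel_tol=1e-7))
```

The first two comparisons agree with the numerical slope, while the naive product of derivatives does not.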
Two special functions round out the toolkit. The exponential $e^x$ equals its own derivative, and it is the unique function with that property and $f(0) = 1$: its rate of growth is its current value, which is why it models everything that compounds. And the natural log $\ln x$, its inverse, has derivative $1/x$: steep near zero, flattening as $x$ grows, defined only for $x > 0$. These two show up constantly in ML because the softmax, the sigmoid, and the cross-entropy loss are all built from $e^x$ and $\ln x$.
Higher derivatives
The derivative is itself a function, so you can differentiate it again. The second derivative $f''(x)$ is the rate of change of the slope, that is, the curvature. For $f(x) = x^3$: $f'(x) = 3x^2$, $f''(x) = 6x$, $f'''(x) = 6$. In optimization the sign of $f''$ at a stationary point distinguishes a valley ($f'' > 0$, convex, cupping upward) from a peak ($f'' < 0$), and the whole matrix of second derivatives, the Hessian, governs how fast second-order methods converge.
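Curvature can also be estimated numerically. The sketch below uses the standard three-point second difference, a formula this chapter does not derive, on an assumed example $f(x) = x^3 - 3x$ whose slope is zero at $x = -1$ and $x = 1$.

```python
def second_difference(f, x, h=1e-4):
    """Three-point estimate of f''(x): (f(x + h) - 2 f(x) + f(x - h)) / h^2."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2


def cubic(x):
    # Illustrative function (an assumption): f(x) = x^3 - 3x,
    # with f'(x) = 3x^2 - 3 and f''(x) = 6x.
    return x**3 - 3 * x


# f'(x) = 0 at x = -1 and x = +1; the sign of f'' tells them apart.
for x in (-1.0, 1.0):
    curvature = second_difference(cubic, x)
    kind = "valley (f'' > 0)" if curvature > 0 else "peak (f'' < 0)"
    print(f"x = {x:+.0f}: f'' is about {curvature:+.3f} -> {kind}")
```

The estimate comes out near $-6$ at $x = -1$ (a peak) and near $+6$ at $x = 1$ (a valley), matching $f''(x) = 6x$.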
ML use case: sensitivity, and gradient checking
Here is the payoff. A loss $L$ is a function of a parameter $w$. Its derivative $\frac{dL}{dw}$ is the sensitivity of the loss to that parameter: how much $L$ moves per unit change in $w$. If $\frac{dL}{dw} > 0$, nudging $w$ up increases the loss, so you want to move $w$ down; if it is negative, move $w$ up. The update
$$w \leftarrow w - \eta \frac{dL}{dw}$$
is one step of gradient descent: step against the derivative, scaled by a learning rate $\eta$. Every parameter gets nudged in the direction that its own derivative says lowers the loss. In many dimensions the collection of these per-parameter derivatives is the gradient (sensitivity generalized to a vector), but the atom is exactly the single-variable derivative of this chapter.
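A one-parameter version of this update fits in a few lines. The sketch below assumes a hypothetical quadratic loss with its minimum at 3, a starting value of 0, and a learning rate of 0.1; none of these come from the lesson.

```python
def loss(w):
    # Hypothetical one-parameter loss with its minimum at w = 3.
    return (w - 3.0) ** 2


def loss_derivative(w):
    # Expand to w^2 - 6w + 9 and differentiate term by term: 2w - 6.
    return 2.0 * w - 6.0


w = 0.0               # starting guess
learning_rate = 0.1   # the step scale in the update rule

for step in range(50):
    # Step against the derivative: positive slope moves w down, negative moves it up.
    w = w - learning_rate * loss_derivative(w)

print(f"w = {w:.4f}, loss = {loss(w):.2e}")
```

With these settings each step shrinks the distance to the minimum by a constant factor, so `w` ends up very close to 3. A learning rate that is too large would overshoot instead.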
The second payoff is practical. When you hand-code a derivative (a backprop implementation, a custom layer), how do you know it is correct? You compare it against a numerical derivative computed straight from the definition — a technique called gradient checking. If the analytic value and the numerical value agree to several digits, your math is almost certainly right; if they diverge, you have a bug. For that we need to compute derivatives numerically, and to do it well.
Computing derivatives numerically
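The naive approach evaluates the difference quotient at a small but nonzero $h$. Two choices:
- Forward difference: $f'(x) \approx \frac{f(x + h) - f(x)}{h}$. Accuracy $O(h)$: halving $h$ roughly halves the error.
- Central difference: $f'(x) \approx \frac{f(x + h) - f(x - h)}{2h}$. Accuracy $O(h^2)$: halving $h$ cuts the error by a factor of four.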
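The two accuracy orders can be observed directly by halving the step and watching how each error shrinks. This sketch assumes the test case $f(x) = e^x$ at $x = 1$, chosen because its exact derivative is known; the step sizes are kept large enough that round-off is negligible.

```python
import math


def forward_difference(f, x, h):
    return (f(x + h) - f(x)) / h


def central_difference(f, x, h):
    return (f(x + h) - f(x - h)) / (2 * h)


# Assumed test case: f(x) = e^x at x = 1, where the exact derivative is e.
x = 1.0
exact = math.exp(x)
steps = [0.1, 0.05, 0.025]

forward_errors = [abs(forward_difference(math.exp, x, h) - exact) for h in steps]
central_errors = [abs(central_difference(math.exp, x, h) - exact) for h in steps]

for h, fe, ce in zip(steps, forward_errors, central_errors):
    print(f"h = {h:<6} forward error = {fe:.2e}   central error = {ce:.2e}")

# Ratio of successive errors each time h is halved.
print("forward ratios:", [round(a / b, 2) for a, b in zip(forward_errors, forward_errors[1:])])
print("central ratios:", [round(a / b, 2) for a, b in zip(central_errors, central_errors[1:])])
```

The forward ratios sit near 2 and the central ratios near 4, which is what first-order and second-order accuracy predict.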
You might think shrinking $h$ toward zero always improves the estimate. It does not, because of a second, competing error. There are two sources:
- Truncation error from stopping the limit at a finite $h$. It shrinks as $h \to 0$ (like $h^2$ for the central difference).
- Round-off error from floating point. Computing $f(x + h) - f(x - h)$ subtracts two nearly equal numbers; the leading digits cancel and you keep only the noisy trailing bits. Dividing that noise by the tiny $h$ amplifies it. This error grows as $1/h$.
The total error is a U-shaped curve: too large an $h$ and truncation dominates, too small and round-off explodes. The sweet spot for the central difference sits near $h \approx \varepsilon^{1/3} \approx 6 \times 10^{-6}$ in double precision (where $\varepsilon \approx 2.2 \times 10^{-16}$ is machine epsilon). The code below computes the analytic derivative, compares it to the central difference, and sweeps $h$ to reveal the U.
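A minimal sketch of such a sweep follows. It assumes the test function $f(x) = e^x$ at $x = 1$ and steps from $10^{-1}$ down to $10^{-12}$; the lesson's own choice of function and range may differ.

```python
import math


def central_difference(f, x, h):
    return (f(x + h) - f(x - h)) / (2 * h)


# Assumed test function: f(x) = e^x, whose analytic derivative is also e^x.
f = math.exp
x = 1.0
analytic = math.exp(x)

errors = {}
for exponent in range(1, 13):
    h = 10.0 ** -exponent
    errors[h] = abs(central_difference(f, x, h) - analytic)
    print(f"h = 1e-{exponent:02d}   error = {errors[h]:.3e}")

# The step with the smallest error should be an interior one, not the tiniest.
best_h = min(errors, key=errors.get)
print(f"smallest error at h = {best_h:.0e}")
```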
The sweep prints an error that drops, bottoms out around $h = 10^{-5}$ to $10^{-6}$, and then climbs again as round-off takes over: the U-shaped tradeoff made concrete. This is exactly the numerical derivative you would use to gradient-check a hand-written backprop pass.
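A gradient check built on the central difference might look like the sketch below. The loss, the deliberately buggy derivative, the relative-error formula, and the pass threshold of $10^{-6}$ are all assumptions made for illustration rather than a prescribed recipe.

```python
def central_difference(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)


def relative_error(a, b):
    # Scale-free comparison; the small floor avoids dividing by zero.
    return abs(a - b) / max(abs(a), abs(b), 1e-12)


def loss(w):
    # Hypothetical one-parameter loss used only for this sketch.
    return w**3 - 2.0 * w


def correct_derivative(w):
    return 3.0 * w**2 - 2.0


def buggy_derivative(w):
    return 3.0 * w**2 + 2.0   # deliberate sign error


w = 1.5
numeric = central_difference(loss, w)
for name, derivative in (("correct", correct_derivative), ("buggy", buggy_derivative)):
    err = relative_error(derivative(w), numeric)
    verdict = "pass" if err < 1e-6 else "FAIL"
    print(f"{name:8s} relative error = {err:.2e} -> {verdict}")
```

The correct derivative agrees with the numerical one to many digits, while the sign error produces a relative error of order one and is caught immediately.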
Summary
- The derivative is $f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$: the limit of the difference quotient, equal to the tangent slope, the instantaneous rate of change, and (the ML reading) the sensitivity of the output to the input.
- From first principles, $(x^2)' = 2x$; the general power rule is $(x^n)' = n x^{n-1}$, and with the sum and constant-multiple rules a polynomial differentiates term by term.
- Product: $(fg)' = f'g + fg'$. Quotient: $(f/g)' = \frac{f'g - fg'}{g^2}$. Exponential: $(e^x)' = e^x$. Log: $(\ln x)' = \frac{1}{x}$.
- Higher derivatives iterate the operation; $f''$ is curvature and its sign marks valleys ($f'' > 0$) versus peaks ($f'' < 0$).
- In ML the derivative of a loss says which way to nudge a parameter: $w \leftarrow w - \eta \frac{dL}{dw}$. Sensitivity generalizes to the gradient.
- The central difference is $O(h^2)$ accurate and beats the $O(h)$ forward difference. Choosing $h$ trades truncation error (falls with $h$) against round-off (grows as $1/h$); the error is U-shaped with a sweet spot near $h \approx \varepsilon^{1/3}$. Central differences are the tool for gradient checking hand-written derivatives.
Active recall
Answer from memory before checking the lesson:
- Write the definition of $f'(x)$ as a limit, and give the three readings of the number it produces (slope, rate, sensitivity).
- Differentiate $x^2$ from first principles. Where in the algebra does the limit become safe to evaluate?
- State the product and quotient rules. Why is $(fg)'$ not equal to $f'g'$?
- What are $\frac{d}{dx} e^x$ and $\frac{d}{dx} \ln x$?
- Why does the central difference beat the forward difference for the same $h$, and why does making $h$ extremely small hurt accuracy?
Exercises
Level A: Recall & basic calculation
Differentiate a polynomial, evaluate at a point
Let . Using the power, sum, and constant-multiple rules, compute .
Power rule at a point
Let . Compute .
Derivative of the exponential
Let . Compute . (Recall .)
Derivative of the natural log
Let . Compute .
Constant multiple rule
Let . Compute .
Derivative of a constant
What is — the derivative of the constant function ?
Power rule with a fractional exponent
Let . Compute .
Level B: Conceptual understanding
Apply the product rule
Let . Using the product rule, compute . (Use .)
Apply the quotient rule
Let . Using the quotient rule, compute .
Sensitivity and the descent direction
A loss has derivative . To decrease the loss, should you increase or decrease , and roughly how much does change if you nudge by ?
A second derivative
Let . Compute the second derivative , and state whether the curve is locally cupping upward or downward there.
Why central beats forward
The forward difference has error $O(h)$; the central difference has error $O(h^2)$. For a small $h$, which is more accurate, and by roughly what factor does the central-difference error shrink when you halve $h$?
Level C: Derivation & implementation
Differentiate x³ from first principles
Using only the limit definition $f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$, derive the derivative of $f(x) = x^3$. Show the cancellation of $h$ before taking the limit.
Implement the central-difference derivative
Implement numerical_derivative(f, x, h=1e-5) using the central difference . Verify it against the analytic derivative for (which is ) and at a couple of points using np.isclose, then print ok.
Derive the quotient rule from the product rule
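Assuming the product rule $(fg)' = f'g + fg'$ and the chain-rule fact $\left(\frac{1}{g}\right)' = -\frac{g'}{g^2}$, derive the quotient rule $\left(\frac{f}{g}\right)' = \frac{f'g - fg'}{g^2}$.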
Sweep h to find the error minimum
For at , compute the central-difference error across . Print each error, confirm with an assert that the minimum occurs at an interior (not the smallest one), then print ok.
Level D: Research-thinking challenge
Designing a gradient check
You want to gradient-check a hand-written backprop by comparing its analytic gradient against a numerical gradient. (a) Why do practitioners compare the relative error rather than the absolute error? (b) Why use the central difference rather than the forward difference here? (c) Name one failure mode where a correct analytic gradient still fails a naive gradient check, and how to avoid it.
Why h ≈ cube-root of machine epsilon
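For the central difference, the total error is roughly $E(h) \approx A h^2 + B \varepsilon / h$ for constants $A$ and $B$ that depend on $f$, where the first term is truncation and the second is round-off ($\varepsilon$ is machine epsilon, $\varepsilon \approx 2.2 \times 10^{-16}$). Minimize over $h$ to explain why the optimal step is on the order of $\varepsilon^{1/3}$, and state what that makes the best achievable error.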