Part 4 · Calculus · Chapter 14 · 90 min

Derivatives

Sensitivity, slope, and the rules that compute it

Learning objectives

  • Define the derivative from first principles as a limit
  • Apply power, sum, product, and quotient rules
  • Differentiate exponentials and logarithms
  • Approximate derivatives numerically and bound the error

Why a single number tells you where to move

Training a model is a search. You have a loss — one number that says how wrong the model currently is — and millions of parameters you are allowed to nudge. The whole enterprise rests on one question, asked over and over: if I change this one parameter a little, does the loss go up or down, and by how much? The answer to that question, for one parameter at a time, is a derivative.

The derivative is the most important object in this course. Gradient descent is just "compute derivatives, step downhill." Backpropagation is just "compute a great many derivatives efficiently." Before any of that, we need the derivative itself, seen — like the vector before it — several ways at once:

  • Geometrically, as the slope of the tangent line to a curve at a point.
  • Physically, as an instantaneous rate of change.
  • For ML, as sensitivity: how much the output moves per unit of input.

Those are three descriptions of the same limit. This chapter builds it from first principles, states the handful of rules you will reuse forever, and then computes derivatives numerically — the trick behind gradient checking.

Intuition: slope, and then sensitivity

Take a smooth curve $y = f(x)$ and a point on it. Zoom in. Keep zooming. A smooth curve, magnified enough, becomes indistinguishable from a straight line: its tangent. The slope of that line is the derivative $f'(x)$ at the point. A steep tangent means the output is changing quickly; a flat tangent ($f'(x) = 0$) means the output is, for an instant, not changing at all, which is the signature of a peak, a valley, or a plateau.

Slope is the geometric picture. The one that will matter most to you is sensitivity. Read $f'(x)$ as an exchange rate: a one-unit change in the input buys approximately $f'(x)$ units of change in the output. If $f'(3) = 5$, then near $x = 3$ the function amplifies wiggles by a factor of five: push $x$ up by a tiny $\delta$ and $f$ rises by about $5\delta$. If $f'(3) = -0.2$, the output barely moves, and moves the opposite way.
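The exchange-rate reading can be checked with a few lines of arithmetic. The sketch below assumes a hypothetical function, $f(x) = x^2 - x$, picked only because its derivative at $x = 3$ happens to be $5$; the names `f`, `f_prime`, and `delta` are illustrative.

```python
def f(x):
    # Hypothetical example function, chosen so that f'(3) = 5.
    return x**2 - x


def f_prime(x):
    # Analytic derivative of the example function: 2x - 1.
    return 2 * x - 1


x0 = 3.0
delta = 1e-3  # a small nudge to the input

actual_change = f(x0 + delta) - f(x0)
predicted_change = f_prime(x0) * delta  # sensitivity times the nudge

print(f"f'(3)            = {f_prime(x0)}")
print(f"actual change    = {actual_change:.8f}")
print(f"predicted change = {predicted_change:.8f}")
```

The two changes agree up to a term of order $\delta^2$, which is why the approximation only holds for small nudges.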

[Interactive lab: Derivative Visualizer]

Drag the point along the curve above. Watch the tangent line pivot: where the curve climbs steeply the slope readout is large, where it flattens the slope passes through zero, and on downhill stretches it goes negative. The secant line through two nearby points is drawn too — notice it snaps onto the tangent as you shrink the gap. That shrinking gap is the entire definition, which we now write down.

The derivative from first principles

The tangent is the limit of secant lines. Pick a point $x$ and a nearby point $x + h$. The straight line through $(x, f(x))$ and $(x+h, f(x+h))$ has slope

$$\frac{f(x+h) - f(x)}{h}.$$

This is the difference quotient, the average rate of change over a step of size $h$. Now let the step shrink. If the slope settles on a single finite value as $h \to 0$, that value is the derivative:

$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}.$$

The limit is doing real work: it is the difference between an average rate over a finite step and an instantaneous rate at a single point. Not every function has one everywhere: $\lvert x\rvert$ has no single tangent slope at $0$ (the kink), and ReLU inherits exactly that non-differentiable corner. Where the limit exists, though, it is unique, and it is what we compute.
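A small numerical experiment shows both behaviours: secant slopes settling on one value for a smooth function, and refusing to agree at a kink. This is a minimal sketch; `difference_quotient` and `square` are illustrative names.

```python
def difference_quotient(f, x, h):
    """Slope of the secant line through (x, f(x)) and (x + h, f(x + h))."""
    return (f(x + h) - f(x)) / h


def square(x):
    return x * x


# Secant slopes of x^2 at x = 1 approach 2 as the step shrinks.
for h in (1.0, 0.1, 0.01, 0.001):
    print(f"h = {h:<6} slope = {difference_quotient(square, 1.0, h):.6f}")

# At the kink of |x| the answer depends on the side h approaches from,
# so no single limit exists.
right = difference_quotient(abs, 0.0, 1e-6)
left = difference_quotient(abs, 0.0, -1e-6)
print(f"|x| at 0: from the right {right}, from the left {left}")
```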

Doing it once by hand: the slope of $x^2$

Nothing beats grinding the limit out once. Let $f(x) = x^2$. Form the difference quotient and simplify before taking the limit; the whole game is cancelling the $h$ in the denominator so that setting $h = 0$ is safe.

$$\frac{(x+h)^2 - x^2}{h} = \frac{x^2 + 2xh + h^2 - x^2}{h} = \frac{h\,(2x + h)}{h} = 2x + h \;\longrightarrow\; 2x \quad \text{as } h \to 0.$$

The pattern in that cancellation (expand, subtract, factor out $h$, then let $h$ vanish) is how every rule below is proved. You will almost never do it by hand again, because the rules cache the results. But the machinery underneath is always this limit.
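For $f(x) = x^2$ the difference quotient reduces to exactly $2x + h$ once the $h$ cancels. The sketch below checks that identity with exact rational arithmetic from Python's standard `fractions` module, so floating-point round-off plays no part; the function names are illustrative.

```python
from fractions import Fraction


def difference_quotient(f, x, h):
    return (f(x + h) - f(x)) / h


def square(x):
    return x * x


x = Fraction(3)
for h in (Fraction(1, 10), Fraction(1, 1000), Fraction(1, 10**6)):
    q = difference_quotient(square, x, h)
    # After cancelling h, the quotient is exactly 2x + h.
    assert q == 2 * x + h
    print(f"h = {h}: quotient = {q}")
```

As $h$ shrinks, the printed quotients close in on $2x = 6$, the value the limit assigns.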

The power rule, and the rule table

Run the same argument on $f(x) = x^n$. The binomial expansion gives $(x+h)^n = x^n + n x^{n-1} h + (\text{terms with } h^2 \text{ and higher})$. Subtract $x^n$, divide by $h$:

$$\frac{(x+h)^n - x^n}{h} = n x^{n-1} + (\text{terms with } h \text{ and higher}).$$

Every surviving term except the first still has an $h$ in it, so it dies in the limit, leaving the power rule:

$$\frac{d}{dx}\,x^n = n x^{n-1}.$$

It holds for every real exponent $n$, not just the positive integers covered by the binomial argument: $\sqrt{x} = x^{1/2}$ differentiates to $\tfrac12 x^{-1/2}$, and $1/x = x^{-1}$ to $-x^{-2}$. Two structural rules let you take derivatives term by term:

  • Constant multiple: $\dfrac{d}{dx}\big[c\,f(x)\big] = c\,f'(x)$. A constant scale on the output passes straight through to the derivative.
  • Sum: $\dfrac{d}{dx}\big[f(x) + g(x)\big] = f'(x) + g'(x)$. Differentiate each piece and add.

Together they mean a polynomial differentiates one term at a time. For example $\dfrac{d}{dx}\big(3x^4 - 5x^2 + 7\big) = 12x^3 - 10x$: the power rule on each term, with the constant $7$ contributing $0$ because a flat line has zero slope.
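Term-by-term differentiation is mechanical enough to fit in one line of code. The sketch below assumes a polynomial is stored as a list of coefficients in increasing powers of $x$; that representation and the name `differentiate_poly` are illustrative choices.

```python
def differentiate_poly(coeffs):
    """Differentiate a polynomial given as [a0, a1, a2, ...],
    where coeffs[k] multiplies x**k.

    Power rule plus constant-multiple rule turn a_k * x**k into
    k * a_k * x**(k - 1); the sum rule lets us do this per term.
    """
    return [k * a for k, a in enumerate(coeffs)][1:]


# 3x^4 - 5x^2 + 7 is stored as [7, 0, -5, 0, 3].
p = [7, 0, -5, 0, 3]
print(differentiate_poly(p))  # coefficients of 12x^3 - 10x
```

The result `[0, -10, 0, 12]` reads as $-10x + 12x^3$; the constant term has dropped out.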

The rules you will actually reuse

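Four more rules cover almost everything you will differentiate in this course. The two that trip people up are the product and quotient rules, because they are not what naive guessing suggests: the derivative of a product is not the product of the derivatives.

  • Product: $(fg)' = f'g + fg'$
  • Quotient: $\left(\dfrac{f}{g}\right)' = \dfrac{f'g - fg'}{g^2}$
  • Exponential: $\dfrac{d}{dx}e^x = e^x$
  • Log: $\dfrac{d}{dx}\ln x = \dfrac{1}{x}$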

The product rule deserves its shape. Picture $f$ and $g$ as the side lengths of a rectangle with area $f\cdot g$. Over a small step both sides change, and the area grows along two edges: $g$'s growth scaled by the current $f$, plus $f$'s growth scaled by the current $g$. That is exactly $f'g + f g'$. Example: for $h(x) = x^2 e^x$, take $f = x^2$ ($f' = 2x$) and $g = e^x$ ($g' = e^x$), giving $h'(x) = 2x\,e^x + x^2 e^x = (2x + x^2)e^x$.

The quotient rule follows from the product rule applied to $f\cdot(1/g)$. Example: $\dfrac{d}{dx}\dfrac{x}{x+1} = \dfrac{(1)(x+1) - (x)(1)}{(x+1)^2} = \dfrac{1}{(x+1)^2}$.
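Both worked examples can be sanity-checked against a numerical slope. The sketch below uses the central difference described later in this chapter; the evaluation point $x = 1.5$ and the step `h=1e-5` are arbitrary assumptions, and every function name is illustrative.

```python
import math


def central_difference(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)


def product(x):
    return x**2 * math.exp(x)


def product_derivative(x):
    # Product rule result: (2x + x^2) e^x.
    return (2 * x + x**2) * math.exp(x)


def quotient(x):
    return x / (x + 1)


def quotient_derivative(x):
    # Quotient rule result: 1 / (x + 1)^2.
    return 1 / (x + 1) ** 2


x0 = 1.5
prod_err = abs(central_difference(product, x0) - product_derivative(x0))
quot_err = abs(central_difference(quotient, x0) - quotient_derivative(x0))
naive_guess = 2 * x0 * math.exp(x0)  # the tempting but wrong f'g'

print(f"product rule error:  {prod_err:.2e}")
print(f"quotient rule error: {quot_err:.2e}")
print(f"naive f'g' = {naive_guess:.4f} vs true {product_derivative(x0):.4f}")
```

The rule-based values match the numerical slope to many digits, while the naive guess $f'g'$ is visibly off.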

Two special functions round out the toolkit. The exponential $e^x$ is the unique function with value $1$ at $x = 0$ that equals its own derivative (any other function with that property is a constant multiple of it). Its rate of growth is its current value, which is why it models everything that compounds. And the natural log, its inverse, has derivative $\dfrac{d}{dx}\ln x = \dfrac1x$: steep near zero, flattening as $x$ grows, defined only for $x > 0$. These two show up constantly in ML because the softmax, the sigmoid, and the cross-entropy loss are all built from $e^x$ and $\ln$.
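A quick numerical look at both functions, as a sketch: the slope of $e^x$ should track $e^x$ itself, and the slope of $\ln x$ should track $1/x$. The sample points and the helper `central_difference` are illustrative assumptions.

```python
import math


def central_difference(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)


for x in (0.5, 1.0, 4.0):
    d_exp = central_difference(math.exp, x)
    d_log = central_difference(math.log, x)
    print(
        f"x = {x}: slope of e^x ~ {d_exp:.6f} (e^x = {math.exp(x):.6f}), "
        f"slope of ln x ~ {d_log:.6f} (1/x = {1 / x:.6f})"
    )
```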

Higher derivatives

The derivative $f'$ is itself a function, so you can differentiate it again. The second derivative $f''(x) = \dfrac{d^2 f}{dx^2}$ is the rate of change of the slope, which measures curvature. For $f(x) = x^3$: $f'(x) = 3x^2$, $f''(x) = 6x$, $f'''(x) = 6$. In optimization, at a point where $f' = 0$, the sign of $f''$ distinguishes a valley ($f'' > 0$, convex, cupping upward) from a peak ($f'' < 0$), and the whole matrix of second derivatives, the Hessian, governs how fast second-order methods converge.
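The second derivative can also be estimated from function values alone. The sketch below uses the standard second-order central formula $\big(f(x+h) - 2f(x) + f(x-h)\big)/h^2$, which is not derived in this chapter; the step `h=1e-4` and the function names are assumptions made for illustration.

```python
def second_difference(f, x, h=1e-4):
    """Central estimate of f''(x) from three function values."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2


def cube(x):
    return x**3


def valley(x):
    return x * x  # minimum at x = 0


def peak(x):
    return -x * x  # maximum at x = 0


print(f"f''(2) for x^3 ~ {second_difference(cube, 2.0):.4f} (analytic 6x = 12)")
print(f"valley at 0: f'' ~ {second_difference(valley, 0.0):.4f} (positive)")
print(f"peak at 0:   f'' ~ {second_difference(peak, 0.0):.4f} (negative)")
```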

ML use case: sensitivity, and gradient checking

Here is the payoff. A loss $L(w)$ is a function of a parameter $w$. Its derivative $\dfrac{dL}{dw}$ is the sensitivity of the loss to that parameter: how much $L$ moves per unit change in $w$. If $\dfrac{dL}{dw} > 0$, nudging $w$ up increases the loss, so you want to move $w$ down; if it is negative, move up. The update

$$w \leftarrow w - \eta\,\frac{dL}{dw}$$

is one step of gradient descent: step against the derivative, scaled by a learning rate $\eta$. Every parameter gets nudged in the direction that its own derivative says lowers the loss. In many dimensions the collection of these per-parameter derivatives is the gradient (sensitivity generalized to a vector), but the atom is exactly the single-variable derivative of this chapter.
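The update rule is short enough to run directly. This is a minimal sketch on a hypothetical one-parameter loss $L(w) = (w - 3)^2$; the loss, the learning rate `eta=0.1`, and the step count are assumptions chosen only to make the behaviour easy to see.

```python
def loss(w):
    # Hypothetical one-parameter loss with its minimum at w = 3.
    return (w - 3.0) ** 2


def dloss_dw(w):
    # Analytic derivative of the loss above.
    return 2.0 * (w - 3.0)


def gradient_descent(w, eta=0.1, steps=50):
    for _ in range(steps):
        w = w - eta * dloss_dw(w)  # step against the derivative
    return w


w_start = 0.0
w_final = gradient_descent(w_start)
print(f"start: w = {w_start}, loss = {loss(w_start):.4f}")
print(f"end:   w = {w_final:.6f}, loss = {loss(w_final):.2e}")
```

Starting below the minimum, the derivative is negative, so each step moves $w$ up; the distance to $w = 3$ shrinks by a constant factor per step for this particular loss.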

The second payoff is practical. When you hand-code a derivative (a backprop implementation, a custom layer), how do you know it is correct? You compare it against a numerical derivative computed straight from the definition — a technique called gradient checking. If the analytic value and the numerical value agree to several digits, your math is almost certainly right; if they diverge, you have a bug. For that we need to compute derivatives numerically, and to do it well.
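A gradient check can be sketched in a few lines. Everything here is an illustrative assumption: the test function $f(x) = x e^{-x}$, the deliberately buggy derivative, the relative-error measure, and the tolerance `1e-6`.

```python
import math


def numerical_derivative(f, x, h=1e-5):
    # Central difference, straight from the definition with a small h.
    return (f(x + h) - f(x - h)) / (2 * h)


def relative_error(a, b):
    # Guard the denominator so two values near zero do not divide by zero.
    return abs(a - b) / max(abs(a), abs(b), 1e-12)


def f(x):
    return x * math.exp(-x)  # hypothetical function under test


def correct_derivative(x):
    return (1 - x) * math.exp(-x)


def buggy_derivative(x):
    return (1 + x) * math.exp(-x)  # sign error, planted on purpose


def gradient_check(analytic, func, x, tol=1e-6):
    return relative_error(analytic(x), numerical_derivative(func, x)) < tol


print("correct derivative passes:", gradient_check(correct_derivative, f, 0.5))
print("buggy derivative passes:  ", gradient_check(buggy_derivative, f, 0.5))
```

Relative error is used so the same tolerance works for large and small derivatives; it becomes unreliable where the true derivative is close to zero, so pick check points away from such spots.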

Computing derivatives numerically

The naive approach evaluates the difference quotient at a small but nonzero $h$. Two choices:

  • Forward difference: $\dfrac{f(x+h) - f(x)}{h}$. Accuracy $O(h)$: halving $h$ roughly halves the error.
  • Central difference: $\dfrac{f(x+h) - f(x-h)}{2h}$. Accuracy $O(h^2)$: halving $h$ cuts the error by about a factor of four.
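The two accuracy orders show up directly when $h$ is halved. The sketch below assumes $\sin$ as the test function at $x = 1$ (its derivative is $\cos$) and uses step sizes large enough that round-off is negligible; all names are illustrative.

```python
import math


def forward_difference(f, x, h):
    return (f(x + h) - f(x)) / h


def central_difference(f, x, h):
    return (f(x + h) - f(x - h)) / (2 * h)


x0 = 1.0
exact = math.cos(x0)  # analytic derivative of sin at x0


def errors(h):
    fwd = abs(forward_difference(math.sin, x0, h) - exact)
    ctr = abs(central_difference(math.sin, x0, h) - exact)
    return fwd, ctr


fwd_big, ctr_big = errors(0.1)
fwd_small, ctr_small = errors(0.05)

print(f"forward: error ratio after halving h = {fwd_big / fwd_small:.2f}")
print(f"central: error ratio after halving h = {ctr_big / ctr_small:.2f}")
```

A ratio near 2 for the forward difference and near 4 for the central difference is the practical meaning of $O(h)$ versus $O(h^2)$.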

You might think shrinking $h$ toward zero always improves the estimate. It does not, because of a second, competing error. There are two sources:

  • Truncation error from stopping the limit at a finite $h$. It shrinks as $h \to 0$ (like $h^2$ for the central difference).
  • Round-off error from floating point. Computing $f(x+h) - f(x-h)$ subtracts two nearly equal numbers; the leading digits cancel and you keep only the noisy trailing bits. Dividing that noise by the tiny $2h$ amplifies it. This error grows as $h \to 0$.

The total error is a U-shaped curve: too large an $h$ and truncation dominates, too small and round-off explodes. The sweet spot for the central difference sits near $h \approx \sqrt[3]{\varepsilon} \approx 10^{-5}$ in double precision (where $\varepsilon \approx 2.2\times10^{-16}$ is machine epsilon). The code below computes the analytic derivative, compares it to the central difference, and sweeps $h$ to reveal the U.

central_difference.py
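A minimal sketch of such a script follows. It assumes $\sin$ as the test function at $x = 1$, with $\cos$ as its analytic derivative, and sweeps $h$ from $10^{-1}$ down to $10^{-12}$; the function choice, the evaluation point, and the sweep range are illustrative assumptions rather than details taken from the text.

```python
import math


def central_difference(f, x, h):
    """Central-difference estimate of f'(x) with step h."""
    return (f(x + h) - f(x - h)) / (2 * h)


# Assumption: sin is an illustrative test function; its derivative is cos.
x0 = 1.0
analytic = math.cos(x0)

steps = [10.0 ** -k for k in range(1, 13)]  # h = 1e-1 down to 1e-12
errors = [abs(central_difference(math.sin, x0, h) - analytic) for h in steps]

for h, err in zip(steps, errors):
    print(f"h = {h:.0e}   abs error = {err:.3e}")

best_h, best_err = min(zip(steps, errors), key=lambda pair: pair[1])
print(f"smallest error {best_err:.3e} at h = {best_h:.0e}")
```

The exact digits of the small-$h$ errors depend on the platform's floating-point rounding, so treat the printed values as indicative of the shape of the curve, not as fixed constants.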

The sweep prints an error that drops, bottoms out around $h \approx 10^{-5}$ to $10^{-6}$, and then climbs again as round-off takes over: the U-shaped tradeoff made concrete. This is exactly the numerical derivative you would use to gradient-check a hand-written backprop pass.

Summary

  • The derivative is $f'(x) = \lim_{h\to0}\dfrac{f(x+h)-f(x)}{h}$: the limit of the difference quotient, equal to the tangent slope, the instantaneous rate of change, and (the ML reading) the sensitivity of the output to the input.
  • From first principles, $\dfrac{d}{dx}x^2 = 2x$; the general power rule is $\dfrac{d}{dx}x^n = n x^{n-1}$, and with the sum and constant-multiple rules a polynomial differentiates term by term.
  • Product: $(fg)' = f'g + fg'$. Quotient: $\left(\tfrac{f}{g}\right)' = \dfrac{f'g - fg'}{g^2}$. Exponential: $\dfrac{d}{dx}e^x = e^x$. Log: $\dfrac{d}{dx}\ln x = \tfrac1x$.
  • Higher derivatives iterate the operation; $f''$ is curvature, and at a point where $f' = 0$ its sign marks valleys ($>0$) versus peaks ($<0$).
  • In ML the derivative of a loss $\dfrac{dL}{dw}$ says which way to nudge a parameter: $w \leftarrow w - \eta\,\dfrac{dL}{dw}$. Sensitivity generalizes to the gradient.
  • The central difference $\dfrac{f(x+h)-f(x-h)}{2h}$ is $O(h^2)$ accurate and beats the forward difference. Choosing $h$ trades truncation error (which shrinks as $h$ shrinks) against round-off (which grows as $h\to0$); the error is U-shaped with a sweet spot near $h\approx10^{-5}$. Central differences are the tool for gradient checking hand-written derivatives.

Active recall

Answer from memory before checking the lesson:

  1. Write the definition of $f'(x)$ as a limit, and give the three readings of the number it produces (slope, rate, sensitivity).
  2. Differentiate $x^2$ from first principles. Where in the algebra does the limit become safe to evaluate?
  3. State the product and quotient rules. Why is $(fg)'$ not equal to $f'g'$?
  4. What are $\dfrac{d}{dx}e^x$ and $\dfrac{d}{dx}\ln x$?
  5. Why does the central difference beat the forward difference for the same $h$, and why does making $h$ extremely small hurt accuracy?

Exercises

Level A: Recall & basic calculation

Level B: Conceptual understanding

Level C: Derivation & implementation

Level D: Research-thinking challenge