Derivatives

Why a single number tells you where to move

Training a model is a search. You have a loss — one number that says how wrong the model currently is — and millions of parameters you are allowed to nudge. The whole enterprise rests on one question, asked over and over: if I change this one parameter a little, does the loss go up or down, and by how much? The answer to that question, for one parameter at a time, is a derivative.

The derivative is the most important object in this course. Gradient descent is just "compute derivatives, step downhill." Backpropagation is just "compute a great many derivatives efficiently." Before any of that, we need the derivative itself, seen — like the vector before it — several ways at once:

Geometrically, as the slope of the tangent line to a curve at a point.
Physically, as an instantaneous rate of change.
For ML, as sensitivity: how much the output moves per unit of input.

Those are three descriptions of the same limit. This chapter builds it from first principles, states the handful of rules you will reuse forever, and then computes derivatives numerically — the trick behind gradient checking.

Intuition: slope, and then sensitivity

Take a smooth curve $y = f(x)$ and a point on it. Zoom in. Keep zooming. A smooth curve, magnified enough, becomes indistinguishable from a straight line — its tangent. The slope of that line is the derivative $f'(x)$ at the point. A steep tangent means the output is changing quickly; a flat tangent ( $f'(x) = 0$ ) means the output is, for an instant, not changing at all — the signature of a peak, a valley, or a plateau.

Slope is the geometric picture. The one that will matter most to you is sensitivity. Read $f'(x)$ as an exchange rate: a one-unit change in the input buys approximately $f'(x)$ units of change in the output. If $f'(3) = 5$ , then near $x = 3$ the function amplifies wiggles by a factor of five — push $x$ up by a tiny $\delta$ and $f$ rises by about $5\delta$ . If $f'(3) = -0.2$ , the output barely moves, and moves the opposite way.
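The exchange-rate reading can be checked with plain arithmetic. The sketch below assumes a hypothetical function $f(x) = x^2 - x$, chosen only because its slope at $x = 3$ happens to be $5$ (the rules later in this chapter confirm that value); neither the function nor the step size comes from the lesson.

```python
# Hypothetical example function (an assumption for illustration):
# f(x) = x^2 - x, whose slope at x = 3 is 5.
def f(x):
    return x**2 - x

slope_at_3 = 5.0   # assumed value of f'(3) for this particular f
delta = 0.01       # a small nudge to the input

actual_change = f(3.0 + delta) - f(3.0)   # what the function really does
predicted_change = slope_at_3 * delta     # the "exchange rate" estimate

print("actual change:   ", actual_change)
print("predicted change:", predicted_change)
```

The two numbers agree to about three decimal places, and the small gap shrinks further as `delta` shrinks. That is the sense in which the derivative is an approximate exchange rate.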

[Interactive lab: Derivative Visualizer]

Drag the point along the curve above. Watch the tangent line pivot: where the curve climbs steeply the slope readout is large, where it flattens the slope passes through zero, and on downhill stretches it goes negative. The secant line through two nearby points is drawn too — notice it snaps onto the tangent as you shrink the gap. That shrinking gap is the entire definition, which we now write down.

The derivative from first principles

The tangent is the limit of secant lines. Pick a point $x$ and a nearby point $x + h$ . The straight line through $(x, f(x))$ and $(x+h, f(x+h))$ has slope

$$\frac{\Delta y}{\Delta x} = \frac{f(x+h) - f(x)}{h} \tag{14.1}$$

This is the difference quotient, the average rate of change over a step of size $h$ . Now let the step shrink. If the slope settles on a single finite value as $h \to 0$ , that value is the derivative: $f'(x) = \lim_{h \to 0} \dfrac{f(x+h) - f(x)}{h}$ .

The limit is doing real work: it is the difference between an average rate over a finite step and an instantaneous rate at a single point. Not every function has one everywhere — $\lvert x\rvert$ has no single tangent slope at $0$ (the kink), and ReLU inherits exactly that non-differentiable corner. Where the limit exists, though, it is unique, and it is what we compute.
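A short numerical sketch makes the kink concrete. The helper name `difference_quotient` and the step sizes are illustrative choices, not part of the lesson; the code evaluates the secant slope of $\lvert x\rvert$ at $0$ from the right and from the left, and contrasts it with a smooth function.

```python
def difference_quotient(f, x, h):
    """Slope of the secant line through (x, f(x)) and (x + h, f(x + h))."""
    return (f(x + h) - f(x)) / h

# |x| at 0: the one-sided slopes never agree, however small the step.
for h in [1e-1, 1e-3, 1e-6]:
    right = difference_quotient(abs, 0.0, h)    # step to the right
    left = difference_quotient(abs, 0.0, -h)    # step to the left
    print("h =", h, "right slope =", right, "left slope =", left)

# Contrast: for the smooth function x^2, both one-sided slopes head to 0.
def square(x):
    return x * x

smooth_right = difference_quotient(square, 0.0, 1e-6)
smooth_left = difference_quotient(square, 0.0, -1e-6)
print("x^2 at 0:", smooth_right, smooth_left)
```

For $\lvert x\rvert$ the right-hand slope stays at $+1$ and the left-hand slope at $-1$, so no single limit exists at $0$. For $x^2$ both sides shrink toward the same value.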

Doing it once by hand: the slope of $x^2$

Nothing beats grinding the limit out once. Let $f(x) = x^2$ . Form the difference quotient and simplify before taking the limit — the whole game is cancelling the $h$ in the denominator so that setting $h = 0$ is safe.

Worked Example — differentiating x² from first principles

Start from the definition at a general point $x$ :

$$\frac{f(x+h) - f(x)}{h} = \frac{(x+h)^2 - x^2}{h}.$$

Expand the square: $(x+h)^2 = x^2 + 2xh + h^2$ , so the numerator is $x^2 + 2xh + h^2 - x^2 = 2xh + h^2$ . Therefore

$$\frac{2xh + h^2}{h} = \frac{h(2x + h)}{h} = 2x + h \qquad (h \neq 0).$$

Now the limit is trivial, since no division by zero remains:

$$f'(x) = \lim_{h \to 0} (2x + h) = 2x.$$

So $\dfrac{d}{dx}x^2 = 2x$ . At $x = 3$ the slope is $6$ : near there, a one-unit nudge in $x$ changes $x^2$ by about six units.
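The algebra can be sanity-checked numerically. This is a minimal sketch with arbitrarily chosen step sizes: it evaluates the raw difference quotient of $x^2$ at $x = 3$ and compares it with the simplified form $2x + h$.

```python
x = 3.0
rows = []
for h in [1.0, 0.1, 0.01, 0.001]:
    quotient = ((x + h)**2 - x**2) / h   # raw secant slope
    simplified = 2*x + h                 # the same slope after cancelling h
    rows.append((h, quotient, simplified))
    print("h =", h, "quotient =", quotient, "2x + h =", simplified)
```

Both columns match up to floating-point rounding, and both approach $6$ as $h$ shrinks.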

The pattern in that cancellation — expand, subtract, factor out $h$ , then let $h$ vanish — is how every rule below is proved. You will almost never do it by hand again, because the rules cache the results. But the machinery underneath is always this limit.

The power rule, and the rule table

Run the same argument on $f(x) = x^n$ . The binomial expansion gives $(x+h)^n = x^n + n x^{n-1} h + (\text{terms with } h^2 \text{ and higher})$ . Subtract $x^n$ , divide by $h$ :

$$\frac{(x+h)^n - x^n}{h} = n\,x^{n-1} + (\text{terms that still carry a factor of } h) \tag{14.2}$$

Every surviving term except the first still has an $h$ in it, so it dies in the limit, leaving the power rule:

$$\frac{d}{dx}\,x^n = n\,x^{n-1} \tag{14.3}$$

The binomial argument covers positive integer $n$ , but the rule holds for every real exponent wherever $x^n$ is differentiable: $\sqrt{x} = x^{1/2}$ differentiates to $\tfrac12 x^{-1/2}$ , and $1/x = x^{-1}$ to $-x^{-2}$ . Two structural rules let you take derivatives term by term:

Constant multiple: $\dfrac{d}{dx}\big[c\,f(x)\big] = c\,f'(x)$ . A constant factor multiplying the function passes straight through to the derivative.
Sum: $\dfrac{d}{dx}\big[f(x) + g(x)\big] = f'(x) + g'(x)$ . Differentiate each piece and add.

Together they mean a polynomial differentiates one term at a time. For example $\dfrac{d}{dx}\big(3x^4 - 5x^2 + 7\big) = 12x^3 - 10x$ : the power rule on each term, the constant $7$ contributing $0$ because a flat line has zero slope.
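Term-by-term differentiation is mechanical enough to write in a few lines. The sketch below assumes a polynomial stored as a list of coefficients, lowest power first; the function names are illustrative, not from any library.

```python
def differentiate_polynomial(coeffs):
    """coeffs[k] is the coefficient of x^k.

    Power rule plus sum and constant-multiple rules:
    c * x^k becomes (k * c) * x^(k-1), and the constant term drops out.
    """
    return [k * c for k, c in enumerate(coeffs)][1:]

def evaluate(coeffs, x):
    """Evaluate the polynomial with the given coefficients at x."""
    return sum(c * x**k for k, c in enumerate(coeffs))

f_coeffs = [7, 0, -5, 0, 3]                 # 3x^4 - 5x^2 + 7
df_coeffs = differentiate_polynomial(f_coeffs)

print(df_coeffs)               # [0, -10, 0, 12], i.e. 12x^3 - 10x
print(evaluate(df_coeffs, 2))  # 12*8 - 10*2 = 76
```

The constant $7$ disappears because it is multiplied by its exponent, $0$, before the list is shifted down by one power.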

The rules you will actually reuse

Four more rules cover almost everything you will differentiate this course. The two that trip people up are the product and quotient rules — they are not what naive guessing suggests (more on that in the warning below).

Symbol	Meaning	Type	Shape	Role
$\frac{d}{dx}\,x^n = n\,x^{n-1}$	Power rule (any real n)	rule	—	core
$(f+g)' = f' + g'$	Sum rule — differentiate termwise	rule	—	core
$(c\,f)' = c\,f'$	Constant multiple passes through	rule	—	core
$(fg)' = f'g + f\,g'$	Product rule	rule	—	core
$\left(\frac{f}{g}\right)' = \frac{f'g - f\,g'}{g^2}$	Quotient rule (g ≠ 0)	rule	—	core
$\frac{d}{dx}\,e^x = e^x$	The exponential is its own derivative	rule	—	core
$\frac{d}{dx}\,\ln x = \frac{1}{x}$	Natural log, for x > 0	rule	—	core

The product rule deserves its shape. Over a small step both factors change, and the area $f\cdot g$ grows on two sides: $g$ 's growth scaled by the current $f$ , plus $f$ 's growth scaled by the current $g$ . That is exactly $f'g + f g'$ . Example: for $h(x) = x^2 e^x$ , take $f = x^2$ ( $f' = 2x$ ) and $g = e^x$ ( $g' = e^x$ ), giving $h'(x) = 2x\,e^x + x^2 e^x = (2x + x^2)e^x$ .

The quotient rule follows from the product rule applied to $f\cdot(1/g)$ . Example: $\dfrac{d}{dx}\dfrac{x}{x+1} = \dfrac{(1)(x+1) - (x)(1)}{(x+1)^2} = \dfrac{1}{(x+1)^2}$ .
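Both worked examples can be checked against a numerical slope. The sketch below is an illustration under assumptions: the helper `central_difference`, the step size, and the test point $x = 0.5$ are chosen here, not taken from the lesson.

```python
import math

def central_difference(f, x, h=1e-5):
    """Numerical slope from two nearby function values."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

x = 0.5

# Product example: x^2 * e^x
def product(t):
    return t**2 * math.exp(t)

product_rule = 2*x*math.exp(x) + x**2*math.exp(x)   # f'g + f g'
naive_guess = 2*x * math.exp(x)                     # f' g', the tempting wrong answer
product_numeric = central_difference(product, x)

# Quotient example: x / (x + 1)
def quotient(t):
    return t / (t + 1.0)

quotient_rule = (1.0*(x + 1.0) - x*1.0) / (x + 1.0)**2   # (f'g - f g') / g^2
quotient_numeric = central_difference(quotient, x)

print("product: rule =", product_rule, "numeric =", product_numeric, "naive =", naive_guess)
print("quotient: rule =", quotient_rule, "numeric =", quotient_numeric)
```

The rule values match the numerical slopes to many digits, while the naive guess $f'g'$ is visibly off.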

Two special functions round out the toolkit. Up to a constant factor, the exponential $e^x$ is the only function that equals its own derivative: its rate of growth is its current value, which is why it models everything that compounds. And the natural log, its inverse, has derivative $\dfrac{d}{dx}\ln x = \dfrac1x$ : steep near zero, flattening as $x$ grows, defined only for $x > 0$ . These two show up constantly in ML because the softmax, the sigmoid, and the cross-entropy loss are all built from $e^x$ and $\ln$ .

Higher derivatives

The derivative $f'$ is itself a function, so you can differentiate it again. The second derivative $f''(x) = \dfrac{d^2 f}{dx^2}$ is the rate of change of the slope, that is, the curvature. For $f(x) = x^3$ : $f'(x) = 3x^2$ , $f''(x) = 6x$ , $f'''(x) = 6$ . In optimization, at a point where the slope is zero, the sign of $f''$ distinguishes a valley ( $f'' > 0$ , convex, cupping upward) from a peak ( $f'' < 0$ ), and the whole matrix of second derivatives (the Hessian) governs how fast second-order methods converge.
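The second derivative can also be estimated from function values alone. The sketch below uses the standard central second difference, which the lesson itself does not introduce; the function names, step size, and test points are assumptions for illustration.

```python
def second_difference(f, x, h=1e-3):
    """Central second difference: (f(x+h) - 2 f(x) + f(x-h)) / h^2."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

def cube(x):
    return x**3

def upside_down_bowl(x):
    return -x**2

x0 = 1.5
curvature = second_difference(cube, x0)
print("x^3 at 1.5: numeric f'' =", curvature, "analytic 6x =", 6 * x0)

# A peak: the slope of -x^2 is zero at 0 and the curvature is negative.
peak_curvature = second_difference(upside_down_bowl, 0.0)
print("-x^2 at 0: numeric f'' =", peak_curvature)
```

A positive result signals a curve cupping upward at that point and a negative result signals one cupping downward, which matches the sign test described above.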

ML use case: sensitivity, and gradient checking

Here is the payoff. A loss $L(w)$ is a function of a parameter $w$ . Its derivative $\dfrac{dL}{dw}$ is the sensitivity of the loss to that parameter: how much $L$ moves per unit change in $w$ . If $\dfrac{dL}{dw} > 0$ , nudging $w$ up increases the loss, so you want to move $w$ down; if it is negative, move up. The update

$$w \leftarrow w - \eta\,\frac{dL}{dw} \tag{14.4}$$

is one step of gradient descent: step against the derivative, scaled by a learning rate $\eta$ . Every parameter gets nudged in the direction that its own derivative says lowers the loss. In many dimensions the collection of these per-parameter derivatives is the gradient — sensitivity generalized to a vector — but the atom is exactly the single-variable derivative of this chapter.
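To see the update rule move a parameter, here is a minimal sketch on a toy loss. The loss $L(w) = (w - 3)^2$, the starting point, the learning rate, and the step count are all assumptions picked for illustration; its derivative $2(w - 3)$ follows from the power, sum, and constant-multiple rules.

```python
def loss(w):
    return (w - 3.0) ** 2        # toy loss with its minimum at w = 3

def dloss_dw(w):
    return 2.0 * (w - 3.0)       # derivative of the toy loss

w = 0.0      # hypothetical starting value
eta = 0.1    # hypothetical learning rate

for step in range(50):
    w = w - eta * dloss_dw(w)    # step against the derivative

print("final w:", w)
print("final loss:", loss(w))
```

At $w = 0$ the derivative is negative, so the update pushes $w$ up. Each step shrinks the distance to $3$ by a constant factor, and after 50 steps $w$ sits very close to the minimum.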

The second payoff is practical. When you hand-code a derivative (a backprop implementation, a custom layer), how do you know it is correct? You compare it against a numerical derivative computed straight from the definition — a technique called gradient checking. If the analytic value and the numerical value agree to several digits, your math is almost certainly right; if they diverge, you have a bug. For that we need to compute derivatives numerically, and to do it well.

Computing derivatives numerically

The naive approach evaluates the difference quotient at a small but nonzero $h$ . Two choices:

Forward difference: $\dfrac{f(x+h) - f(x)}{h}$ . Accuracy $O(h)$ : halving $h$ roughly halves the error.
Central difference: $\dfrac{f(x+h) - f(x-h)}{2h}$ . Accuracy $O(h^2)$ : halving $h$ cuts the error by a factor of about four.
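The two accuracy orders can be observed directly by halving the step and watching the error ratio. This is a sketch under assumptions: the test function $e^x$, the point $x = 1.5$, and the step $h = 10^{-2}$ (large enough that round-off is negligible) are illustrative choices.

```python
import math

def forward_difference(f, x, h):
    return (f(x + h) - f(x)) / h

def central_difference(f, x, h):
    return (f(x + h) - f(x - h)) / (2.0 * h)

x0 = 1.5
exact = math.exp(x0)   # e^x is its own derivative
h = 1e-2

fwd_err = abs(forward_difference(math.exp, x0, h) - exact)
fwd_err_half = abs(forward_difference(math.exp, x0, h / 2) - exact)
ctr_err = abs(central_difference(math.exp, x0, h) - exact)
ctr_err_half = abs(central_difference(math.exp, x0, h / 2) - exact)

fwd_ratio = fwd_err / fwd_err_half   # expected near 2 for an O(h) method
ctr_ratio = ctr_err / ctr_err_half   # expected near 4 for an O(h^2) method

print("forward: error =", fwd_err, "ratio after halving h =", fwd_ratio)
print("central: error =", ctr_err, "ratio after halving h =", ctr_ratio)
```

The central difference is already far more accurate at the same $h$, and its error falls about twice as fast on each halving.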

You might think shrinking $h$ toward zero always improves the estimate. It does not, because of a second, competing error. There are two sources:

Truncation error from stopping the limit at a finite $h$ . It shrinks as $h \to 0$ (like $h^2$ for the central difference).
Round-off error from floating point. Computing $f(x+h) - f(x-h)$ subtracts two nearly equal numbers; the leading digits cancel and you keep only the noisy trailing bits. Dividing that noise by the tiny $2h$ amplifies it. This error grows as $h \to 0$ .

The total error is a U-shaped curve: too large an $h$ and truncation dominates, too small and round-off explodes. The sweet spot for the central difference sits near $h \approx \sqrt[3]{\varepsilon} \approx 6\times10^{-6}$ , that is, on the order of $10^{-5}$ , in double precision (where $\varepsilon \approx 2.2\times10^{-16}$ is machine epsilon). The code below computes the analytic derivative, compares it to the central difference, and sweeps $h$ to reveal the U.

central_difference.py

import numpy as np

# Central difference: f'(x) ~ (f(x+h) - f(x-h)) / (2h), O(h^2) accurate.
def numerical_derivative(f, x, h=1e-5):
  return (f(x + h) - f(x - h)) / (2.0 * h)

# A few functions paired with their known analytic derivatives.
cases = [
  ("x^2",  lambda x: x**2,      lambda x: 2*x),
  ("x^3",  lambda x: x**3,      lambda x: 3*x**2),
  ("e^x",  lambda x: np.exp(x), lambda x: np.exp(x)),
  ("ln x", lambda x: np.log(x), lambda x: 1.0/x),
]

x0 = 1.5
print("central difference vs analytic at x =", x0)
for name, f, df in cases:
  approx = numerical_derivative(f, x0)
  exact = df(x0)
  print("  ", name, "numeric=", round(approx, 8), "analytic=", round(exact, 8))
  # Gradient check: the two must agree to a tight tolerance.
  assert np.isclose(approx, exact, atol=1e-6), name

# Sweep h to expose the truncation-vs-roundoff tradeoff on f(x) = e^x.
# Error falls as h shrinks (truncation ~ h^2), then rises (round-off).
f, df = np.exp, np.exp
errors = []
hs = [10.0**(-k) for k in range(1, 13)]  # 1e-1 down to 1e-12
for h in hs:
  err = abs(numerical_derivative(f, x0, h) - df(x0))
  errors.append(err)
  print("  h=", h, "error=", err)

best = int(np.argmin(errors))
print("min error near h =", hs[best])
# The minimum sits in the interior, not at the smallest h - proof of the U-shape.
assert 0 < best < len(hs) - 1, "error should bottom out at an interior h"
print("ok")

The sweep prints an error that drops, bottoms out around $h \approx 10^{-5}$ to $10^{-6}$ , and then climbs again as round-off takes over — the U-shaped tradeoff made concrete. This is exactly the numerical derivative you would use to gradient-check a hand-written backprop pass.
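A gradient check built on that numerical derivative might look like the sketch below. Everything specific here is an assumption for illustration: the toy loss $L(w) = \ln w + w^3$, the deliberately buggy derivative, and the pass threshold of $10^{-7}$. The relative-error formula is the one quoted in exercise ch14-D1.

```python
import math

def numerical_derivative(f, x, h=1e-5):
    """Central difference, as in the listing above."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

def relative_error(analytic, numeric, eps=1e-12):
    """Scale-free disagreement between two derivative estimates."""
    return abs(analytic - numeric) / (abs(analytic) + abs(numeric) + eps)

def loss(w):
    return math.log(w) + w**3            # hypothetical loss, valid for w > 0

def correct_grad(w):
    return 1.0 / w + 3.0 * w**2          # hand-derived, correct

def buggy_grad(w):
    return 1.0 / w + 3.0 * w             # bug: exponent not reduced properly

w0 = 1.5
numeric = numerical_derivative(loss, w0)
ok_err = relative_error(correct_grad(w0), numeric)
bad_err = relative_error(buggy_grad(w0), numeric)

TOLERANCE = 1e-7   # assumed threshold, not a universal constant
print("correct gradient passes:", ok_err < TOLERANCE, "rel. error =", ok_err)
print("buggy gradient passes:  ", bad_err < TOLERANCE, "rel. error =", bad_err)
```

The correct derivative agrees with the numerical one to many digits, while the buggy one is off by a large relative margin. A gradient check is meant to expose exactly that gap.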

Two traps: the product rule, and greedily shrinking h

The derivative of a product is not the product of the derivatives: $\dfrac{d}{dx}\big[f g\big] \neq f' g'$ . For $f = g = x$ the wrong rule gives $1\cdot 1 = 1$ , but $\dfrac{d}{dx}x^2 = 2x$ — off unless $x = \tfrac12$ . Always use $f'g + f g'$ . Likewise, the derivative of a quotient is not the quotient of derivatives.

The second trap is numerical: do not make $h$ as small as you can. Past the sweet spot ( $h \approx 10^{-5}$ for the central difference in double precision), subtracting two nearly equal values $f(x+h)$ and $f(x-h)$ loses significant digits to catastrophic cancellation, and dividing that noise by a tiny $2h$ blows the error up. Smaller is not always better.

Summary

The derivative is $f'(x) = \lim_{h\to0}\dfrac{f(x+h)-f(x)}{h}$ : the limit of the difference quotient, equal to the tangent slope, the instantaneous rate of change, and — the ML reading — the sensitivity of the output to the input.
From first principles, $\dfrac{d}{dx}x^2 = 2x$ ; the general power rule is $\dfrac{d}{dx}x^n = n x^{n-1}$ , and with the sum and constant-multiple rules a polynomial differentiates term by term.
Product: $(fg)' = f'g + fg'$ . Quotient: $\left(\tfrac{f}{g}\right)' = \dfrac{f'g - fg'}{g^2}$ . Exponential: $\dfrac{d}{dx}e^x = e^x$ . Log: $\dfrac{d}{dx}\ln x = \tfrac1x$ .
Higher derivatives iterate the operation; $f''$ is curvature and its sign marks valleys ( $>0$ ) versus peaks ( $<0$ ).
In ML the derivative of a loss $\dfrac{dL}{dw}$ says which way to nudge a parameter: $w \leftarrow w - \eta\,\dfrac{dL}{dw}$ . Sensitivity generalizes to the gradient.
The central difference $\dfrac{f(x+h)-f(x-h)}{2h}$ is $O(h^2)$ accurate and beats the forward difference. Choosing $h$ trades truncation error (falls with $h$ ) against round-off (grows as $h\to0$ ); the error is U-shaped with a sweet spot near $h\approx10^{-5}$ . Central differences are the tool for gradient checking hand-written derivatives.

Active recall

Answer from memory before checking the lesson:

Write the definition of $f'(x)$ as a limit, and give the three readings of the number it produces (slope, rate, sensitivity).
Differentiate $x^2$ from first principles. Where in the algebra does the limit become safe to evaluate?
State the product and quotient rules. Why is $(fg)'$ not equal to $f'g'$ ?
What are $\dfrac{d}{dx}e^x$ and $\dfrac{d}{dx}\ln x$ ?
Why does the central difference beat the forward difference for the same $h$ , and why does making $h$ extremely small hurt accuracy?

Exercises

Level A: Recall & basic calculation

Level A · Hand calculation · ch14-A1

Differentiate a polynomial, evaluate at a point

Let $f(x) = 3x^4 - 5x^2 + 7$ . Using the power, sum, and constant-multiple rules, compute $f'(1)$ .

Level A · Hand calculation · ch14-A2

Power rule at a point

Let $f(x) = x^5$ . Compute $f'(2)$ .

Level A · Hand calculation · ch14-A3

Derivative of the exponential

Let $f(x) = e^x$ . Compute $f'(0)$ . (Recall $e^0 = 1$ .)

Level A · Hand calculation · ch14-A4

Derivative of the natural log

Let $f(x) = \ln x$ . Compute $f'(2)$ .

Level A · Hand calculation · ch14-A5

Constant multiple rule

Let $f(x) = 4x^3$ . Compute $f'(2)$ .

Level A · Equation interpretation · ch14-A6

Derivative of a constant

What is $\frac{d}{dx}(7)$ — the derivative of the constant function $f(x) = 7$ ?

Level A · Hand calculation · ch14-A7

Power rule with a fractional exponent

Let $f(x) = \sqrt{x} = x^{1/2}$ . Compute $f'(4)$ .

Level B: Conceptual understanding

Level B · Hand calculation · ch14-B1

Apply the product rule

Let $h(x) = x^2 e^x$ . Using the product rule, compute $h'(1)$ . (Use $e \approx 2.71828$ .)

Level B · Hand calculation · ch14-B2

Apply the quotient rule

Let $h(x) = \dfrac{x}{x+1}$ . Using the quotient rule, compute $h'(1)$ .

Level B · ML application · ch14-B3

Sensitivity and the descent direction

A loss $L(w)$ has derivative $\frac{dL}{dw}\big|_{w=2} = +3$ . To decrease the loss, should you increase or decrease $w$ , and roughly how much does $L$ change if you nudge $w$ by $-0.01$ ?

Level B · Hand calculation · ch14-B4

A second derivative

Let $f(x) = x^3$ . Compute the second derivative $f''(2)$ , and state whether the curve is locally cupping upward or downward there.

Level B · Equation interpretation · ch14-B5

Why central beats forward

The forward difference $\frac{f(x+h) - f(x)}{h}$ has error $O(h)$ ; the central difference $\frac{f(x+h) - f(x-h)}{2h}$ has error $O(h^2)$ . For a small $h$ , which is more accurate, and by roughly what factor does the central-difference error shrink when you halve $h$ ?

Level C: Derivation & implementation

Level C · Derivation · ch14-C1

Differentiate x³ from first principles

Using only the limit definition $f'(x) = \lim_{h\to0}\frac{f(x+h)-f(x)}{h}$ , derive the derivative of $f(x) = x^3$ . Show the cancellation of $h$ before taking the limit.

Level C · NumPy implementation · ch14-C2

Implement the central-difference derivative

Implement numerical_derivative(f, x, h=1e-5) using the central difference $\frac{f(x+h)-f(x-h)}{2h}$ . Verify it against the analytic derivative for $f(x) = x^2$ (which is $2x$ ) and $f(x) = e^x$ at a couple of points using np.isclose, then print ok.

Level C · Derivation · ch14-C3

Derive the quotient rule from the product rule

Assuming the product rule $(fg)' = f'g + fg'$ and the chain-rule fact $\frac{d}{dx}\big[g(x)^{-1}\big] = -g^{-2}g'$ , derive the quotient rule $\left(\frac{f}{g}\right)' = \frac{f'g - fg'}{g^2}$ .

Level C · Numerical experiment · ch14-C4

Sweep h to find the error minimum

For $f(x) = e^x$ at $x = 1$ , compute the central-difference error $\big|\text{numeric} - e\big|$ across $h = 10^{-1}, 10^{-2}, \ldots, 10^{-12}$ . Print each error, confirm with an assert that the minimum occurs at an interior $h$ (not the smallest one), then print ok.

Level D: Research-thinking challenge

Level D · Paper-reading practice · ch14-D1

Designing a gradient check

You want to gradient-check a hand-written backprop by comparing its analytic gradient $g_a$ against a numerical gradient $g_n$ . (a) Why do practitioners compare the relative error $\frac{|g_a - g_n|}{|g_a| + |g_n| + \epsilon}$ rather than the absolute error? (b) Why use the central difference rather than the forward difference here? (c) Name one failure mode where a correct analytic gradient still fails a naive gradient check, and how to avoid it.

Level D · Paper-reading practice · ch14-D2

Why h ≈ cube-root of machine epsilon

For the central difference, the total error is roughly $E(h) \approx C_1 h^2 + \frac{C_2 \varepsilon}{h}$ , where the first term is truncation and the second is round-off ( $\varepsilon$ is machine epsilon, $\approx 2.2\times10^{-16}$ ). Minimize $E$ over $h$ to explain why the optimal step is on the order of $\varepsilon^{1/3} \approx 10^{-5}$ , and state what that makes the best achievable error.
