Introductory Partial Derivatives

Why a single derivative is not enough

Back in the one-variable world, the derivative $f'(x)$ answered one clean question: if I nudge the input, how does the output respond? That number is the engine of optimization — it tells you which way is downhill. But every model you will ever train has more than one knob. A tiny logistic-regression model has a dozen parameters; a transformer has billions. The loss $L$ is a function not of one number but of a whole vector of parameters $\boldsymbol\theta$ .

So we need a derivative that works when there are many inputs at once. The idea is almost embarrassingly simple: vary one input at a time, hold the rest still. That single move — differentiate with respect to one variable while freezing the others — is the partial derivative, and collecting all of them into one vector gives the gradient, the object that training actually computes and follows. This chapter is the bridge from "the slope of a curve" to "the direction a billion-parameter model should step."

Intuition: slope along one axis, and a compass pointing uphill

Picture a surface $z = f(x, y)$ — a landscape of hills and valleys sitting above the $xy$ -plane. You are standing at one point on it. There are now two independent ways to walk:

Walk east (increase $x$ , keep $y$ fixed). The steepness you feel is the partial derivative $\partial f / \partial x$ — the ordinary one-variable slope of the curve you get by slicing the surface along the $x$ -direction.
Walk north (increase $y$ , keep $x$ fixed). That steepness is $\partial f / \partial y$ .

Each partial is just a familiar single-variable derivative in disguise: pretend every other variable is a constant and differentiate as usual. Holding $y$ fixed turns $f(x, y)$ into a function of $x$ alone, and you already know how to differentiate that.
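The "freeze the others, then differentiate as usual" recipe can be sketched in a few lines. The function $f(x, y) = x^2 y$ and the helper names `slice_along_x` and `derivative_1d` below are illustrative assumptions, not part of the lesson; the point is only that once $y$ is frozen, an ordinary one-variable derivative routine does the job.

```python
def f(x, y):
    # Hypothetical example surface, chosen only for illustration.
    return x**2 * y

def slice_along_x(y_fixed):
    """Freeze y and return a function of x alone."""
    return lambda x: f(x, y_fixed)

def derivative_1d(g, x, h=1e-5):
    """Ordinary single-variable derivative, estimated by a central difference."""
    return (g(x + h) - g(x - h)) / (2 * h)

g = slice_along_x(3.0)           # with y frozen at 3, g(x) = 3 * x^2
df_dx = derivative_1d(g, 2.0)    # slope of the slice at x = 2, i.e. df/dx at (2, 3)

print(round(df_dx, 6))           # 12.0, matching 2 * x * y at (2, 3)
```

Nothing in `derivative_1d` knows that a second variable exists, which is the whole idea: a partial derivative is a one-variable derivative of a slice.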

Now the payoff. You have two slopes, east and north. Bundle them into a vector $\nabla f = (\partial f/\partial x,\ \partial f/\partial y)$ . Remarkably, this one vector points in the direction of steepest ascent — the single compass bearing that climbs the hill fastest from where you stand. Its length says how steep that fastest climb is. Turn it around, $-\nabla f$ , and you have the direction of steepest descent: the way training walks to reduce the loss.

Interactive Lab: Gradient-Descent Visualizer


Drag the starting point and press step. Each move goes in the direction of $-\nabla L$, straight downhill according to the local gradient. Watch how each step leaves its starting point at right angles to the contour line through it, never along it. We will see below why that perpendicularity is not a coincidence.

Formal definitions

Symbol	Meaning	Type	Shape	Role
$f$	A scalar-valued function of several variables	function	ℝⁿ→ℝ	fixed
$\frac{\partial f}{\partial x_i}$	Partial derivative: slope along axis i, others held fixed	function	ℝⁿ→ℝ	derived
$\nabla f$	Gradient: vector of all partials; points uphill	vector	n	derived
$\nabla f(\mathbf{a})$	Gradient evaluated at a point a (a concrete vector)	vector	n	variable
$\boldsymbol\theta$	The parameter vector of a model	vector	p	variable
$L$	A scalar loss to be minimized	function	ℝᵖ→ℝ	fixed

The notation $\nabla$ ("nabla" or "del") is standard. Read $\nabla f$ as "grad $f$ ." Because the gradient has one component per input, $\nabla L$ for a model with $p$ parameters is a $p$ -dimensional vector — one number per parameter, telling you how the loss responds to nudging that parameter alone.

Worked example: both partials and the gradient

Worked Example — the gradient of f(x, y) = x² + xy + y²

Take $f(x, y) = x^2 + xy + y^2$ . We compute each partial by freezing the other variable.

Partial with respect to $x$ (treat $y$ as a constant): $\frac{\partial f}{\partial x} = \underbrace{2x}_{\text{from } x^2} + \underbrace{y}_{\text{from } xy} + \underbrace{0}_{\text{from } y^2} = 2x + y.$ The term $xy$ differentiates to $y$ because, with $y$ frozen, $xy$ is just (constant) $\,\times x$ . The term $y^2$ is a pure constant here, so it vanishes.

Partial with respect to $y$ (now treat $x$ as a constant): $\frac{\partial f}{\partial y} = \underbrace{0}_{\text{from } x^2} + \underbrace{x}_{\text{from } xy} + \underbrace{2y}_{\text{from } y^2} = x + 2y.$

The gradient stacks the two: $\nabla f(x, y) = (2x + y,\ \ x + 2y).$

Evaluate at the point $(1, 2)$: $\nabla f(1, 2) = \big(2(1) + 2,\ \ 1 + 2(2)\big) = (4,\ 5).$ So at $(1, 2)$ the surface climbs fastest in the direction $(4, 5)$, at a rate of $\lVert(4,5)\rVert = \sqrt{41} \approx 6.40$ units of $f$ per unit of distance moved. To descend, that is, to reduce $f$, you would move along $-(4, 5) = (-4, -5)$.
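The same evaluation can be reproduced in NumPy as a quick sanity check. This is a minimal sketch; the names `grad_f`, `steepness`, `uphill`, and `downhill` are ours, introduced only for illustration.

```python
import numpy as np

def grad_f(x, y):
    # Hand-derived partials of f(x, y) = x^2 + x*y + y^2
    return np.array([2*x + y, x + 2*y], dtype=float)

g = grad_f(1.0, 2.0)
steepness = np.linalg.norm(g)    # length of the gradient = rate of fastest climb
uphill = g / steepness           # unit vector of steepest ascent
downhill = -uphill               # unit vector of steepest descent

print(g)                              # [4. 5.]
print(round(float(steepness), 2))     # 6.4
print(uphill, downhill)
```

Dividing by the norm separates the two pieces of information the gradient carries: which way to go (`uphill`) and how steep it is (`steepness`).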

$$f(x,y) = x^2 + xy + y^2 \quad\Longrightarrow\quad \nabla f = (2x + y,\ \ x + 2y) \tag{17.1}$$

Why the gradient points the steepest way uphill

Why should the humble list of axis-slopes happen to encode the single best direction to climb? Here is the intuition, no heavy machinery required.

Steepest ascent, and perpendicular to contours

Stand at a point and consider walking in some unit direction $\mathbf{u}$ . The rate at which $f$ increases as you step that way — the directional derivative — turns out to be the dot product $D_{\mathbf{u}} f = \nabla f \cdot \mathbf{u}.$ Intuitively: your step $\mathbf{u}$ has some amount of "east" and some "north"; the total change in $f$ is the east-component times the east-slope plus the north-component times the north-slope — exactly a dot product of $\mathbf{u}$ with $\nabla f$ .

Now recall the geometric form of the dot product, $\nabla f \cdot \mathbf{u} = \lVert \nabla f \rVert\,\lVert \mathbf{u}\rVert \cos\theta$ . With $\mathbf{u}$ a unit vector this is $\lVert \nabla f \rVert \cos\theta$ , which is largest when $\theta = 0$ — that is, when $\mathbf{u}$ points the same way as $\nabla f$ . So among all directions you could step, the gradient is the one that increases $f$ fastest. That is the whole claim.
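The argument above can be checked by brute force: try many unit directions, compute $\nabla f \cdot \mathbf{u}$ for each, and see which one wins. The sketch below assumes the gradient $(4, 5)$ of $f(x, y) = x^2 + xy + y^2$ at $(1, 2)$; the grid of 3600 angles is an arbitrary choice.

```python
import numpy as np

grad = np.array([4.0, 5.0])    # gradient of x^2 + x*y + y^2 at the point (1, 2)

# Unit vectors u = (cos t, sin t) at 0.1 degree spacing around the circle.
angles = np.linspace(0.0, 2 * np.pi, 3600, endpoint=False)
directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)

rates = directions @ grad      # directional derivative grad . u for every u

best = directions[np.argmax(rates)]
print(best, grad / np.linalg.norm(grad))    # nearly the same unit vector
print(rates.max(), np.linalg.norm(grad))    # both about 6.40
```

The best sampled direction matches the normalized gradient up to the resolution of the grid, and the best rate matches $\lVert \nabla f \rVert$, as the $\cos\theta$ argument predicts.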

Two corollaries fall out for free. Step against the gradient ($\theta = 180°$) and you descend fastest: this is gradient descent. And step along a contour line, where $f$ does not change at all, and the rate must be zero: $\nabla f \cdot \mathbf{u} = 0$, so the gradient is perpendicular to the contour through your point. That is why each descent step in the lab sets off at right angles to the level curve through its starting point.
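A small numerical experiment makes the perpendicularity concrete. The sketch below assumes the function $f(x, y) = x^2 + xy + y^2$ and its gradient $(4, 5)$ at $(1, 2)$; the step length `eps` is an arbitrary small number. It rotates the gradient by 90 degrees to get a direction along the contour, then compares how much $f$ changes along that direction versus along the gradient.

```python
import numpy as np

def f(v):
    x, y = v
    return x**2 + x*y + y**2

point = np.array([1.0, 2.0])
grad = np.array([4.0, 5.0])                    # gradient at (1, 2)

tangent = np.array([-grad[1], grad[0]])        # gradient rotated by 90 degrees
tangent = tangent / np.linalg.norm(tangent)    # unit vector along the contour
uphill = grad / np.linalg.norm(grad)           # unit vector along the gradient

eps = 1e-4
change_along_contour = f(point + eps * tangent) - f(point)
change_along_gradient = f(point + eps * uphill) - f(point)

print(grad @ tangent)            # zero up to rounding: perpendicular
print(change_along_contour)      # tiny, of order eps**2
print(change_along_gradient)     # roughly eps * 6.40
```

Along the contour direction the first-order change vanishes, leaving only a second-order wiggle; along the gradient the change is first order in `eps`.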

Reading a contour plot is now a superpower. Where contour lines bunch tightly, the function is changing fast, so $\lVert \nabla f \rVert$ is large. Where they spread out, the gradient is small — near a flat minimum the contours are far apart and $\nabla f \to \mathbf{0}$ . The gradient arrow at any point stabs straight across the contours, toward higher ground.

ML use case: the gradient of a loss over parameters

This is the reason the whole chapter exists. Training a model means choosing parameters $\boldsymbol\theta$ to make a scalar loss $L(\boldsymbol\theta)$ small. The loss is a function of possibly billions of parameters, and we want the direction that reduces it fastest. That direction is exactly the negative gradient.

$$\nabla L(\boldsymbol\theta) = \left( \frac{\partial L}{\partial \theta_1},\ \frac{\partial L}{\partial \theta_2},\ \ldots,\ \frac{\partial L}{\partial \theta_p} \right) \tag{17.2}$$

Each component $\partial L / \partial \theta_j$ answers a local question: "if I nudge this one parameter and leave the other $p-1$ alone, does the loss go up or down, and how fast?" Stack all $p$ answers and you have $\nabla L$ , a vector in the same space as the parameters. Gradient descent then updates every parameter by stepping a small amount opposite the gradient:
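To see "one partial per parameter" on something that looks like a loss, here is a minimal sketch using a mean-squared-error loss for a linear model. The toy data `X` and `y`, the choice of loss, and the names `loss` and `grad_loss` are all assumptions made for illustration; the gradient formula $\frac{2}{n} X^\top (X\boldsymbol\theta - \mathbf{y})$ is the standard one for this loss.

```python
import numpy as np

# Hypothetical toy data: 4 examples, p = 3 parameters (illustration only).
X = np.array([[1.0, 0.0, 2.0],
              [1.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [1.0, 3.0, 3.0]])
y = np.array([1.0, 2.0, 2.0, 4.0])

def loss(theta):
    residual = X @ theta - y
    return np.mean(residual**2)               # scalar mean-squared error

def grad_loss(theta):
    residual = X @ theta - y
    return 2.0 * (X.T @ residual) / len(y)    # one partial per parameter

theta = np.zeros(3)
g = grad_loss(theta)
print(g.shape)                                # (3,), the same shape as theta

# Component j: nudge theta_j alone and watch how the loss responds.
j, h = 1, 1e-6
nudge = np.zeros(3)
nudge[j] = h
numeric_partial = (loss(theta + nudge) - loss(theta - nudge)) / (2 * h)
print(g[j], numeric_partial)                  # the two should agree closely
```

The shape check is the practical takeaway: whatever the model, `grad_loss(theta)` must come back with exactly the shape of `theta`.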

$$\boldsymbol\theta \;\leftarrow\; \boldsymbol\theta \;-\; \eta\,\nabla L(\boldsymbol\theta) \tag{17.3}$$

where $\eta > 0$ is the learning rate (the step size). The minus sign is the entire point: the gradient points uphill toward larger loss, so we subtract it to go downhill. Repeat this step millions of times and the parameters slide into a valley of low loss. That is training, stripped to its skeleton.
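The update rule is short enough to run directly. The sketch below applies it to the worked-example function $f(x, y) = x^2 + xy + y^2$, standing in for a loss; the starting point, the learning rate of 0.2, and the 100 iterations are arbitrary choices for illustration, not recommendations.

```python
import numpy as np

def f(theta):
    x, y = theta
    return x**2 + x*y + y**2

def grad_f(theta):
    x, y = theta
    return np.array([2*x + y, x + 2*y])

eta = 0.2                          # learning rate (assumed value)
theta = np.array([3.0, -1.0])      # arbitrary starting point
losses = [f(theta)]

for _ in range(100):
    theta = theta - eta * grad_f(theta)    # step against the gradient
    losses.append(f(theta))

print(losses[0], losses[-1])       # the loss shrinks toward 0
print(theta)                       # close to the minimum at (0, 0)
```

For this bowl-shaped function the minimum sits at the origin, so the parameters sliding toward $(0, 0)$ is the "valley of low loss" in miniature.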

NumPy: numerical gradient vs. analytic

You can always check a hand-derived gradient by approximating each partial numerically. Nudge one coordinate by a tiny $h$, see how $f$ changes, and divide: that is the definition, done arithmetically. The central difference $\frac{\partial f}{\partial x_i} \approx \frac{f(\mathbf{x} + h\,\mathbf{e}_i) - f(\mathbf{x} - h\,\mathbf{e}_i)}{2h}$ is far more accurate than a one-sided nudge: its error shrinks like $h^2$ rather than $h$. We do it partial by partial, one coordinate at a time with the others held exactly fixed, and compare to the formula $\nabla f = (2x + y,\ x + 2y)$. Run it:

numerical_gradient.py

import numpy as np

# The function from the worked example: f(x, y) = x^2 + x*y + y^2
def f(v):
    x, y = v
    return x**2 + x*y + y**2

# Analytic gradient we derived by hand: (2x + y, x + 2y)
def grad_analytic(v):
    x, y = v
    return np.array([2*x + y, x + 2*y])

# Numerical gradient: central difference, one partial at a time.
def grad_numerical(f, v, h=1e-5):
    v = np.asarray(v, dtype=float)
    g = np.zeros_like(v)
    for i in range(v.shape[0]):
        step = np.zeros_like(v)
        step[i] = h                      # perturb ONLY coordinate i (others held fixed)
        g[i] = (f(v + step) - f(v - step)) / (2 * h)
    return g

point = np.array([1.0, 2.0])
analytic  = grad_analytic(point)
numerical = grad_numerical(f, point)

print("analytic  =", analytic)          # expect [4. 5.]
print("numerical =", np.round(numerical, 6))

# They must agree to within numerical tolerance.
assert np.allclose(analytic, numerical, atol=1e-6), "gradients disagree!"
print("ok")

The loop is the definition of the gradient made literal: it walks the axes, perturbing one coordinate while the rest stay put, and records each slope. The central difference agrees with our algebra to six decimals. In real training you would never compute gradients this way — it costs two function evaluations per parameter, hopeless for billions — but it is the gold-standard gradient check for verifying that a hand-written analytic gradient (or a backprop implementation) is correct.
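To see why the central difference is preferred over a one-sided nudge, the two can be compared side by side on the same function. This is a sketch under our own naming (`finite_difference` and its `central` flag are hypothetical helpers, and the three values of `h` are arbitrary).

```python
import numpy as np

def f(v):
    x, y = v
    return x**2 + x*y + y**2

def finite_difference(f, v, h, central):
    """Approximate the gradient one coordinate at a time."""
    v = np.asarray(v, dtype=float)
    g = np.zeros_like(v)
    for i in range(v.size):
        step = np.zeros_like(v)
        step[i] = h                                    # perturb only coordinate i
        if central:
            g[i] = (f(v + step) - f(v - step)) / (2 * h)
        else:
            g[i] = (f(v + step) - f(v)) / h            # one-sided nudge
    return g

point = np.array([1.0, 2.0])
exact = np.array([4.0, 5.0])                           # analytic gradient at (1, 2)

errors = []
for h in (1e-2, 1e-3, 1e-4):
    err_one_sided = np.abs(finite_difference(f, point, h, central=False) - exact).max()
    err_central = np.abs(finite_difference(f, point, h, central=True) - exact).max()
    errors.append((h, err_one_sided, err_central))
    print(f"h={h:g}  one-sided error={err_one_sided:.2e}  central error={err_central:.2e}")
```

For this particular quadratic the one-sided error is almost exactly $h$, while the central difference is exact up to floating-point rounding, because its truncation error depends on third derivatives and those are zero here. On a general smooth function the central error would instead shrink like $h^2$.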

Common mistakes

Three ways to misuse the gradient

1. The held variable is a constant, not zero. When differentiating $f = xy$ with respect to $x$ , you get $y$ , not $0$ . Freezing $y$ means treating it as some fixed number like $7$ — and the derivative of $7x$ is $7$ , not $0$ . Only terms with no $x$ at all (like a lone $y^2$ ) drop to zero.

2. The gradient is a vector, not a scalar. $\nabla f$ has one component per input variable and lives in the same space as the input. If your "gradient" of a 3-variable function is a single number, you have collapsed something you should not have. Check the shape: $\nabla f \in \mathbb{R}^n$ .

3. Descent uses the negative gradient. The gradient points toward increasing $f$ . To minimize a loss you step $\boldsymbol\theta \leftarrow \boldsymbol\theta - \eta\,\nabla L$ , with a minus. Drop the sign and you perform gradient ascent — your loss climbs and training diverges. A surprising number of "why won't my model learn?" bugs are a flipped sign here.
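The sign mistake is easy to reproduce on purpose. The sketch below runs the same loop twice on $f(x, y) = x^2 + xy + y^2$, once with the correct minus and once with a flipped plus; the helper `run`, the starting point, the step count, and the learning rate are illustrative assumptions.

```python
import numpy as np

def f(theta):
    x, y = theta
    return x**2 + x*y + y**2

def grad_f(theta):
    x, y = theta
    return np.array([2*x + y, x + 2*y])

def run(sign, steps=20, eta=0.1):
    """Apply theta <- theta + sign * eta * grad and return the final value of f."""
    theta = np.array([1.0, 2.0])
    for _ in range(steps):
        theta = theta + sign * eta * grad_f(theta)
    return f(theta)

start_value = f(np.array([1.0, 2.0]))    # 7.0
descent_value = run(-1.0)                # correct sign: the value falls
ascent_value = run(+1.0)                 # flipped sign: the value climbs

print(start_value, descent_value, ascent_value)
```

A one-character difference separates the two runs, which is why printing the loss for the first few steps is a cheap first check when a model refuses to learn.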

Summary

A partial derivative $\partial f / \partial x_i$ is the ordinary single-variable derivative taken while holding every other variable fixed (treating them as constants). The $\partial$ symbol flags the frozen companions.
The gradient $\nabla f = (\partial f/\partial x_1, \ldots, \partial f/\partial x_n)$ collects all partials into one vector, the same dimension as the input.
The gradient points in the direction of steepest ascent; its length is the rate of climb; and it is perpendicular to the contours of $f$ . Negate it for steepest descent.
For $f(x,y) = x^2 + xy + y^2$ , $\nabla f = (2x + y,\ x + 2y)$ ; at $(1,2)$ this is $(4, 5)$ .
Training computes $\nabla L(\boldsymbol\theta)$ — the loss's partial with respect to every parameter (this is what backprop produces) — and gradient descent steps opposite it: $\boldsymbol\theta \leftarrow \boldsymbol\theta - \eta\,\nabla L$ .
A numerical gradient (central differences, one coordinate at a time) checks an analytic gradient but is far too slow to train with.

Active recall

Answer from memory before checking the lesson:

In words, how do you compute $\partial f / \partial x$ for a function of $x$ and $y$ ? What do you do with $y$ ?
Compute both partials of $f(x, y) = x^2 + xy + y^2$ and give $\nabla f$ at the point $(0, 3)$ .
What shape is $\nabla L$ for a model with $p$ parameters, and what does its $j$ -th component mean?
Why does the gradient-descent update in eq. 17.3 have a minus sign? What goes wrong if you use a plus?
Geometrically, what is the angle between the gradient and the contour line passing through the same point?

Exercises

Level A: Recall & basic calculation

Level A · Hand calculation · ch17-A1

Gradient of the worked example

For $f(x, y) = x^2 + xy + y^2$ we found $\nabla f = (2x + y,\ x + 2y)$ . Evaluate the gradient at the point $(0, 3)$ . Enter as x, y.

Level A · Hand calculation · ch17-A2

A single partial derivative

Let $f(x, y) = 3x^2 + 2y$ . Compute $\dfrac{\partial f}{\partial x}$ and evaluate it at $x = 2$ (its value does not depend on $y$ ).

Level A · Hand calculation · ch17-A3

Partial of a product with powers

Let $f(x, y) = x^2 y^3$ . Compute $\dfrac{\partial f}{\partial y}$ and evaluate it at the point $(2, 1)$ .

Level A · Hand calculation · ch17-A4

Gradient of a linear function

Let $f(x, y, z) = x + 2y + 3z$ . Give the gradient $\nabla f$ . Enter as x, y, z.

Level A · Hand calculation · ch17-A5

The held variable is a constant, not zero

Let $f(x, y) = xy$ . Compute $\dfrac{\partial f}{\partial x}$ and evaluate it at the point $(5, 7)$ .

Level A · Equation interpretation · ch17-A6

What does the gradient point toward?

At a given point, the gradient $\nabla f$ points in the direction of what?

Level B: Conceptual understanding

Level B · Graph interpretation · ch17-B1

Gradient and contour lines

On a contour plot of $f$ , what is the angle between the gradient $\nabla f$ at a point and the contour line passing through that same point?

Level B · Equation interpretation · ch17-B2

The sign in the descent step

Gradient descent updates parameters as $\boldsymbol\theta \leftarrow \boldsymbol\theta - \eta\,\nabla L$ . Why is there a minus sign?

Level B · Shape reasoning · ch17-B3

Shape of the loss gradient

A model has $p$ parameters collected in $\boldsymbol\theta \in \mathbb{R}^p$ , and $L(\boldsymbol\theta)$ is a scalar loss. What is the shape of $\nabla L$ ?

Level B · ML application · ch17-B4

Reading a single loss partial

During training you compute $\dfrac{\partial L}{\partial \theta_j} = -0.8$ for one parameter. In one or two sentences, say what this tells you about $\theta_j$ , and which way gradient descent will move it.

Level B · Error identification · ch17-B5

Spot the differentiation error

A student differentiates $f(x, y) = x^2 + xy + y^2$ and writes $\dfrac{\partial f}{\partial x} = 2x$ , reasoning that the $xy$ and $y^2$ terms 'have $y$ in them, so they're zero.' What did they get wrong, and what is the correct $\partial f/\partial x$ ?

Level C: Derivation & implementation

Level C · NumPy implementation · ch17-C1

Implement a numerical gradient

Write numerical_gradient(f, v, h=1e-5) that approximates $\nabla f$ at a point v using central differences, one coordinate at a time. Test it on $f(x, y) = x^2 + xy + y^2$ at $(1, 2)$ , confirm it matches the analytic answer $(4, 5)$ with np.allclose, and print ok.

Level C · Derivation · ch17-C2

Why steepest ascent

Assume the directional derivative in a unit direction $\mathbf{u}$ is $D_{\mathbf{u}} f = \nabla f \cdot \mathbf{u}$ . Using the geometric form of the dot product, show that $f$ increases fastest when $\mathbf{u}$ points along $\nabla f$ .

Level C · Hand calculation · ch17-C3

One gradient-descent step

Minimize $f(x, y) = x^2 + xy + y^2$ by gradient descent, starting at $\boldsymbol\theta_0 = (1, 2)$ with learning rate $\eta = 0.1$ . Recall $\nabla f = (2x + y,\ x + 2y)$ . Compute $\boldsymbol\theta_1 = \boldsymbol\theta_0 - \eta\,\nabla f(\boldsymbol\theta_0)$ . Enter as x, y.

Level D: Research-thinking challenge

Level D · Paper-reading practice · ch17-D1

Numerical gradients vs. backprop

The central-difference numerical gradient is simple and always available, yet no one trains large networks with it. Explain the cost of the numerical approach for a model with $p$ parameters, why backpropagation is dramatically cheaper, and one situation where the numerical gradient is still the right tool.

Level D · Paper-reading practice · ch17-D2

How big should the step be?

Gradient descent moves $\boldsymbol\theta \leftarrow \boldsymbol\theta - \eta\,\nabla L$ . The gradient gives the best direction, but not how far to go. Describe qualitatively what goes wrong when the learning rate $\eta$ is far too large, and what goes wrong when it is far too small. Why can't we just read the ideal step size off the gradient itself?