Chain Rule and Computational Graphs

Why the chain rule is the engine of deep learning

Every neural network you will ever train is one enormous composed function: raw pixels flow into a linear layer, then a nonlinearity, then another linear layer, then another nonlinearity, dozens of times, until a single loss number comes out the end. Training means asking one question over and over — if I nudge this weight, how does the loss change? — and adjusting the weight against that answer.

That question is a derivative of a deeply nested function. The tool that answers it is the chain rule, and the data structure that makes the answer cheap to compute is the computational graph. Put the two together and you have backpropagation — the algorithm that trains essentially every deep model in existence. This chapter builds it from the ground up, and by the end you will have written a tiny backprop pass by hand and checked it against the definition of the derivative.

Intuition: rates multiply through a chain

Suppose a gear train drives a wheel. Turning the crank one turn spins the middle gear 3 turns; each turn of the middle gear spins the wheel 2 turns. How fast does the wheel turn per turn of the crank? You multiply: $3 \times 2 = 6$ . Rates of change compose by multiplication.

The chain rule says exactly this for functions. If $y$ depends on $u$ , and $u$ depends on $x$ , then the sensitivity of $y$ to $x$ is the sensitivity of $y$ to $u$ times the sensitivity of $u$ to $x$ :

$\frac{dy}{dx} = \frac{dy}{du}\cdot\frac{du}{dx}.$

The intermediate variable $u$ acts like the middle gear. Notice how the notation almost begs you to "cancel" the $du$ — that is a useful mnemonic, though not a literal fraction cancellation. The real content is: a small change $dx$ produces a change $du = \frac{du}{dx}\,dx$ in the middle, which produces a change $dy = \frac{dy}{du}\,du$ at the end. Substitute one into the other and the local rates multiply.
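The substitution argument can be checked numerically. The following is a minimal sketch, not part of the original lesson: the functions `g` and `f` are arbitrary choices made for illustration, and each rate is estimated with a small finite step rather than computed exactly.

```python
import math

def g(x):
    # inner function (hypothetical choice): u = g(x)
    return 3.0 * x + 1.0

def f(u):
    # outer function (hypothetical choice): y = f(u)
    return math.sin(u)

def rate(fn, at, eps=1e-6):
    # central finite-difference estimate of the rate of change of fn at `at`
    return (fn(at + eps) - fn(at - eps)) / (2 * eps)

x0 = 0.5
u0 = g(x0)
du_dx = rate(g, x0)                    # local rate of the inner step
dy_du = rate(f, u0)                    # local rate of the outer step
dy_dx = rate(lambda x: f(g(x)), x0)    # rate of the whole chain, measured directly

print(du_dx * dy_du, dy_dx)            # the two numbers should agree closely
```

The product of the two local rates and the directly measured rate of the composition should match up to finite-difference error, which is the gear-train picture in numbers.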

Interactive Lab: Chain-Rule Visualizer

The lab above traces our running example, $y = (wx+b)^2$ , through its intermediate steps. Drag the inputs and watch each local rate light up, then watch them multiply back along the arrows. Keep it open — everything below is a written account of what that picture is doing.

The chain rule, formally

Symbol	Meaning	Type	Shape	Role
$x$	Input variable we differentiate with respect to	scalar	1	variable
$u = g(x)$	Intermediate value (the inner function)	scalar	1	variable
$y = f(u)$	Output (the outer function)	scalar	1	variable
$\frac{du}{dx}$	Local derivative of the inner step	scalar	1	derivative
$\frac{dy}{du}$	Local derivative of the outer step	scalar	1	derivative
$\frac{dy}{dx}$	Total derivative (product of the locals)	scalar	1	derivative

The word local is the one to hold onto. Each step in a computation knows only its own tiny derivative — how its output moves when its input moves. The chain rule is the bookkeeping that assembles those purely local facts into the global answer.

A numerical example, two ways

Take the running function with a single variable in view. Fix $w$ and $b$ and ask for $\dfrac{dy}{dx}$ where

$y = (wx + b)^2. \qquad (15.1)$

We will differentiate it twice, by two different routes, and confirm the answers are identical — that is the whole promise of the chain rule.
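The two routes can be written out as follows. This is a reconstruction under one stated assumption: the numerical values $w = 2$ , $x = 3$ , $b = 1$ are borrowed from the forward pass later in the chapter so that the result can be compared with it, and the intermediate is named $z = wx + b$ to match the graph decomposition used there.

```latex
% Route 1: expand first, then differentiate term by term.
y = (wx+b)^2 = w^2x^2 + 2wbx + b^2
\quad\Longrightarrow\quad
\frac{dy}{dx} = 2w^2x + 2wb = 2w(wx+b).

% Route 2: chain rule with the intermediate z = wx + b, so that y = z^2.
\frac{dy}{dx} = \frac{dy}{dz}\cdot\frac{dz}{dx} = 2z\cdot w = 2w(wx+b).

% Both routes evaluated at w = 2, x = 3, b = 1:
\frac{dy}{dx} = 2\cdot 2\cdot(2\cdot 3 + 1) = 28.
```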

The two routes agree, as they must. The lesson is not that they agree — it is why the chain-rule route scales. Expanding worked here because the function was tiny. For a hundred-layer network there is no "expanded form" to write down; the chain rule, applied step by step, is the only feasible route.

Deriving it through the computational graph

To make the chain rule mechanical, we lay the computation out as a graph. Each node is one elementary operation; each edge carries a value forward and, later, a gradient backward. Our running example $y = (wx+b)^2$ breaks into three steps:

$u = wx, \qquad z = u + b, \qquad y = z^2. \qquad (15.2)$

The graph has inputs $w, x, b$ feeding a multiply node ( $u$ ), an add node ( $z$ ), and a square node ( $y$ ).

Forward pass — compute values, left to right

Pick concrete inputs $w = 2$ , $x = 3$ , $b = 1$ and walk the graph forward, storing each intermediate result (we will need them in a moment):

$u = wx = 6, \qquad z = u + b = 7, \qquad y = z^2 = 49. \qquad (15.3)$

That is an ordinary function evaluation. The one discipline that matters: save the intermediates $u$ and $z$ . Backprop reuses them, and recomputing each one from the inputs whenever it is needed would repeat work the forward pass has already done, a cost that compounds as the graph gets deeper.
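One way to make "graph plus cached intermediates" concrete is to store the graph as data. The sketch below is an illustration only: the `GRAPH` list layout and the `run_forward` helper are hypothetical names, not something the chapter or any framework defines.

```python
import operator

# Each node: (output name, operation, names of its inputs), listed in an
# order where every input is computed before it is used.
GRAPH = [
    ("u", operator.mul, ("w", "x")),      # multiply node
    ("z", operator.add, ("u", "b")),      # add node
    ("y", lambda z: z ** 2, ("z",)),      # square node
]

def run_forward(graph, inputs):
    # Walk the nodes left to right, storing every value by name so the
    # backward pass can look the intermediates up later.
    values = dict(inputs)
    for name, op, arg_names in graph:
        values[name] = op(*(values[a] for a in arg_names))
    return values

values = run_forward(GRAPH, {"w": 2.0, "x": 3.0, "b": 1.0})
print(values["u"], values["z"], values["y"])  # 6.0 7.0 49.0
```

The dictionary returned by `run_forward` plays the role of the cache: every intermediate is computed once and kept.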

Local derivatives — one per edge

Before flowing anything backward, write down each node's own derivative with respect to its inputs. These are the "gears," and each is trivial in isolation:

$\frac{dy}{dz} = 2z, \qquad \frac{dz}{du} = 1, \qquad \frac{dz}{db} = 1, \qquad \frac{du}{dw} = x, \qquad \frac{du}{dx} = w. \qquad (15.4)$

The square node contributes $2z$ ; an addition node passes its gradient through untouched (derivative $1$ to each input); a multiply node hands each input the other factor.
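The three local rules can be written as three tiny functions, each knowing only about its own node. This is a sketch for illustration; the function names `local_square`, `local_add`, and `local_mul` are hypothetical.

```python
def local_square(z):
    # y = z**2  ->  dy/dz = 2z
    return (2.0 * z,)

def local_add(u, b):
    # z = u + b  ->  dz/du = 1 and dz/db = 1, whatever the input values are
    return (1.0, 1.0)

def local_mul(w, x):
    # u = w * x  ->  du/dw = x and du/dx = w (each input gets the other factor)
    return (x, w)

w, x, b = 2.0, 3.0, 1.0
u = w * x
z = u + b
print(local_square(z))   # (14.0,)
print(local_add(u, b))   # (1.0, 1.0)
print(local_mul(w, x))   # (3.0, 2.0)
```

None of these functions refers to the final output $y$ or to any other node; that independence is what "local" means here.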

Backward pass — multiply local derivatives back to the inputs

Now start at the output with $\dfrac{dy}{dy} = 1$ and flow right to left, at each edge multiplying by the local derivative. This accumulated quantity — the derivative of the final output with respect to the current node — is the node's gradient.

Propagating the gradient backward

Start at $y$ and move toward each input, multiplying local derivatives as we cross each edge.

To $z$ : $\frac{dy}{dz} = 2z = 2(7) = 14.$

To $u$ (through the add node, whose local derivative is $1$ ): $\frac{dy}{du} = \frac{dy}{dz}\cdot\frac{dz}{du} = 14\cdot 1 = 14.$

To $b$ (the other input of the add node): $\frac{dy}{db} = \frac{dy}{dz}\cdot\frac{dz}{db} = 14\cdot 1 = 14.$

To $w$ (through the multiply node, local derivative $x$ ): $\frac{dy}{dw} = \frac{dy}{du}\cdot\frac{du}{dw} = 14\cdot x = 14\cdot 3 = 42.$

To $x$ (the other input of the multiply node, local derivative $w$ ): $\frac{dy}{dx} = \frac{dy}{du}\cdot\frac{du}{dx} = 14\cdot w = 14\cdot 2 = 28.$

Collecting the results in closed form (substitute $z = wx+b$ ):

$\frac{dy}{dw} = 2z\,x, \qquad \frac{dy}{db} = 2z, \qquad \frac{dy}{dx} = 2z\,w. \qquad (15.5)$

Sanity check against the earlier example: $\dfrac{dy}{dx} = 2z\,w = 2(7)(2) = 28$ — exactly the number we got by expanding, and by the one-line chain rule. Three routes, one answer.
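The hand calculation above can be mechanized. What follows is a minimal sketch of a scalar reverse-mode differentiator, written for this example only: the `Value` class and its methods are hypothetical names and do not reproduce the API of any real framework.

```python
class Value:
    """Scalar that remembers how it was computed (illustrative only)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # input Values of the producing node
        self.local_grads = local_grads  # d(self)/d(parent), one per parent

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def square(self):
        return Value(self.data ** 2, (self,), (2.0 * self.data,))

    def backward(self):
        # Order the nodes so each one is processed after all its consumers.
        order, seen = [], set()

        def visit(v):
            if id(v) not in seen:
                seen.add(id(v))
                for p in v.parents:
                    visit(p)
                order.append(v)

        visit(self)
        self.grad = 1.0                       # seed dy/dy = 1
        for v in reversed(order):
            for p, g in zip(v.parents, v.local_grads):
                p.grad += v.grad * g          # multiply by the local derivative

w, x, b = Value(2.0), Value(3.0), Value(1.0)
y = (w * x + b).square()
y.backward()
print(y.data, w.grad, x.grad, b.grad)  # 49.0 42.0 28.0 14.0
```

Each operation records its local derivatives at the moment it runs, so the backward sweep only has to multiply and accumulate.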

This is backpropagation

What we just did by hand is the backpropagation algorithm in full. There is no extra machinery hiding in PyTorch or JAX — a deep-learning framework builds the computational graph as your forward code runs, stores each intermediate, and then sweeps backward multiplying local derivatives, reusing those stored intermediates exactly as we reused $z$ and $u$ . Training a network is:

Forward pass — run the input through the graph to a scalar loss, caching intermediates.
Backward pass — seed $\frac{d\,\text{loss}}{d\,\text{loss}} = 1$ and propagate gradients back to every weight.
Update — nudge each weight against its gradient (next chapter: gradient descent).
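The three steps can be sketched as a loop on the running example, treating $y = (wx+b)^2$ as the loss and $w$, $b$ as the weights. This is an illustration under assumptions: the step size `lr` and the number of steps are arbitrary choices, and the plain subtraction in step 3 is only a preview of the update rule.

```python
def forward(w, x, b):
    u = w * x
    z = u + b
    return z ** 2, z          # loss value plus the cached intermediate

def backward(z, x):
    dz = 1.0 * 2.0 * z        # seed of 1, then the square node
    return dz * x, dz         # gradients for w and for b

w, b, x = 2.0, 1.0, 3.0       # x is the fixed input; w and b are the weights
lr = 0.01                     # assumed step size, chosen for illustration
losses = []
for step in range(20):
    loss, z = forward(w, x, b)        # 1. forward pass, caching z
    dw, db = backward(z, x)           # 2. backward pass
    w -= lr * dw                      # 3. update against the gradient
    b -= lr * db
    losses.append(loss)

print(losses[0], losses[-1])          # the loss should shrink from its start value
```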

Because the total derivative along any path is a product of local derivatives, the number of layers controls how many factors get multiplied — and that has a sharp practical consequence.

ML Connection

Vanishing and exploding gradients. The gradient reaching an early layer is a product of one local derivative per later layer. If each of those factors is consistently smaller than $1$ (common with saturating activations like sigmoid, whose slope maxes out at $0.25$ ), the product decays geometrically: $0.25^{20}\approx 9\times 10^{-13}$ . The early layers then receive essentially no signal and stop learning; this is the vanishing gradient problem. If the factors are consistently larger than $1$ , the product explodes and training diverges. Many staples of modern deep learning exist to keep this product of local derivatives near a healthy scale: ReLU activations, residual (skip) connections, careful weight initialization, normalization layers, and gradient clipping.
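The geometric decay is easy to reproduce. The sketch below looks only at the activation's slope and ignores the weight factors that also enter each layer's local derivative; the pre-activation value `2.0` and the factor `1.5` are assumed values picked for illustration.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def sigmoid_slope(t):
    # derivative of the sigmoid: s(t) * (1 - s(t)), largest at t = 0
    s = sigmoid(t)
    return s * (1.0 - s)

layers = 20
best_case = sigmoid_slope(0.0) ** layers   # every factor at its maximum, 0.25
typical = sigmoid_slope(2.0) ** layers     # pre-activations away from 0 (assumed)
exploding = 1.5 ** layers                  # factors above 1 (assumed value)

print(sigmoid_slope(0.0))   # 0.25
print(best_case)            # roughly 9.1e-13
print(typical, exploding)
```

Even the best case for sigmoid leaves almost nothing after 20 layers, and moving the pre-activations away from zero makes the product smaller still.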

NumPy: forward, backward, and a gradient check

The real test of a backward pass is whether it matches the definition of the derivative. We approximate each true derivative with a finite difference, $\frac{dy}{d\theta}\approx\frac{y(\theta+\varepsilon)-y(\theta-\varepsilon)}{2\varepsilon}$ , and assert our analytic gradients agree. This "gradient check" is exactly how you debug a hand-written backprop in practice. Run it:

chain_rule_backprop.py

import numpy as np

# Running example: y = (w*x + b)**2, decomposed as u=w*x, z=u+b, y=z**2.

def forward(w, x, b):
  # Forward pass: compute values left to right, caching intermediates.
  u = w * x          # multiply node
  z = u + b          # add node
  y = z ** 2         # square node
  cache = (u, z, x, w)
  return y, cache

def backward(cache):
  # Backward pass: seed dy/dy = 1 and multiply local derivatives back.
  u, z, x, w = cache
  dy = 1.0
  dz = dy * (2.0 * z)   # local deriv of square: dy/dz = 2z
  du = dz * 1.0         # add node passes gradient through: dz/du = 1
  db = dz * 1.0         # other input of add: dz/db = 1
  dw = du * x           # multiply node hands over the OTHER factor: du/dw = x
  dx = du * w           # other input of multiply: du/dx = w
  return dw, dx, db

w, x, b = 2.0, 3.0, 1.0
y, cache = forward(w, x, b)
dw, dx, db = backward(cache)
print("forward y   =", y)             # 49.0
print("analytic grads dw,dx,db =", dw, dx, db)   # 42.0 28.0 14.0

# Finite-difference gradient check against the definition of the derivative.
def num_grad(f, eps=1e-6):
  return (f(eps) - f(-eps)) / (2 * eps)

fd_w = num_grad(lambda h: forward(w + h, x, b)[0])
fd_x = num_grad(lambda h: forward(w, x + h, b)[0])
fd_b = num_grad(lambda h: forward(w, x, b + h)[0])
print("finite-diff  dw,dx,db =", fd_w, fd_x, fd_b)

# The analytic backward pass must match the numerical derivative.
assert np.allclose([dw, dx, db], [fd_w, fd_x, fd_b], atol=1e-4)
print("ok")

Notice the shape of the backward function: it is a straight-line sequence of multiplications, one per node, walking the cache in reverse. That is all backpropagation ever is. Scale this pattern up to matrices and thousands of nodes and you have the training loop of a real network.
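To hint at what "scaling up to matrices" looks like, here is a sketch of the same three nodes with a weight matrix. It rests on assumptions not made in the text: the squared vector is summed to give a scalar output, and the shapes and random seed are arbitrary.

```python
import numpy as np

def forward(W, x, b):
    u = W @ x                 # matrix-vector multiply node
    z = u + b                 # add node
    y = np.sum(z ** 2)        # square node, summed to a scalar (assumption)
    return y, (z, x, W)

def backward(cache):
    z, x, W = cache
    dz = 2.0 * z              # dy/dz, one entry per component of z
    db = dz                   # add node passes the gradient through
    dW = np.outer(dz, x)      # multiply node: entry W[i, j] sees x[j]
    dx = W.T @ dz             # multiply node: x[j] sees column j of W
    return dW, dx, db

rng = np.random.default_rng(0)
W, x, b = rng.normal(size=(3, 4)), rng.normal(size=4), rng.normal(size=3)
y, cache = forward(W, x, b)
dW, dx, db = backward(cache)

# Central finite-difference check on every entry of W.
eps = 1e-6
fd_W = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        step = np.zeros_like(W)
        step[i, j] = eps
        fd_W[i, j] = (forward(W + step, x, b)[0]
                      - forward(W - step, x, b)[0]) / (2 * eps)

print(np.allclose(dW, fd_W, atol=1e-4))
```

The backward function has the same three-step shape as the scalar one; only the local derivatives have become an outer product and a transposed matrix product.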

A variable used twice needs its gradients summed

A frequent backprop bug is a forgotten branch. If a variable feeds two downstream paths, say $y = x\cdot g(x)$ , where $x$ is used by both the multiply and by $g$ , then its total gradient is the sum of the gradients arriving from each path, not just one of them. This is the multivariable chain rule: $\frac{dy}{dx} = \sum_{\text{paths}} (\text{product along path})$ . Overwriting instead of accumulating (dx = ... where you meant dx += ...) silently drops a branch, and the gradient check will fail on exactly those variables. When your finite-difference check disagrees, look first for a node whose output fans out to more than one consumer.
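The difference between accumulating and overwriting shows up in a few lines. In this sketch $g(x) = x^2$ is an assumed choice (so that $y = x^3$ and the true derivative is $3x^2$), and both backward functions are hypothetical helpers written for the comparison.

```python
def forward(x):
    g = x ** 2          # path 1: x is used inside g
    y = x * g           # path 2: x is used directly by the multiply
    return y, (x, g)

def backward_correct(cache):
    x, g = cache
    dy = 1.0
    dg = dy * x                 # multiply node, gradient toward g
    dx = dy * g                 # multiply node, direct gradient toward x
    dx += dg * (2.0 * x)        # second path through g: accumulate
    return dx

def backward_buggy(cache):
    x, g = cache
    dy = 1.0
    dg = dy * x
    dx = dy * g
    dx = dg * (2.0 * x)         # overwrites: the direct path is lost
    return dx

x0 = 2.0
y, cache = forward(x0)
eps = 1e-6
fd = (forward(x0 + eps)[0] - forward(x0 - eps)[0]) / (2 * eps)
print(backward_correct(cache), backward_buggy(cache), fd)
```

At $x = 2$ the accumulated version returns 12, matching the finite difference, while the overwriting version returns 8 because the direct path's contribution of 4 was dropped.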

Summary

The chain rule composes rates by multiplication: for $y = f(g(x))$ , $\dfrac{dy}{dx} = \dfrac{dy}{du}\dfrac{du}{dx}$ — outer derivative times inner derivative. A long chain multiplies one local derivative per link.
A computational graph turns a formula into nodes (operations) and edges (values forward, gradients backward). Our running example splits as $u=wx$ , $z=u+b$ , $y=z^2$ .
The forward pass computes and caches values; the backward pass seeds $\frac{dy}{dy}=1$ and multiplies local derivatives right-to-left, reusing the cached intermediates. For $(wx+b)^2$ this gives $\frac{dy}{dw}=2zx$ , $\frac{dy}{db}=2z$ , $\frac{dy}{dx}=2zw$ .
This bookkeeping is backpropagation, the algorithm that trains essentially every deep model.
Because a deep gradient is a product of many local derivatives, it can vanish (factors $<1$ ) or explode (factors $>1$ ); much of modern architecture design exists to control that product.
A variable used on multiple paths sums the gradients from each path; forgetting one is the classic backprop bug, caught by a finite-difference gradient check.

Active recall

Answer from memory before checking the lesson:

State the chain rule for $y = f(g(x))$ and explain, in words, why the two local derivatives multiply rather than add.
For $y = (wx+b)^2$ with $w=2$ , $x=3$ , $b=1$ , run the forward pass and report $u$ , $z$ , and $y$ .
From that forward pass, run the backward pass to get $\frac{dy}{dw}$ , $\frac{dy}{db}$ , and $\frac{dy}{dx}$ . What role does the cached value $z$ play?
Why can a gradient vanish in a deep network, and what does that do to the early layers' learning?
A variable is used in two places in a graph. How do you combine the gradients coming back along the two paths, and what goes wrong if you don't?

Exercises

Level A: Recall & basic calculation

Level A | Hand calculation | ch15-A1

Chain rule at a point

Let $y = (3x + 2)^2$ . Use the chain rule to compute $\dfrac{dy}{dx}$ at $x = 1$ .

Level A | Hand calculation | ch15-A2

Forward pass of the graph

For $y = (wx + b)^2$ decomposed as $u = wx$ , $z = u + b$ , $y = z^2$ , run the forward pass with $w = 1$ , $x = 4$ , $b = -2$ and report $y$ .

Level A | Hand calculation | ch15-A3

Local derivative of the square node

The square node computes $y = z^2$ . What is its local derivative $\dfrac{dy}{dz}$ evaluated at $z = 5$ ?

Level A | Hand calculation | ch15-A4

Gradient with respect to the bias

For $y = (wx + b)^2$ , the backward pass gives $\dfrac{dy}{db} = 2(wx+b)$ . Evaluate it at $w = 2$ , $x = 3$ , $b = 1$ .

Level A | Hand calculation | ch15-A5

Gradient through an add node

In the graph $z = u + b$ , the gradient arriving at $z$ is $\dfrac{dy}{dz} = 6$ . What gradient does the add node pass back to $u$ ?

Level A | Hand calculation | ch15-A6

Gradient through a multiply node

In the graph $u = wx$ , the gradient arriving at $u$ is $\dfrac{dy}{du} = 14$ and $x = 3$ . What gradient does the multiply node pass back to $w$ ?

Level B: Conceptual understanding

Level B | Equation interpretation | ch15-B1

Why local derivatives multiply

For $y = f(g(x))$ with $u = g(x)$ , why is $\dfrac{dy}{dx} = \dfrac{dy}{du}\cdot\dfrac{du}{dx}$ a product of the two local rates rather than a sum?

Level B | Hand calculation | ch15-B2

Backward pass with new inputs

For $y = (wx + b)^2$ run forward with $w = -1$ , $x = 2$ , $b = 3$ , then use the backward pass to compute $\dfrac{dy}{dw}$ . (Recall $\dfrac{dy}{dw} = 2(wx+b)\,x$ .)

Level B | ML application | ch15-B3

Diagnosing a vanishing gradient

A 20-layer network uses an activation whose local derivative is at most $0.25$ everywhere. Roughly what happens to the gradient reaching the first layer, and why?

Level B | ML application | ch15-B4

Why cache the intermediates?

Backpropagation stores intermediate values (like $z$ in $y = z^2$ ) during the forward pass instead of recomputing them during the backward pass. In one or two sentences, explain why this matters for cost.

Level B | Equation interpretation | ch15-B5

A variable used on two paths

In a graph, the variable $x$ feeds two different downstream nodes. When you run the backward pass, how do you combine the gradients arriving at $x$ from the two paths?

Level C: Derivation & implementation

Level C | NumPy implementation | ch15-C1

Implement backprop for (wx+b)²

Implement forward(w, x, b) returning $y=(wx+b)^2$ plus a cache, and backward(cache) returning $(\partial y/\partial w,\ \partial y/\partial x,\ \partial y/\partial b)$ by multiplying local derivatives back through $u=wx$ , $z=u+b$ , $y=z^2$ . Verify at $w=2,x=3,b=1$ that the gradients are $(42, 28, 14)$ , then print ok.

Level C | Derivation | ch15-C2

Backprop a three-input product

Consider $y = (ab + c)^2$ with graph $u = ab$ , $z = u + c$ , $y = z^2$ . Derive $\dfrac{\partial y}{\partial a}$ , $\dfrac{\partial y}{\partial b}$ , and $\dfrac{\partial y}{\partial c}$ by the backward pass, then evaluate them at $a = 2$ , $b = 3$ , $c = -1$ .

Level C | NumPy implementation | ch15-C3

Finite-difference gradient check

Write a central finite-difference checker for $y = (wx+b)^2$ . Use $\dfrac{\partial y}{\partial \theta} \approx \dfrac{y(\theta+\varepsilon) - y(\theta-\varepsilon)}{2\varepsilon}$ with $\varepsilon = 10^{-6}$ to approximate $\partial y/\partial w$ , $\partial y/\partial x$ , $\partial y/\partial b$ at $w=2,x=3,b=1$ , assert they match the analytic gradients $(42,28,14)$ with np.allclose, then print ok.

Level C | Numerical experiment | ch15-C4

Simulate a vanishing-gradient product

The gradient at the first of $L$ layers is a product of $L$ local derivatives. Numerically compute the product of $L = 50$ factors each equal to $0.5$ , and separately each equal to $1.1$ , print both, and assert the first is below $10^{-10}$ and the second is above $100$ . End by printing ok.

Level D: Research-thinking challenge

Level D | Paper-reading practice | ch15-D1

Why do gradients vanish, and how is it fixed?

Explain, using the chain rule, why deep networks with sigmoid activations suffer vanishing gradients. Then explain mechanistically how two of {ReLU activation, residual (skip) connections, batch/layer normalization} address it — pointing to what each does to the per-layer local derivative.

Level D | Paper-reading practice | ch15-D2

Reverse-mode vs forward-mode autodiff

Backpropagation is reverse-mode automatic differentiation. For a function $f:\mathbb{R}^n \to \mathbb{R}$ (many inputs, scalar loss — the neural-network case), explain why reverse mode computes all $n$ input gradients in roughly one backward sweep, whereas forward-mode autodiff would need about $n$ passes. What does this asymmetry cost, and when would forward mode actually be preferable?