Limits and Continuity

Why limits sit under every gradient

Everything you will do to train a model comes down to one question: if I nudge a parameter a tiny bit, how does the loss respond? That "response to a tiny nudge" is a derivative, and a derivative is defined as a limit — the value a ratio approaches as the nudge shrinks toward zero. Backpropagation is nothing but the chain rule applied to millions of these limits. So before we can differentiate anything (next chapter) or descend any gradient (the chapter after), we need to be fluent in the one idea underneath them all: what it means for a function to approach a value.

Limits also explain a class of bugs you will actually hit. Why does 0/0 show up in a correctly derived formula and still give the right answer once simplified? Why does a finite-difference gradient check get worse when you make the step too small? Why is ReLU, an activation found in a large share of modern networks, differentiable everywhere except one point, and what do frameworks do at that point? All three are limit questions, and this chapter answers them.

Intuition: approaching, not arriving

Here is the whole idea in one sentence: a limit describes where a function is headed, not where it lands. Those can differ, and keeping them separate is the entire game.

Take $f(x) = \dfrac{x^2 - 1}{x - 1}$. At exactly $x = 1$ this is $\frac{0}{0}$: undefined, a hole in the graph. The function has no value there. And yet, as $x$ creeps toward $1$ from either side, the outputs march steadily toward $2$. Plug in $0.9$ and you get $1.9$; plug in $0.99$ and you get $1.99$; plug in $1.001$ and you get $2.001$. The function is never evaluated at $x = 1$, but it is unmistakably aiming at $2$. That destination, $2$, is the limit.

This is why "just plug in the point" is not what a limit means. Sometimes plugging in works (for nice functions it does), but the definition is about the approach. The value at the point is a separate question, and continuity, later in this chapter, is exactly the statement that the two happen to agree.

Formal definitions

Because $x$ may approach from two directions, we split the idea into one-sided limits. The left-hand limit uses only $x < a$ and the right-hand limit uses only $x > a$ :

$$\lim_{x \to a^-} f(x) \quad (\text{approach from } x < a), \qquad \lim_{x \to a^+} f(x) \quad (\text{approach from } x > a) \tag{13.1}$$

The two-sided limit exists only when both one-sided limits exist and are equal. If the function approaches $3$ from the left and $5$ from the right, there is no single destination, and $\lim_{x\to a} f(x)$ does not exist. This is the exact mechanism behind a "jump" in a step function.
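As a rough numerical illustration of one-sided limits, the sketch below probes a function from each side of $a$ with a shrinking step. The helper name `one_sided_estimates` and the test function $|x|/x$ are illustrative choices, not part of the chapter, and sampling a few points only suggests a limit; it does not prove one.

```python
def one_sided_estimates(f, a, steps=(1e-1, 1e-3, 1e-6)):
    """Evaluate f just left and just right of a for shrinking steps.

    Hypothetical helper for illustration: it samples f(a - h) and f(a + h),
    which hints at the one-sided limits but is not a proof.
    """
    left = [f(a - h) for h in steps]    # uses only x < a
    right = [f(a + h) for h in steps]   # uses only x > a
    return left, right


def sign_like(x):
    # |x| / x is -1 for x < 0 and +1 for x > 0; it is undefined at x = 0.
    return abs(x) / x


left, right = one_sided_estimates(sign_like, 0.0)
print("from the left: ", left)
print("from the right:", right)

# The two sides settle on different values, so no two-sided limit exists at 0.
two_sided_plausible = abs(left[-1] - right[-1]) < 1e-9
print("two-sided limit plausible?", two_sided_plausible)
```

The function is never called at $x = 0$ itself, which mirrors the definition: only nearby inputs matter.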

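A function $f$ is continuous at $a$ when three conditions hold: $f(a)$ exists, $\lim_{x\to a} f(x)$ exists, and the two are equal. When one of them breaks, the failure takes one of three shapes, the three kinds of discontinuity. The table lists them along with the two quantities being compared, the limit $L$ and the value $f(a)$: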

Term	Meaning	Kind	Shape	Role
$\text{removable}$	Limit exists but ≠ f(a), or f(a) undefined (a hole)	discontinuity	—	fixable
$\text{jump}$	Left and right limits exist but disagree (a step)	discontinuity	—	structural
$\text{infinite}$	f(x) blows up to ±∞ near a (a vertical asymptote)	discontinuity	—	structural
$L$	The value f(x) approaches as x → a	scalar	1	target
$f(a)$	The actual value at a (may differ from L, or not exist)	scalar	1	value

A removable discontinuity is the hole we just saw: patch a single point and the function becomes continuous. A jump cannot be patched — the one-sided limits genuinely disagree. An infinite discontinuity means the limit is not a finite number at all.

Evaluating a 0/0 limit by factoring

For continuous functions such as polynomials, direct substitution gives the limit: plug in the point and you are done. The interesting case is the indeterminate form $\frac{0}{0}$: both numerator and denominator vanish at $a$, so substitution tells you nothing. The standard move is to factor and cancel the term that makes both of them zero.

Worked Example — a 0/0 limit that factors

Evaluate $\displaystyle\lim_{x \to 1} \frac{x^2 - 1}{x - 1}$ .

Substitution gives $\frac{1 - 1}{1 - 1} = \frac{0}{0}$ — indeterminate. Factor the numerator as a difference of squares: $\frac{x^2 - 1}{x - 1} = \frac{(x - 1)(x + 1)}{x - 1} = x + 1 \quad (\text{for } x \neq 1).$ The offending $(x - 1)$ cancels. Away from the hole the function is $x + 1$ , so $\lim_{x \to 1} \frac{x^2 - 1}{x - 1} = \lim_{x \to 1} (x + 1) = 2.$ The cancellation is legal precisely because the limit never uses $x = 1$ ; it only uses $x$ near $1$ , where dividing by $x - 1 \neq 0$ is fine.

Watch the same limit numerically. The table below evaluates the original unfactored expression as $x$ closes in on $1$ from both sides:

Input	Output	Side	Direction	Role
$x = 0.9$	f(x) = 1.9	left	→	approach
$x = 0.99$	f(x) = 1.99	left	→	approach
$x = 0.999$	f(x) = 1.999	left	→	approach
$x = 1$	f(x) = 0/0 — undefined (the hole)	point	✗	gap
$x = 1.001$	f(x) = 2.001	right	←	approach
$x = 1.01$	f(x) = 2.01	right	←	approach
$x = 1.1$	f(x) = 2.1	right	←	approach

The rows on both sides of the gap converge on $2$ even though the middle row has no value. The limit is a statement about the rows around the gap, never the gap itself.
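The same numbers can be reproduced with a few lines of Python. This is a minimal sketch that evaluates the original, unfactored expression; the function name `f` and the chosen sample points are just the ones from the worked example.

```python
def f(x):
    # The original, unfactored expression; it divides by zero at x = 1.
    return (x ** 2 - 1) / (x - 1)


for x in [0.9, 0.99, 0.999, 1.001, 1.01, 1.1]:
    print(f"x = {x:<6}  f(x) = {f(x):.6f}")

# At the hole itself Python raises instead of returning a value.
try:
    f(1.0)
    hole_is_undefined = False
except ZeroDivisionError:
    hole_is_undefined = True
print("f(1) undefined:", hole_is_undefined)
```

The printed values may differ from the exact decimals in the last digits because of floating-point rounding; the trend toward $2$ is what matters.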

Derivation: the difference quotient is a limit

Now the payoff. The slope of a straight line is rise over run, $\frac{\Delta y}{\Delta x}$ . A curve has no single slope, but it has a slope at each point, and we get it by a limit. Fix a point $a$ and a small step $h \neq 0$ . The line through the two points $(a, f(a))$ and $(a + h, f(a + h))$ — a secant — has slope

$$\frac{f(a + h) - f(a)}{h} \tag{13.2}$$

This ratio is the difference quotient. Note that at $h = 0$ it is exactly $\frac{0}{0}$ — the same indeterminate form as before, and for the same reason. We never evaluate it at $h = 0$ ; we take the limit as $h$ approaches $0$ . As $h$ shrinks, the secant line pivots and settles onto the tangent line, and its slope settles onto the instantaneous rate of change. That limiting value is the derivative:

$$f'(a) \;=\; \lim_{h \to 0} \frac{f(a + h) - f(a)}{h} \tag{13.3}$$

Why the 0/0 resolves: a concrete slope

Take $f(x) = x^2$ and compute $f'(a)$ from the definition. The difference quotient is $\frac{(a + h)^2 - a^2}{h} = \frac{a^2 + 2ah + h^2 - a^2}{h} = \frac{2ah + h^2}{h}.$ Both top and bottom vanish at $h = 0$, so the form is indeterminate. But factor an $h$ out of the numerator and cancel (legal, since $h \neq 0$ in the limit): $\frac{h(2a + h)}{h} = 2a + h.$ Now the limit is trivial: as $h \to 0$, the leftover $h$ disappears and $f'(a) = \lim_{h \to 0} (2a + h) = 2a.$ The derivative of $x^2$ is $2x$. The derivative rules you will meet come from this same maneuver (form the $\frac{0}{0}$ quotient, resolve the indeterminate form, read off what remains) done once, in general, so you never have to take the limit by hand again. For polynomials the resolving step is cancelling $h$; other functions need other limit arguments, but the pattern is the same.
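A quick numerical check of this algebra: for $f(x) = x^2$, the secant slope should match $2a + h$ for every nonzero $h$, up to rounding. The names `difference_quotient` and `square`, and the point $a = 1.5$, are arbitrary choices for this sketch.

```python
def difference_quotient(f, a, h):
    # Slope of the secant through (a, f(a)) and (a + h, f(a + h)); needs h != 0.
    return (f(a + h) - f(a)) / h


def square(x):
    return x * x


a = 1.5
for h in [0.5, 0.1, 0.01, 0.001]:
    secant = difference_quotient(square, a, h)
    cancelled = 2 * a + h   # the form left after cancelling h on paper
    print(f"h = {h:<6} secant slope = {secant:.6f}   2a + h = {cancelled:.6f}")
```

As $h$ shrinks, both columns close in on $2a = 3$, which is the derivative at this point.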

A function is differentiable at $a$ exactly when this limit exists — which, being a limit, requires the left and right difference quotients to agree. A well-behaved, two-sided limit is not a technicality; it is the whole requirement.

ML use case: gradients are limits, and ReLU has a kink

Two facts from this chapter run straight through modern deep learning.

Gradients are difference-quotient limits. The gradient of a loss with respect to a weight $w_i$ is $\partial L / \partial w_i$, and each such partial is exactly the limit in eq. 13.3 taken along that one coordinate. When you cannot get the derivative in closed form, you approximate the limit by stopping $h$ at a small finite value. This is the finite-difference check used to validate a hand-written backprop, usually in its symmetric (central) form, which steps to both sides of $a$: $f'(a) \approx \frac{f(a + h) - f(a - h)}{2h} \quad (\text{small } h).$ For smooth $f$ the central form has truncation error of order $h^2$, against order $h$ for the one-sided quotient in eq. 13.3. Either way, this is a limit you deliberately do not finish taking. It is trustworthy only in the window where $h$ is small enough to be accurate but not so small that floating-point error swamps it, a tension we make concrete below.
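To make the gradient check concrete, here is a minimal sketch under stated assumptions: a least-squares loss for a small linear model, its hand-derived gradient $X^\top(Xw - y)$, and a central-difference estimate taken one coordinate at a time. The loss, the function names, the random data, and the step $h = 10^{-5}$ are all illustrative choices, not something prescribed by the chapter.

```python
import numpy as np


def loss(w, X, y):
    # Half the sum of squared residuals for a linear model X @ w.
    r = X @ w - y
    return 0.5 * float(r @ r)


def analytic_grad(w, X, y):
    # Hand-derived gradient of the loss above: X^T (X w - y).
    return X.T @ (X @ w - y)


def numeric_grad(fn, w, h=1e-5):
    # Central difference along each coordinate, one partial at a time.
    g = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = h
        g[i] = (fn(w + step) - fn(w - step)) / (2 * h)
    return g


rng = np.random.default_rng(0)          # fixed seed so the run is repeatable
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
w = rng.normal(size=3)

g_exact = analytic_grad(w, X, y)
g_num = numeric_grad(lambda v: loss(v, X, y), w)
max_gap = float(np.max(np.abs(g_exact - g_num)))
print("analytic:", g_exact)
print("numeric: ", g_num)
print("largest coordinate gap:", max_gap)
```

A small gap between the two vectors is evidence that the hand-written gradient is right; a large gap on one coordinate usually points at a bug in the derivative for that parameter.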

ReLU is continuous everywhere but not differentiable at $0$ . The rectifier $\mathrm{ReLU}(x) = \max(0, x)$ is continuous at $0$ : the left limit, the right limit, and the value all equal $0$ , so it passes the continuity test. But the derivative limit fails there. Approaching from the right the difference quotient is $1$ (the graph is the line $y = x$ ); approaching from the left it is $0$ (the graph is flat). The two one-sided slopes disagree, so $\lim_{h\to 0}$ of the difference quotient does not exist — a kink. Frameworks handle this by picking a value from the valid range $[0, 1]$ (a subgradient); PyTorch and TensorFlow return $0$ at exactly $x = 0$ by convention. It works in practice because a single input landing exactly on $0$ is a measure-zero event, and the choice within $[0,1]$ rarely changes the direction of a step. Continuity buys you "no jumps"; differentiability is the stronger promise ReLU cannot keep at one point.
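The two one-sided slopes at the kink can be computed directly. The sketch below uses plain Python; `one_sided_slopes` and `relu_grad` are hypothetical helpers written for this illustration, and `relu_grad` only mimics the "return $0$ at $x = 0$" convention described above rather than reproducing any framework's code.

```python
def relu(x):
    return max(0.0, x)


def one_sided_slopes(f, a, h=1e-6):
    # Difference quotients with h > 0 (right) and with -h (left).
    right = (f(a + h) - f(a)) / h
    left = (f(a - h) - f(a)) / (-h)
    return left, right


def relu_grad(x):
    # Illustrative convention only: slope 1 for x > 0, and 0 for x <= 0,
    # which picks the subgradient 0 at the kink.
    return 1.0 if x > 0 else 0.0


left, right = one_sided_slopes(relu, 0.0)
print("left slope at 0: ", left)
print("right slope at 0:", right)
print("chosen gradient at 0:", relu_grad(0.0))
```

Away from $0$ the two one-sided slopes agree (try `one_sided_slopes(relu, 2.0)`), which is why the kink is the only point that needs a convention.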

NumPy: watch a limit converge — then break it

Let us numerically approach the derivative of $f(x) = x^2$ at $a = 3$ (true value $2a = 6$ ) by shrinking $h$ . First we see the difference quotient converge; then we push $h$ too small and watch floating-point catastrophic cancellation destroy the answer. Run it:

approach_a_limit.py

import numpy as np

# f(x) = x^2, whose exact derivative is f'(x) = 2x. At a = 3 the answer is 6.
def f(x):
  return x ** 2

a = 3.0
exact = 2 * a  # = 6.0, from the analytic cancellation h(2a + h)/h -> 2a

print("     h        estimate        error")
for h in [1e-1, 1e-2, 1e-4, 1e-8, 1e-12, 1e-15]:
  # forward difference quotient: (f(a+h) - f(a)) / h
  est = (f(a + h) - f(a)) / h
  print(f"{h:8.0e}   {est:12.8f}   {abs(est - exact):.3e}")

# Sweet spot: moderately small h. Estimate approaches 6 as h shrinks...
mid = (f(a + 1e-6) - f(a)) / 1e-6
assert abs(mid - exact) < 1e-3, "should be close for small-but-not-tiny h"

# ...but h = 1e-15 is a DISASTER: f(a+h) and f(a) are nearly identical floats,
# so their subtraction loses almost all significant digits (catastrophic
# cancellation), and dividing the noisy tiny difference by a tiny h amplifies it.
tiny = (f(a + 1e-15) - f(a)) / 1e-15
print("h=1e-15 estimate:", tiny, "(garbage, not ~6)")
assert abs(tiny - exact) > 0.1, "too-small h ruins the estimate"

# Analytic cancellation has no such problem: (a+h)^2 - a^2 = 2ah + h^2,
# so the quotient is exactly 2a + h with no subtraction of near-equal numbers.
h = 1e-15
analytic = 2 * a + h  # the cancelled form
assert np.isclose(analytic, exact), "the cancelled form stays exact"
print("cancelled form 2a + h:", analytic)

print("ok")

The table tells the whole story: the estimate marches toward $6$ as $h$ falls from $10^{-1}$ to about $10^{-8}$ , then reverses and degrades as $h$ keeps shrinking. The limit is mathematically exact at $h \to 0$ , but the floating-point evaluation has a floor. The algebraic cancellation $\frac{h(2a+h)}{h} = 2a + h$ sidesteps the whole problem — which is exactly why we cancel on paper before ever touching a computer.

Two traps: blind substitution and catastrophic cancellation

Do not plug the point in blindly. For an indeterminate $\frac{0}{0}$, substitution is uninformative, and stopping there hides a real answer. Always check for a common factor to cancel (or a limit rule) before concluding a limit fails to exist.

Do not make $h$ too small in floating point. Mathematically, $\frac{f(a+h)-f(a)}{h}$ gets more accurate as $h \to 0$, but numerically $f(a+h)$ and $f(a)$ become nearly equal doubles; subtracting them cancels their leading digits and leaves noise, which the division by tiny $h$ then magnifies. For a plain forward difference the error is smallest around $h \approx \sqrt{\varepsilon} \approx 10^{-8}$ in double precision (for inputs of order $1$), not at the smallest $h$ you can type. The mathematically ideal limit and the numerically safe step are different quantities.
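The $\sqrt{\varepsilon}$ rule of thumb can be read straight off NumPy's description of double precision. This sketch compares a forward difference at $h = \sqrt{\varepsilon}$ with one at $h = 10^{-15}$ for $f(x) = x^2$ at $a = 3$; the helper names are illustrative, and for other functions or input scales the best step will shift.

```python
import numpy as np


def forward_diff(f, a, h):
    return (f(a + h) - f(a)) / h


def square(x):
    return x * x


a, exact = 3.0, 6.0
eps = np.finfo(np.float64).eps          # spacing of doubles just above 1.0
h_safe = float(np.sqrt(eps))            # about 1.5e-8
h_tiny = 1e-15

err_safe = abs(forward_diff(square, a, h_safe) - exact)
err_tiny = abs(forward_diff(square, a, h_tiny) - exact)
print(f"eps = {eps:.3e}, sqrt(eps) = {h_safe:.3e}")
print(f"error with h = sqrt(eps): {err_safe:.3e}")
print(f"error with h = 1e-15:     {err_tiny:.3e}")
```

The step near $\sqrt{\varepsilon}$ keeps the estimate close to $6$, while the far smaller step does much worse, which is the opposite of what the pure mathematics suggests.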

Summary

A limit $\lim_{x\to a} f(x) = L$ says $f(x)$ approaches $L$ as $x$ nears $a$ ; it is about the approach, not the value $f(a)$ , which may differ or not exist.
The two-sided limit exists iff the left- and right-hand limits both exist and agree. Disagreement is a jump; one-sided limits that agree with each other but not with $f(a)$ (or an undefined $f(a)$) give a removable hole.
Continuity at $a$ means $f(a)$ exists, the limit exists, and they are equal. The three failures are removable, jump, and infinite discontinuities.
For an indeterminate $\frac{0}{0}$ , substitution is uninformative; factor and cancel the vanishing term, then take the limit of what remains.
The derivative is the limit of the difference quotient $f'(a) = \lim_{h\to 0}\frac{f(a+h)-f(a)}{h}$ — itself a $\frac{0}{0}$ resolved by cancelling $h$ . Differentiability requires this two-sided limit to exist.
In ML, gradients are these limits; ReLU is continuous but non-differentiable at $0$ (frameworks use a subgradient), and finite-difference checks fail if $h$ is pushed too small (catastrophic cancellation).

Active recall

Answer from memory before checking the lesson:

State the three conditions for $f$ to be continuous at $a$ . Which one fails for a removable discontinuity?
Evaluate $\displaystyle\lim_{x\to 2}\frac{x^2 - 4}{x - 2}$ by factoring. Why is cancelling $(x-2)$ legal even though it is zero at $x = 2$ ?
Write the definition of $f'(a)$ as a limit. What indeterminate form does the difference quotient take at $h = 0$ , and how is it resolved?
ReLU is continuous at $0$ but not differentiable there. Explain both halves in terms of one-sided limits.
Why does making $h$ smaller eventually make a finite-difference gradient estimate worse rather than better?

Exercises

Level A: Recall & basic calculation

Level A | Hand calculation | ch13-A1

Limit by direct substitution

Evaluate $\displaystyle\lim_{x \to 2} (3x + 1)$ . (The function is a polynomial, so it is continuous everywhere.)

Level A | Hand calculation | ch13-A2

A 0/0 limit by factoring

Evaluate $\displaystyle\lim_{x \to 3} \frac{x^2 - 9}{x - 3}$ . (Substitution gives $0/0$ ; factor first.)

Level A | Equation interpretation | ch13-A3

One-sided limits and existence

For a piecewise function, $\lim_{x\to a^-} f(x) = 4$ and $\lim_{x\to a^+} f(x) = 4$ , but $f(a) = 9$ . What is $\lim_{x\to a} f(x)$ ?

Level A | Hand calculation | ch13-A4

Difference quotient of a linear function

For $f(x) = 5x + 2$ , the difference quotient $\frac{f(a+h)-f(a)}{h}$ simplifies to a constant. What is $f'(a)$ (the limit as $h\to 0$ )?

Level A | Equation interpretation | ch13-A5

Classify the discontinuity

A step function has $\lim_{x\to a^-} f(x) = 0$ and $\lim_{x\to a^+} f(x) = 1$ . Which type of discontinuity is at $a$ ?

Level A | Equation interpretation | ch13-A6

Continuity of ReLU at zero

Is $\mathrm{ReLU}(x) = \max(0, x)$ continuous at $x = 0$ ? Enter $1$ for yes, $0$ for no.

Level B: Conceptual understanding

Level B | Equation interpretation | ch13-B1

Limit vs. value

Which statement best captures why $\lim_{x\to a} f(x)$ can exist even when $f(a)$ is undefined?

Level B | ML application | ch13-B2

Why ReLU is not differentiable at 0

Explain, in terms of one-sided limits of the difference quotient, why $\mathrm{ReLU}$ is not differentiable at $0$ even though it is continuous there. What do frameworks do at that point?

Level B | Equation interpretation | ch13-B3

The indeterminate form in the derivative

At $h = 0$ , the difference quotient $\frac{f(a+h)-f(a)}{h}$ has which form, and how is a finite derivative recovered from it?

Level B | ML application | ch13-B4

Why the finite-difference step can't be too small

A colleague validates a gradient with $\frac{f(a+h)-f(a)}{h}$ and reasons: 'smaller $h$ is always closer to the true limit, so I'll use $h = 10^{-15}$ .' Explain why this makes the numerical estimate worse, not better.

Level C: Derivation & implementation

Level C | Derivation | ch13-C1

Derive the derivative of a cubic

Using the limit definition $f'(a) = \lim_{h\to 0}\frac{f(a+h)-f(a)}{h}$ , derive $f'(a)$ for $f(x) = x^3$ . Show the $\frac{0}{0}$ cancellation explicitly.

Level C | NumPy implementation | ch13-C2

Numerically approach a limit

Write numeric_derivative(f, a, h) returning the forward difference quotient $\frac{f(a+h)-f(a)}{h}$ . For $f(x)=x^2$ at $a=3$ , print the estimate for $h = 10^{-1}, 10^{-2}, 10^{-4}, 10^{-6}$ , assert the $h=10^{-6}$ estimate is within $10^{-3}$ of the exact value $6$ , then print ok.

Level C | NumPy implementation | ch13-C3

Detect a discontinuity numerically

For the step function $f(x) = 0$ if $x < 0$ else $1$ , estimate the left and right limits at $a=0$ by evaluating $f$ at $a-h$ and $a+h$ for a shrinking sequence of $h$ . Print both estimates, assert they differ (confirming a jump discontinuity), and print ok.

Level D: Research-thinking challenge

Level D | Paper-reading practice | ch13-D1

Subgradients and where non-differentiability bites

Deep networks are trained with gradient descent, yet ReLU, max-pooling, and the L1 penalty $|w|$ are all non-differentiable at isolated points. Explain why training still works (invoke measure-zero and subgradients), then give one concrete situation where a non-differentiable point does cause real trouble and how practitioners mitigate it.