Part 4 · CalculusChapter 1680 min

Optimization Intuition

Minima, gradient descent, learning rates, and convexity

Learning objectives

  • Characterize minima/maxima with first and second derivatives
  • State and run the gradient-descent update rule
  • Diagnose convergence, slow learning, oscillation, and divergence
  • Explain why convexity makes optimization easy

Why every model is a minimization problem

Training a machine-learning model sounds like a bespoke, almost magical process: show it data, and somehow it "learns." Strip away the mystique and what remains is startlingly uniform. You write down a single number that says how wrong the model currently is — the loss — and then you nudge the model's parameters, over and over, in whatever direction makes that number go down. That is the whole game.

Linear regression, logistic regression, a 400-billion-parameter transformer: all three are the same sentence with different nouns. Each defines a loss function L(θ)L(\theta) of its parameters θ\theta, and each is trained by minimizing it. So the single most useful thing you can carry out of the calculus we have built is this: learning is optimization. This chapter is where derivatives stop being an exercise and become the engine.

Our goal is intuition first, then the one algorithm — gradient descent — that powers essentially all of modern deep learning, and an honest account of when it works, when it stalls, and when it explodes.

Intuition: rolling downhill in the fog

Picture the loss as a landscape. The horizontal position is your parameter vector θ\theta; the height above each point is the loss L(θ)L(\theta). Training means finding the lowest valley. Now add one cruel constraint: it is foggy, and you can only feel the slope under your feet — you cannot see the whole terrain.

What would you do? You would feel which way is downhill and take a step that way. Then feel again, step again. The slope is exactly the gradient L\nabla L, and "downhill" is its negative, L-\nabla L. The size of each step is the learning rate η\eta. Too timid and you inch along for hours; too bold and you leap clear over the valley and land higher up the far wall. The entire art of training lives in that tension.

Interactive LabGradient-Descent Visualizer
Loading interactive lab…

Drag the starting point and the learning-rate slider above. Watch the ball. With a sensible step it settles into the valley in a few bounces; crank η\eta up and it starts flinging itself outward, further every step. Hold both pictures — the smooth bowl and the runaway ball — in your head as we make them precise.

Formal definitions

The two tests you already own from single-variable calculus decide the matter. The first-derivative test locates candidates: solve f(x)=0f'(x) = 0. The second-derivative test classifies them:

The sign of ff'' is curvature: f>0f'' > 0 means the graph curves up like a bowl (a minimum sits at the bottom), f<0f'' < 0 means it curves down like a dome. When f(x\*)=0f''(x^\*) = 0 the test is inconclusive.

Worked example: minimizing x2x^2 by hand

Take the simplest possible bowl, f(x)=x2f(x) = x^2. Its derivative is f(x)=2xf'(x) = 2x, so the only critical point is 2x=0x=02x = 0 \Rightarrow x = 0. The second derivative is f(x)=2>0f''(x) = 2 > 0 everywhere, so that critical point is a minimum — as we knew. Now let gradient descent rediscover it.

Because the map is x(12η)xx \mapsto (1 - 2\eta)\,x, the entire behavior is governed by the single number 12η|1 - 2\eta|: strictly less than 11 and xx shrinks to zero; equal to 11 and it never settles; greater than 11 and it grows without bound. We will see all three in the code.

Why L-\nabla L is the right direction

Why step against the gradient rather than along it, or sideways? Because of the first-order Taylor expansion. Move from θ\theta by a small displacement Δθ\Delta\theta; the loss changes, to first order, by

We want the change L(θ)Δθ\nabla L(\theta)^\top \Delta\theta to be as negative as possible. Substituting the gradient-descent step Δθ=ηL(θ)\Delta\theta = -\eta\,\nabla L(\theta),

The four learning-rate regimes

For our quadratic the update factor is 12η1 - 2\eta, and 12η<1|1 - 2\eta| < 1 holds exactly when 0<η<10 < \eta < 1. Sweeping η\eta traces out four qualitatively different behaviors that recur — messier, but recognizable — in real training:

  • Convergence (good η\eta). Steady, monotone approach to the minimum. Here 0<η<0.50 < \eta < 0.5: each step moves the same direction, shrinking. This is the target.
  • Slow (too small η\eta). Also converges, but crawls — η=0.01\eta = 0.01 needs hundreds of steps to cover what a good rate does in ten. Correct, wasteful.
  • Oscillation (borderline large η\eta). Steps overshoot the minimum and bounce to the other side, either decaying slowly (0.5<η<10.5 < \eta < 1) or, at η=1\eta = 1, bouncing forever between +4+4 and 4-4 without progress.
  • Divergence (too large η\eta). For η>1\eta > 1 every step overshoots by more than it started; x|x| grows geometrically and the loss explodes to infinity — in real networks this surfaces as NaN.

ML use case: training is gradient descent, tuning is choosing η\eta

When you call .fit() or run a training loop, this is literally what happens under the hood. The loss is (say) mean squared error or cross-entropy over a batch of data; θ\theta is the millions or billions of network weights; the gradient L\nabla L is computed by backpropagation (the chain rule from the previous chapters, applied across the whole computational graph); and each optimizer step is a variant of θθηL\theta \leftarrow \theta - \eta\,\nabla L. Modern optimizers like Adam decorate this with momentum and per-parameter scaling, but the skeleton is unchanged.

Which means the learning rate η\eta is not an incidental knob — it is the hinge the whole procedure swings on. Set it well and training converges in reasonable time; set it a few times too large and the loss diverges to NaN in the first few steps; set it too small and you burn a week of GPU time to half-train a model. Practitioners spend real effort on learning-rate schedules (warm up, then decay) precisely because equation 16.3's "small enough η\eta" is a moving target as the landscape's curvature changes during training.

NumPy: gradient descent across the four regimes

Let us make the regimes concrete. We minimize L(x)=x2L(x) = x^2 from x0=4x_0 = 4 for a sweep of learning rates, print the final position and the governing factor 12η|1 - 2\eta|, and assert that a good rate converges to the minimum while a large one diverges.

gradient_descent_1d.py

Read the printed trajectory line by line. At η=0.01\eta = 0.01 the factor is 0.980.98 and xx barely moves — the slow regime. At η=0.1\eta = 0.1 and 0.50.5 the factor is below one and xx lands essentially on 00 — clean convergence. At η=1.0\eta = 1.0 the factor is exactly 11 and xx ends right back near ±4\pm 4 — perpetual oscillation. At η=1.5\eta = 1.5 the factor is 22 and xx has ballooned past a thousand — divergence. One function, four fates, decided by a single scalar.

Non-convex, yet it works

Here is the honest tension. The clean guarantees — one global minimum, descent provably reaches it — hold only for convex losses. Deep networks are wildly non-convex: their loss surfaces have astronomically many critical points. By the theory we should expect to get stuck. In practice, large networks train remarkably well. Why?

The current understanding, briefly and honestly: in very high dimensions, bad local minima turn out to be rare — most critical points are saddle points, not minima, and the many local minima that do exist tend to have losses close to the global best, so landing in one is usually fine. Stochasticity (using a random mini-batch each step) adds noise that helps escape saddles and shallow traps. None of this is a theorem for real networks; it is empirical pattern plus partial theory. The practical upshot is liberating: you do not need to find the global optimum to get a useful model — you just need to get somewhere good enough, and plain gradient descent, run at a sensible learning rate, reliably does.

Research-paper equation practice

Summary

  • Learning is optimization. Training a model means minimizing a loss L(θ)L(\theta): find θ\*=argminθL(θ)\theta^\* = \arg\min_\theta L(\theta).
  • Critical points and tests. Extrema satisfy f(x)=0f'(x) = 0 (first-derivative test); f>0f'' > 0 marks a minimum and f<0f'' < 0 a maximum (second-derivative test), the sign of ff'' being curvature.
  • Gradient descent iterates θt+1=θtηL(θt)\theta_{t+1} = \theta_t - \eta\,\nabla L(\theta_t), stepping along the steepest-descent direction L-\nabla L. To first order each step changes the loss by ηL20-\eta\lVert\nabla L\rVert^2 \le 0, so a small enough η\eta guarantees a decrease.
  • The learning rate η\eta sets the regime: good η\eta converges, too-small η\eta crawls, borderline η\eta oscillates, too-large η\eta diverges (NaN). It is the single most important hyperparameter.
  • Convex vs non-convex. A convex loss is one bowl — descent provably reaches its global minimum. Deep learning is non-convex, with local minima and saddle points, yet trains well in practice because bad minima are rare in high dimensions and "good enough" is enough.

Active recall

Answer from memory before checking the lesson:

  1. Write the gradient-descent update rule and name every symbol in it.
  2. State the first- and second-derivative tests for a local minimum of f(x)f(x).
  3. Using the Taylor expansion, explain in one sentence why stepping along L-\nabla L decreases the loss for small η\eta.
  4. Name the four learning-rate regimes and what happens in each.
  5. Deep learning is non-convex, so why does gradient descent work anyway? Give the honest, one-paragraph answer.

Exercises

Level ARecall & basic calculation

Level BConceptual understanding

Level CDerivation & implementation

Level DResearch-thinking challenge