Optimization Intuition
Minima, gradient descent, learning rates, and convexity
Prerequisites
Learning objectives
- Characterize minima/maxima with first and second derivatives
- State and run the gradient-descent update rule
- Diagnose convergence, slow learning, oscillation, and divergence
- Explain why convexity makes optimization easy
Why every model is a minimization problem
Training a machine-learning model sounds like a bespoke, almost magical process: show it data, and somehow it "learns." Strip away the mystique and what remains is startlingly uniform. You write down a single number that says how wrong the model currently is — the loss — and then you nudge the model's parameters, over and over, in whatever direction makes that number go down. That is the whole game.
Linear regression, logistic regression, a 400-billion-parameter transformer: all three are the same sentence with different nouns. Each defines a loss function of its parameters , and each is trained by minimizing it. So the single most useful thing you can carry out of the calculus we have built is this: learning is optimization. This chapter is where derivatives stop being an exercise and become the engine.
Our goal is intuition first, then the one algorithm — gradient descent — that powers essentially all of modern deep learning, and an honest account of when it works, when it stalls, and when it explodes.
Intuition: rolling downhill in the fog
Picture the loss as a landscape. The horizontal position is your parameter vector ; the height above each point is the loss . Training means finding the lowest valley. Now add one cruel constraint: it is foggy, and you can only feel the slope under your feet — you cannot see the whole terrain.
What would you do? You would feel which way is downhill and take a step that way. Then feel again, step again. The slope is exactly the gradient , and "downhill" is its negative, . The size of each step is the learning rate . Too timid and you inch along for hours; too bold and you leap clear over the valley and land higher up the far wall. The entire art of training lives in that tension.
Drag the starting point and the learning-rate slider above. Watch the ball. With a sensible step it settles into the valley in a few bounces; crank up and it starts flinging itself outward, further every step. Hold both pictures — the smooth bowl and the runaway ball — in your head as we make them precise.
Formal definitions
The two tests you already own from single-variable calculus decide the matter. The first-derivative test locates candidates: solve . The second-derivative test classifies them:
The sign of is curvature: means the graph curves up like a bowl (a minimum sits at the bottom), means it curves down like a dome. When the test is inconclusive.
| Symbol | Meaning | Type | Shape | Role |
|---|---|---|---|---|
| Loss / objective — how wrong the model is | function | \mathbb{R}^n\to\mathbb{R} | objective | |
| Parameter vector being optimized | vector | n×1 | variable | |
| Gradient — direction of steepest ascent | vector | n×1 | derivative | |
| Learning rate (step size), a positive scalar | scalar | 1 | hyperparameter | |
| A minimizer, the arg-min of L | vector | n×1 | solution | |
| Second derivative — curvature | scalar | 1 | test |
Worked example: minimizing by hand
Take the simplest possible bowl, . Its derivative is , so the only critical point is . The second derivative is everywhere, so that critical point is a minimum — as we knew. Now let gradient descent rediscover it.
Because the map is , the entire behavior is governed by the single number : strictly less than and shrinks to zero; equal to and it never settles; greater than and it grows without bound. We will see all three in the code.
Why is the right direction
Why step against the gradient rather than along it, or sideways? Because of the first-order Taylor expansion. Move from by a small displacement ; the loss changes, to first order, by
We want the change to be as negative as possible. Substituting the gradient-descent step ,
The four learning-rate regimes
For our quadratic the update factor is , and holds exactly when . Sweeping traces out four qualitatively different behaviors that recur — messier, but recognizable — in real training:
- Convergence (good ). Steady, monotone approach to the minimum. Here : each step moves the same direction, shrinking. This is the target.
- Slow (too small ). Also converges, but crawls — needs hundreds of steps to cover what a good rate does in ten. Correct, wasteful.
- Oscillation (borderline large ). Steps overshoot the minimum and bounce to the other side, either decaying slowly () or, at , bouncing forever between and without progress.
- Divergence (too large ). For every step overshoots by more
than it started; grows geometrically and the loss explodes to infinity — in
real networks this surfaces as
NaN.
ML use case: training is gradient descent, tuning is choosing
When you call .fit() or run a training loop, this is literally what happens under
the hood. The loss is (say) mean squared error or cross-entropy over a batch of data;
is the millions or billions of network weights; the gradient is
computed by backpropagation (the chain rule from the previous chapters, applied
across the whole computational graph); and each optimizer step is a variant of
. Modern optimizers like Adam decorate this
with momentum and per-parameter scaling, but the skeleton is unchanged.
Which means the learning rate is not an incidental knob — it is the hinge the
whole procedure swings on. Set it well and training converges in reasonable time; set
it a few times too large and the loss diverges to NaN in the first few steps; set
it too small and you burn a week of GPU time to half-train a model. Practitioners
spend real effort on learning-rate schedules (warm up, then decay) precisely
because equation 16.3's "small enough " is a moving target as the landscape's
curvature changes during training.
NumPy: gradient descent across the four regimes
Let us make the regimes concrete. We minimize from for a sweep of learning rates, print the final position and the governing factor , and assert that a good rate converges to the minimum while a large one diverges.
Read the printed trajectory line by line. At the factor is and barely moves — the slow regime. At and the factor is below one and lands essentially on — clean convergence. At the factor is exactly and ends right back near — perpetual oscillation. At the factor is and has ballooned past a thousand — divergence. One function, four fates, decided by a single scalar.
Non-convex, yet it works
Here is the honest tension. The clean guarantees — one global minimum, descent provably reaches it — hold only for convex losses. Deep networks are wildly non-convex: their loss surfaces have astronomically many critical points. By the theory we should expect to get stuck. In practice, large networks train remarkably well. Why?
The current understanding, briefly and honestly: in very high dimensions, bad local minima turn out to be rare — most critical points are saddle points, not minima, and the many local minima that do exist tend to have losses close to the global best, so landing in one is usually fine. Stochasticity (using a random mini-batch each step) adds noise that helps escape saddles and shallow traps. None of this is a theorem for real networks; it is empirical pattern plus partial theory. The practical upshot is liberating: you do not need to find the global optimum to get a useful model — you just need to get somewhere good enough, and plain gradient descent, run at a sensible learning rate, reliably does.
Research-paper equation practice
The gradient-descent update
The single line at the heart of training every neural network. You will see it, or a momentum/Adam variant of it, in essentially every optimization paper.
Work through these steps:
- Identify every symbol.
- State the type of every object (scalar, vector, matrix, index, set, function).
- State the dimensions / shapes.
- Rewrite the equation in plain English.
- Expand it for a tiny concrete example.
- Identify the assumptions.
- Convert it to pseudocode.
- Implement it in NumPy.
- Explain its machine-learning purpose.
The minimization objective
The problem statement that opens most machine-learning method sections: 'we train by minimizing the following objective.'
Work through these steps:
- Identify every symbol.
- State the type of every object (scalar, vector, matrix, index, set, function).
- State the dimensions / shapes.
- Rewrite the equation in plain English.
- Expand it for a tiny concrete example.
- Identify the assumptions.
- Convert it to pseudocode.
- Implement it in NumPy.
- Explain its machine-learning purpose.
Summary
- Learning is optimization. Training a model means minimizing a loss : find .
- Critical points and tests. Extrema satisfy (first-derivative test); marks a minimum and a maximum (second-derivative test), the sign of being curvature.
- Gradient descent iterates , stepping along the steepest-descent direction . To first order each step changes the loss by , so a small enough guarantees a decrease.
- The learning rate sets the regime: good converges, too-small
crawls, borderline oscillates, too-large diverges (
NaN). It is the single most important hyperparameter. - Convex vs non-convex. A convex loss is one bowl — descent provably reaches its global minimum. Deep learning is non-convex, with local minima and saddle points, yet trains well in practice because bad minima are rare in high dimensions and "good enough" is enough.
Active recall
Answer from memory before checking the lesson:
- Write the gradient-descent update rule and name every symbol in it.
- State the first- and second-derivative tests for a local minimum of .
- Using the Taylor expansion, explain in one sentence why stepping along decreases the loss for small .
- Name the four learning-rate regimes and what happens in each.
- Deep learning is non-convex, so why does gradient descent work anyway? Give the honest, one-paragraph answer.
Exercises
Level ARecall & basic calculation
Find the critical point
Find the critical point of by solving .
One gradient-descent step by hand
Minimize with gradient descent. Starting from with learning rate , compute the value of after one update .
Classify with the second derivative
At a critical point a function satisfies and . What kind of point is it?
The update factor for a quadratic
For , one gradient-descent step is . Compute the multiplicative factor for .
Name the learning rate
In the gradient-descent update , which symbol is the learning rate?
Evaluate a gradient
For the loss , compute the gradient at .
Level BConceptual understanding
Diagnose a diverging run
You train a model and the loss increases every step until it prints NaN. Of the following, which single change is the most likely fix?
What convexity guarantees
The loss is convex (a single bowl). Which statement is true about running gradient descent with a small enough learning rate?
Why $f' = 0$ is not enough
Gradient descent slows to a near-stop because the gradient is almost zero, yet the loss is still high. Which explanation is consistent with this?
The too-small learning rate
A colleague sets 'to be safe' and reports that after a full day of training the loss has barely moved, though it is slowly decreasing. In one or two sentences, explain what regime this is and the practical trade-off of a very small .
Why the minus sign?
The update is — note the minus sign. Why do we step along rather than ?
Level CDerivation & implementation
Implement 1-D gradient descent
Write gradient_descent(grad, x0, eta, steps) that runs gradient descent in 1-D and returns the final . Use it to minimize (whose gradient is ) starting from with for 200 steps, assert the result is within of , then print ok.
Derive the convergence condition
For , gradient descent is . Derive the exact range of learning rates for which , and identify the value of that converges fastest.
Minimizer of a general quadratic
Using the first- and second-derivative tests, derive the location of the minimum of with , and confirm it is a minimum.
Experiment: convergence vs divergence
Minimize from . Run gradient descent for 20 steps with and again with . Print each final , assert that the first converges (near ) while the second diverges (magnitude above ), then print ok.
Level DResearch-thinking challenge
Non-convex, yet it works
Classical optimization theory guarantees gradient descent reaches the global minimum only for convex losses. Deep-network losses are highly non-convex, so we should expect descent to get trapped in bad local minima — yet in practice large models train well. Give the honest, current explanation for why, and state one concrete consequence for how practitioners think about training (e.g. what they do or do not worry about).
Why saddles, not minima, dominate in high dimensions
A critical point () is a local minimum only if the curvature is positive in every direction. Using this fact, argue heuristically why, as the number of parameters grows, a random critical point is far more likely to be a saddle point than a local minimum. Then explain why this makes escaping such points a matter of finding a downhill direction rather than being truly stuck.