Powers, Roots, Exponentials, and Logarithms
The machinery behind loss functions and learning rates
Prerequisites
Learning objectives
- Apply the laws of exponents and roots fluently
- Explain why e and ln are the 'natural' choices in ML
- Convert between exponential and logarithmic form
- Use log-space reasoning for products of probabilities
Why exponentials and logs run underneath ML
Two functions show up on almost every page of a machine-learning derivation, and they are inverses of each other. The exponential turns sums into products and appears the instant you write a softmax, a Gaussian, or a growth process. The logarithm runs the machine in reverse: it turns products into sums, and that single algebraic fact is why we train models by maximizing log-likelihood instead of likelihood, why loss functions are cross-entropies, and why a pipeline that multiplies ten thousand probabilities does not silently collapse to zero.
Before any of that, though, both functions are just the top of a short ladder you already half-know from a scientific calculator:
- Powers: repeated multiplication, $b^n$.
- Roots: powers with fractional exponents, $b^{1/n} = \sqrt[n]{b}$.
- Exponentials: a fixed base raised to a variable, $b^x$, with $b = e$ the natural choice.
- Logarithms: the inverse question, "$b$ to what power gives $x$?"
This chapter climbs that ladder and stops at the one identity you will reuse more than any other: $\ln(ab) = \ln a + \ln b$.
Intuition: logs turn "times" into "plus"
Here is the whole idea in one sentence. Multiplying is expensive and, on a computer, dangerous: multiply enough small numbers and the result rounds to exactly zero. Adding is cheap and safe. A logarithm is the dictionary that translates every multiplication into an addition:

$$\ln(ab) = \ln a + \ln b$$

The exponential is the same dictionary read backwards: it translates addition back into multiplication, $e^{a+b} = e^{a} e^{b}$. That is all an exponential and a logarithm do: they are a matched pair of translators between the additive world and the multiplicative world. Everything else in this chapter is a consequence of that one relationship.
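The two translations can be checked numerically. The following is a minimal sketch using Python's standard `math` module; the values of `a` and `b` are arbitrary choices for illustration.

```python
import math

a, b = 3.0, 7.0  # arbitrary positive numbers, chosen only for illustration

# Logarithm: a product inside the log becomes a sum of logs.
log_of_product = math.log(a * b)
sum_of_logs = math.log(a) + math.log(b)

# Exponential: a sum in the exponent becomes a product of exponentials.
exp_of_sum = math.exp(a + b)
product_of_exps = math.exp(a) * math.exp(b)

# Floating-point results agree up to rounding, so compare with a tolerance.
print(math.isclose(log_of_product, sum_of_logs))
print(math.isclose(exp_of_sum, product_of_exps))
```

Comparing with `math.isclose` rather than `==` is deliberate: the two sides are equal mathematically but may differ in the last bit after rounding.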
The base $e$ is the "natural" choice for a reason we will make precise: it is the base whose exponential is its own rate of change, so calculus (and therefore gradient-based learning) stays clean. Before the formalities, get a feel for the shapes of the two curves.
The exponential never dips to zero or below ($e^x > 0$ for every real $x$), and it explodes for positive $x$. The logarithm is its mirror image across the line $y = x$: it is defined only for $x > 0$, crosses the axis at $x = 1$, and dives to $-\infty$ as $x \to 0^+$. That last fact is the reason "the log of a tiny probability is a large negative number," which is exactly how a loss function should behave.
Formal definitions
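For a positive integer $n$, the power $b^n$ is defined as repeated multiplication:

$$b^n = \underbrace{b \cdot b \cdots b}_{n \text{ factors}}$$

From that single definition, three laws of exponents follow by just counting factors:

$$b^m \, b^n = b^{m+n}, \qquad \frac{b^m}{b^n} = b^{m-n}, \qquad (b^m)^n = b^{mn} \tag{3.1}$$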
Fractional exponents are defined to make the third law keep working. If we want $(b^{1/n})^n = b^{(1/n) \cdot n} = b$, then $b^{1/n}$ must be the number that gives $b$ when raised to the $n$-th power, that is, the $n$-th root $\sqrt[n]{b}$.
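A quick numerical sanity check of these statements, as a sketch. It assumes the three laws in question are the product, quotient, and power laws, and the values of `b`, `m`, and `n` are arbitrary.

```python
import math

b, m, n = 2.0, 5, 3  # arbitrary base and integer exponents for illustration

# The three exponent laws, checked up to floating-point rounding.
product_law = math.isclose(b**m * b**n, b ** (m + n))
quotient_law = math.isclose(b**m / b**n, b ** (m - n))
power_law = math.isclose((b**m) ** n, b ** (m * n))
print(product_law, quotient_law, power_law)

# A root written as a fractional power: 27 ** (1/3) is the cube root of 27.
cube_root = 27.0 ** (1 / 3)

# Raising the root back to the n-th power recovers the original number.
print(math.isclose(cube_root**3, 27.0))
```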
| Symbol | Meaning | Type | Shape | Role |
|---|---|---|---|---|
| $b^n$ | Base $b$ raised to power $n$ | scalar | 1 | variable |
| $b^{1/n}$ | $n$-th root of $b$ (fractional power) | scalar | 1 | variable |
| $e$ | Euler's number, approx. 2.71828 (the natural base) | scalar | 1 | fixed |
| $e^x$ | The natural exponential | function | 1 | fixed |
| $\ln x$ | Natural logarithm (inverse of exp) | function | 1 | fixed |
| $\log_b x$ | Logarithm base $b$: the power giving $x$ | function | 1 | fixed |
Why $e$ is the "natural" base
Among all bases $b > 0$, exactly one makes the exponential its own derivative: $\frac{d}{dx} e^x = e^x$. For any other base, $\frac{d}{dx} b^x = b^x \ln b$ picks up a stray constant factor $\ln b$. Choosing $b = e$ sets that factor to $\ln e = 1$, so growth rate equals value. Equivalently, $e = \lim_{n \to \infty} (1 + 1/n)^n$ is the limit of compounding infinitely often, which is why continuously-compounded interest, radioactive decay, and exponential learning-rate schedules are all written with $e$. When derivatives stay clean, so do gradients, and gradients are how models learn.
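Both claims can be observed numerically. The sketch below uses a hypothetical helper, `central_difference`, to estimate the slope of $b^x$ and compares slope divided by value against $\ln b$. The evaluation point, step size, and the bases tried are arbitrary choices.

```python
import math

def central_difference(f, x, h=1e-6):
    """Approximate f'(x) with a symmetric finite difference."""
    return (f(x + h) - f(x - h)) / (2 * h)

x = 1.5  # arbitrary evaluation point
ratios = {}
for base in (2.0, math.e, 10.0):
    slope = central_difference(lambda t: base**t, x)
    # slope / value estimates the constant factor ln(base); it is 1 only for e.
    ratios[base] = slope / base**x
    print(f"base={base:.5f}  slope/value={ratios[base]:.6f}  ln(base)={math.log(base):.6f}")

# Compounding n times at rate 1/n approaches e as n grows.
n = 10**6
compound = (1 + 1 / n) ** n
print(compound, math.e)
```

Only the base `math.e` yields a ratio of 1; every other base carries the factor $\ln b$.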
The log rules
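Because $\ln$ is the inverse of a function that turns sums into products, it turns products back into sums. The three rules you must know cold:

$$\ln(ab) = \ln a + \ln b, \qquad \ln\!\left(\frac{a}{b}\right) = \ln a - \ln b, \qquad \ln(a^k) = k \ln a \tag{3.2}$$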
To move between bases (e.g. from $\ln$ to $\log_b$), use change of base:

$$\log_b x = \frac{\ln x}{\ln b}$$
These hold for any valid base, not just $e$; the product, quotient, and power rules are the same shape whether you write $\ln$, $\log_2$, or $\log_{10}$.
A worked numerical example
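One way to make the rules concrete is to plug in numbers. The sketch below is illustrative only: the values 8 and 4 and the exponent 3 are assumptions chosen because they are powers of 2, which makes the base-2 result easy to verify by hand ($2^3 = 8$).

```python
import math

a, b = 8.0, 4.0  # illustrative values, both powers of 2

# Product rule: ln(8 * 4) equals ln 8 + ln 4.
product_ok = math.isclose(math.log(a * b), math.log(a) + math.log(b))

# Quotient rule: ln(8 / 4) equals ln 8 - ln 4.
quotient_ok = math.isclose(math.log(a / b), math.log(a) - math.log(b))

# Power rule: ln(8 ** 3) equals 3 * ln 8.
power_ok = math.isclose(math.log(a**3), 3 * math.log(a))

# Change of base: log2(8) computed from natural logs only.
log2_of_a = math.log(a) / math.log(2)
change_of_base_ok = math.isclose(log2_of_a, math.log2(a))

print(product_ok, quotient_ok, power_ok, change_of_base_ok)
print(log2_of_a)
```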
Deriving a log rule from an exponent law
The log rules are not new facts; each is an exponent law from eq. 3.1 seen through the inverse. Here is the product rule, derived from $e^{u} e^{v} = e^{u+v}$.
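A sketch of the derivation, assuming the exponent law in question is the product law with base $e$. The names $u$ and $v$ are introduced here for illustration.

```latex
\begin{align*}
&\text{Let } a, b > 0 \text{ and set } u = \ln a,\; v = \ln b,
  \text{ so that } a = e^{u} \text{ and } b = e^{v}. \\
&ab = e^{u} e^{v} = e^{u+v}
  && \text{(product law of exponents)} \\
&\ln(ab) = \ln\!\left(e^{u+v}\right) = u + v
  && \text{($\ln$ undoes $\exp$)} \\
&\ln(ab) = \ln a + \ln b
  && \text{(substitute back $u$ and $v$)}
\end{align*}
```

The only ingredients are the exponent law and the fact that $\ln$ and $\exp$ are inverses; the quotient and power rules follow the same pattern from the other two exponent laws.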
ML use case: log-likelihood and avoiding underflow
Suppose a model assigns probability $p_i$ to the correct label of the $i$-th example, and the $N$ examples are independent. The probability of getting the whole dataset right (the likelihood) is the product

$$L = \prod_{i=1}^{N} p_i$$
Each $p_i$ lies in $[0, 1]$, so this product shrinks fast. For instance, with $N = 10{,}000$ examples each around $p_i \approx 0.5$, we get $0.5^{10000} \approx 10^{-3010}$, which is far below the smallest positive number a 64-bit float can represent (about $5 \times 10^{-324}$). The computed likelihood underflows to exactly $0$, and $\ln 0 = -\infty$ destroys the gradient. Multiplication is the enemy.
Take the logarithm and the product rule (eq. 3.2) turns it into a sum, the log-likelihood:

$$\ln \prod_{i=1}^{N} p_i = \sum_{i=1}^{N} \ln p_i$$
Now each term is a modest negative number (for $p_i = 0.5$, $\ln p_i \approx -0.693$), and their sum is perfectly representable. This is why training objectives are written in log-space:
- Cross-entropy loss is exactly the negative log-likelihood, $-\frac{1}{N} \sum_{i} \ln p_i$ (averaged), so minimizing it maximizes the probability of the data.
- Softmax produces those probabilities from logits $z$ as $e^{z_k} / \sum_j e^{z_j}$; its denominator is a sum of exponentials, and taking $\ln$ of the whole thing (the log-sum-exp operation, $\ln \sum_j e^{z_j}$) is done with a numerically stable trick: subtract $\max_j z_j$ first so no $e^{z_j}$ overflows.
- Learning-rate decay often uses the exponential $\eta_t = \eta_0 e^{-kt}$, a smooth multiplicative shrink that is a straight line in log-space.
The through-line: products of probabilities become sums of log-probabilities, and that conversion is both algebraically convenient (sums differentiate term by term) and numerically essential (sums do not underflow).
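Two of the points above can be sketched in a few lines of NumPy. Everything concrete here is an assumption for illustration: the probabilities in `p`, the schedule form `eta0 * exp(-k * t)`, and the values of `eta0` and `k`.

```python
import numpy as np

# Hypothetical probabilities a model assigned to the correct labels.
p = np.array([0.9, 0.6, 0.75, 0.3])

# Log-likelihood is a sum of logs; cross-entropy is its negated average.
log_likelihood = np.sum(np.log(p))
cross_entropy = -np.mean(np.log(p))
print(log_likelihood, cross_entropy)

# Hypothetical exponential learning-rate schedule: eta_t = eta0 * exp(-k * t).
eta0, k = 0.1, 0.05
t = np.arange(10)
eta = eta0 * np.exp(-k * t)

# In log-space the schedule is a straight line: consecutive log values
# differ by the same amount, namely -k.
log_steps = np.diff(np.log(eta))
print(log_steps)
```

For a dataset this small the raw product `np.prod(p)` is still representable, which makes it easy to confirm that exponentiating the log-likelihood recovers it.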
NumPy: products vs sums in log-space
Let us watch the underflow happen and then fix it with the product rule. Run this:
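A minimal sketch of that experiment follows. The sample size, the range the probabilities are drawn from, and the random seed are assumptions chosen for illustration, not values fixed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Assumed setup: 10,000 probabilities drawn uniformly from [0.4, 0.6].
p = rng.uniform(0.4, 0.6, size=10_000)

# Naive path: multiply everything. The true value is far smaller than any
# positive float64, so the result underflows to zero.
likelihood = np.prod(p)

# Log-space path: sum the logs instead of multiplying the probabilities.
log_likelihood = np.sum(np.log(p))

print("np.prod(p)         =", likelihood)
print("np.sum(np.log(p))  =", log_likelihood)
```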
The np.prod path collapses to 0.0 because the true value is astronomically
small; np.sum(np.log(p)) computes the same quantity in log-space, where it is a
finite negative number you can actually optimize. In real code you would go one
step further and use np.logaddexp / a log-sum-exp helper whenever you must
combine log-space terms that came from a sum rather than a product.
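As a sketch of that last point, `np.logaddexp` computes $\ln(e^{x} + e^{y})$ without leaving log-space. The log-values below are hypothetical, chosen so that exponentiating them directly would underflow.

```python
import numpy as np

# Hypothetical log-probabilities, far too negative to exponentiate safely.
log_a, log_b = -1000.0, -1001.0

# Naive path: exp() underflows to 0.0, so the log of the sum is -inf.
with np.errstate(divide="ignore"):  # silence the log(0) warning
    naive = np.log(np.exp(log_a) + np.exp(log_b))

# Stable path: combine the two terms directly in log-space.
stable = np.logaddexp(log_a, log_b)
print(naive, stable)

# For more than two terms, reduce the same ufunc over an array.
log_terms = np.array([-1000.0, -1001.0, -1002.0])
total = np.logaddexp.reduce(log_terms)
print(total)
```

The product rule does not apply here because the underlying quantities are being added, not multiplied, which is why a dedicated log-space addition is needed.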
Summary
- Powers are repeated multiplication; the three exponent laws ($b^m b^n = b^{m+n}$, $b^m / b^n = b^{m-n}$, $(b^m)^n = b^{mn}$) follow by counting factors, and roots are just fractional powers, $\sqrt[n]{b} = b^{1/n}$.
- The exponential $e^x$ and natural log $\ln x$ are inverses; $e$ is "natural" because $\frac{d}{dx} e^x = e^x$, keeping gradients clean.
- The log rules turn products into sums, quotients into differences, and powers into multipliers; change of base is $\log_b x = \ln x / \ln b$. Each rule is an exponent law seen through the inverse.
- Exponential $\leftrightarrow$ log form: $b^y = x \iff y = \log_b x$.
- In ML, likelihood is a product of probabilities that underflows; taking $\ln$ turns it into a sum of log-probabilities (cross-entropy / log-likelihood), which is both differentiable term-by-term and numerically safe. Softmax denominators use log-sum-exp; learning-rate decay uses $\eta_t = \eta_0 e^{-kt}$.
Active recall
Answer from memory before checking the lesson:
- State the three laws of exponents and rewrite as a single power of .
- What is ? What is , and what does approach as ?
- Derive $\ln(ab) = \ln a + \ln b$ starting from an exponent law.
- A pipeline multiplies probabilities and the result prints as 0.0. What went wrong, and what one-line change fixes it?
- Convert $\log_b x$ into an expression using only $\ln$.
Exercises
Level A: Recall & basic calculation
Evaluate a logarithm
Evaluate . (Read it as: ' to what power gives ?')
Product law of exponents
Use to evaluate as a single integer.
Root as a fractional power
Write as a single power . Enter the exponent as a decimal.
A base-10 logarithm
Evaluate .
exp and ln are inverses
Evaluate .
Negative exponent
Evaluate as a decimal.
Level B: Conceptual understanding
Exponential to log form
The statement is written in exponential form. Which is its correct logarithmic form?
Spot the invalid log identity
Which of the following is not a valid logarithm identity?
Change of base, numerically
Using change of base with and , evaluate . Give three decimals.
Why maximize log-likelihood?
Training maximizes the log-likelihood $\sum_i \ln p_i$ rather than the raw likelihood $\prod_i p_i$. Give two distinct reasons this substitution is safe (same optimum) and better.
Reading the log-likelihood equation
A model reports a total log-likelihood of over a dataset. Which statement is the best interpretation?
Level C: Derivation & implementation
Implement a stable log-likelihood
Write log_likelihood(p) that returns $\sum_i \ln p_i$ for a 1-D array of probabilities. Show that for tiny probabilities the naive product np.prod(p) underflows to 0.0 while your log-space sum stays finite, then print ok.
Derive the quotient rule
Derive the quotient rule $\ln(a/b) = \ln a - \ln b$ starting from an exponent law, for $a, b > 0$.
Numerically stable log-sum-exp
Implement logsumexp(z) computing $\ln \sum_j e^{z_j}$ with the max-subtraction trick, and show it agrees with the naive formula on safe inputs but does not overflow on large logits like z = [1000, 1001, 1002]. Print ok.
Level D: Research-thinking challenge
Why subtract the max in log-sum-exp?
The log-sum-exp identity $\ln \sum_j e^{z_j} = c + \ln \sum_j e^{z_j - c}$ holds for any constant $c$. Prove the identity algebraically, then explain why the specific choice $c = \max_j z_j$ is the numerically safe one, addressing both overflow and underflow.
Cross-entropy as negative log-likelihood
For a classifier that outputs probability $p_i$ on the correct class of example $i$, the average cross-entropy loss is $\mathrm{CE} = -\frac{1}{N} \sum_{i=1}^{N} \ln p_i$. Explain why minimizing $\mathrm{CE}$ is the same as maximizing the likelihood $\prod_{i=1}^{N} p_i$, and interpret the per-example loss $-\ln p_i$ as an amount of 'surprise', including what happens as $p_i \to 1$ and as $p_i \to 0$.