Part 1 · Algebra and Mathematical Notation · Chapter 3 · 65 min

Powers, Roots, Exponentials, and Logarithms

The machinery behind loss functions and learning rates

Learning objectives

  • Apply the laws of exponents and roots fluently
  • Explain why e and ln are the 'natural' choices in ML
  • Convert between exponential and logarithmic form
  • Use log-space reasoning for products of probabilities

Why exponentials and logs run underneath ML

Two functions show up on almost every page of a machine-learning derivation, and they are inverses of each other. The exponential $e^x$ turns sums into products and appears the instant you write a softmax, a Gaussian, or a growth process. The logarithm $\ln x$ runs the machine in reverse: it turns products into sums, and that single algebraic fact is why we train models by maximizing log-likelihood instead of likelihood, why loss functions are cross-entropies, and why a pipeline that combines ten thousand probabilities in log-space does not silently collapse to zero.

Before any of that, though, both functions are just the top of a short ladder you already half-know from a scientific calculator:

  • Powers: repeated multiplication, $b^n$.
  • Roots: powers with fractional exponents, $\sqrt{x} = x^{1/2}$.
  • Exponentials: a fixed base raised to a variable, $b^x$, with $e^x$ the natural choice.
  • Logarithms: the inverse question, "$b$ to what power gives $x$?"

This chapter climbs that ladder and stops at the one identity you will reuse more than any other: $\log(ab) = \log a + \log b$.

Intuition: logs turn "times" into "plus"

Here is the whole idea in one sentence. Multiplying is expensive and, on a computer, dangerous — multiply enough small numbers and the result rounds to exactly zero. Adding is cheap and safe. A logarithm is the dictionary that translates every multiplication into an addition:

$$\underbrace{a \times b \times c}_{\text{a product}} \;\xrightarrow{\;\log\;}\; \underbrace{\log a + \log b + \log c}_{\text{a sum}}.$$

The exponential is the same dictionary read backwards: it translates addition back into multiplication, $e^{p+q} = e^p\, e^q$. That is all an exponential and a logarithm do: they are a matched pair of translators between the additive world and the multiplicative world. Everything else in this chapter is a consequence of that one relationship.
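The translation in both directions can be checked numerically. The sketch below uses only Python's standard `math` module; the values of `a`, `b`, `c`, `p`, and `q` are arbitrary illustrative choices, not taken from the text.

```python
import math

# Arbitrary positive numbers chosen for illustration.
a, b, c = 2.0, 5.0, 0.3

# log translates a product into a sum.
log_of_product = math.log(a * b * c)
sum_of_logs = math.log(a) + math.log(b) + math.log(c)

# exp translates a sum back into a product.
p, q = 1.5, -0.7
exp_of_sum = math.exp(p + q)
product_of_exps = math.exp(p) * math.exp(q)

# Each pair agrees up to floating-point rounding.
print(math.isclose(log_of_product, sum_of_logs))
print(math.isclose(exp_of_sum, product_of_exps))
```

The comparison uses `math.isclose` rather than `==` because the two routes round differently in floating point even though they are algebraically identical.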

The base $e \approx 2.71828$ is the "natural" choice for a reason we will make precise: it is the base whose exponential is its own rate of change, so calculus (and therefore gradient-based learning) stays clean. Before the formalities, get a feel for the two curves by dragging on them.

Interactive Lab: Function Explorer

Select exponential and notice it never dips to zero or below ($e^x > 0$ for every real $x$) and that it explodes for positive $x$. Now select logarithm: it is the mirror image across the line $y = x$, defined only for $x > 0$, crossing the axis at $\ln 1 = 0$ and diving to $-\infty$ as $x \to 0^+$. That last fact is the reason "the log of a tiny probability is a large negative number," so its negative is a large positive penalty, which is exactly how a loss function should behave.

Formal definitions

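For a positive integer $n$, the power $b^n$ is repeated multiplication: $b^n = b \cdot b \cdots b$ with $n$ factors. From that single definition, three laws of exponents (eq. 3.1) follow by just counting factors: $b^{m}b^{n} = b^{m+n}$, $b^{m}/b^{n} = b^{m-n}$, and $(b^{m})^{n} = b^{mn}$.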

Fractional exponents are defined to make the third law keep working. If we want $\left(b^{1/n}\right)^{n} = b^{1} = b$, then $b^{1/n}$ must be the number that gives $b$ when raised to the $n$-th power, that is, the $n$-th root.
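A quick numerical illustration of that definition, as a sketch with arbitrary values (`b = 8.0` and `x = 2.0` are chosen here for convenience, not taken from the text):

```python
import math

b = 8.0
cube_root = b ** (1 / 3)  # b^(1/3), the cube root of b

# Raising the root back to the 3rd power recovers b (up to rounding).
recovered = cube_root ** 3

# The square root is the special case n = 2.
x = 2.0
sqrt_matches = math.isclose(math.sqrt(x), x ** 0.5)

print(cube_root, recovered, sqrt_matches)
```

Because `1 / 3` is not exactly representable in binary floating point, results like these should be compared with a tolerance rather than tested for exact equality.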

Why $e$ is the "natural" base

Among all bases $b$, exactly one makes the exponential its own derivative: $\frac{d}{dx} e^{x} = e^{x}$. For any other base, $\frac{d}{dx} b^{x} = b^{x}\ln b$ picks up a stray constant factor $\ln b$. Choosing $b = e$ sets that factor to $\ln e = 1$, so growth rate equals value. Equivalently, $e$ is the limit of compounding infinitely often, $e = \lim_{n \to \infty}\left(1 + \tfrac{1}{n}\right)^{n}$, which is why continuously-compounded interest, radioactive decay, and exponential learning-rate schedules are all written with $e$. When derivatives stay clean, so do gradients, and gradients are how models learn.
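Both characterizations of $e$ can be probed numerically. The following is a rough sketch: the helper `numerical_derivative`, the step size `h`, and the evaluation point `x` are illustrative assumptions, and a central difference only approximates the true derivative.

```python
import math

# (1 + 1/n)^n creeps toward e as n grows.
for n in (10, 1_000, 100_000):
    print(n, (1 + 1 / n) ** n)

def numerical_derivative(f, x, h=1e-6):
    """Central-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

x = 1.3  # arbitrary evaluation point

# For base e, the slope equals the value itself.
slope_e = numerical_derivative(math.exp, x)

# For base 2, the slope is the value times the stray factor ln 2.
slope_2 = numerical_derivative(lambda t: 2.0 ** t, x)

print(math.isclose(slope_e, math.exp(x), rel_tol=1e-6))
print(math.isclose(slope_2, 2.0 ** x * math.log(2.0), rel_tol=1e-6))
```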

The log rules

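Because $\ln$ is the inverse of a function that turns sums into products, it turns products back into sums. The three rules you must know cold (eq. 3.2), valid for $x, y > 0$: the product rule $\ln(xy) = \ln x + \ln y$, the quotient rule $\ln(x/y) = \ln x - \ln y$, and the power rule $\ln(x^{k}) = k \ln x$.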

To move between bases (e.g. from $\ln$ to $\log_{2}$), use change of base: $\log_{b} x = \ln x / \ln b$.

These hold for any valid base, not just $e$; the product, quotient, and power rules are the same shape whether you write $\log_{2}$, $\log_{10}$, or $\ln$.
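As a sanity check that the rules really are base-independent, here is a small sketch that tests them in base 2 and then applies change of base. The test values `x`, `y`, and `k` are arbitrary choices made for this illustration.

```python
import math

x, y, k = 3.0, 7.0, 4.0  # arbitrary positive test values

# Product, quotient, and power rules, checked in base 2.
product_ok = math.isclose(math.log2(x * y), math.log2(x) + math.log2(y))
quotient_ok = math.isclose(math.log2(x / y), math.log2(x) - math.log2(y))
power_ok = math.isclose(math.log2(x ** k), k * math.log2(x))

# Change of base: log_2(10) built from natural logs only,
# compared against the library's direct base-2 logarithm.
log2_10_via_ln = math.log(10.0) / math.log(2.0)
change_of_base_ok = math.isclose(log2_10_via_ln, math.log2(10.0))

print(product_ok, quotient_ok, power_ok, change_of_base_ok)
```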

A worked numerical example

Deriving a log rule from an exponent law

The log rules are not new facts; each is an exponent law from eq. 3.1 seen through the inverse. Here is the product rule, derived from $b^{m}b^{n} = b^{m+n}$.
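One way to write the derivation out, as a sketch. The symbols $m$ and $n$ are introduced for the argument and stand for the logarithms of $x$ and $y$, with $x, y > 0$ and a valid base $b$:

```latex
\begin{aligned}
&\text{Let } m = \log_b x \text{ and } n = \log_b y, \text{ so that } x = b^{m},\; y = b^{n}. \\
&xy = b^{m} b^{n} = b^{m+n} && \text{(exponent law)} \\
&\log_b(xy) = m + n && \text{(definition of } \log_b\text{)} \\
&\log_b(xy) = \log_b x + \log_b y && \text{(substitute } m \text{ and } n\text{)}
\end{aligned}
```

The quotient and power rules follow the same pattern, starting from $b^{m}/b^{n} = b^{m-n}$ and $(b^{m})^{k} = b^{mk}$ respectively.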

ML use case: log-likelihood and avoiding underflow

Suppose a model assigns probability $p_i$ to the correct label of the $i$-th example, and the $N$ examples are independent. The probability of getting the whole dataset right, the likelihood, is the product $L = \prod_{i=1}^{N} p_i$.

Each $p_i$ lies in $(0, 1)$, so this product shrinks fast. With $N = 10{,}000$ examples each around $p_i = 0.1$, we get $L = 10^{-10000}$, which is far below the smallest positive normal number a 64-bit float can represent (about $2.2 \times 10^{-308}$; even subnormals reach only about $4.9 \times 10^{-324}$). The computed likelihood underflows to exactly $0.0$, and $\ln 0 = -\infty$ destroys the gradient. Multiplication is the enemy.

Take the logarithm and the product rule (eq. 3.2) turns it into a sum, the log-likelihood: $\ln L = \sum_{i=1}^{N} \ln p_i$.

Now each term $\ln p_i$ is a modest negative number (for $p_i = 0.1$, $\ln p_i \approx -2.3$), and their sum $\approx -23{,}000$ is perfectly representable. This is why training objectives are written in log-space:

  • Cross-entropy loss is exactly the negative log-likelihood, $-\sum_i \ln p_i$ (averaged), so minimizing it maximizes the probability of the data.
  • Softmax produces those $p_i$ as $e^{z_j}/\sum_k e^{z_k}$; its denominator is a sum of exponentials, and taking $\ln$ of the whole thing (the log-sum-exp operation, $\ln\sum_k e^{z_k}$) is done with a numerically stable trick: subtract $\max_k z_k$ first so no $e^{z_k}$ overflows.
  • Learning-rate decay often uses the exponential $\eta_t = \eta_0\, e^{-\lambda t}$, a smooth multiplicative shrink that is a straight line in log-space.
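The max-subtraction trick from the softmax bullet can be sketched in a few lines of standard-library Python. The function names `log_sum_exp` and `log_softmax` and the logits in `z` are hypothetical choices for this illustration; production code would normally rely on a library routine instead.

```python
import math

def log_sum_exp(z):
    """Stable ln(sum_k exp(z_k)): shift by the max before exponentiating."""
    m = max(z)
    return m + math.log(sum(math.exp(zk - m) for zk in z))

def log_softmax(z):
    """Log of the softmax probabilities: ln p_j = z_j - logsumexp(z)."""
    lse = log_sum_exp(z)
    return [zj - lse for zj in z]

# Hypothetical logits, large enough that math.exp(1000.0) alone would overflow.
z = [1000.0, 1001.0, 1002.0]
log_p = log_softmax(z)
probs = [math.exp(lp) for lp in log_p]

print(log_p)
print(sum(probs))  # close to 1, up to rounding
```

Subtracting the maximum leaves the result unchanged algebraically, since $\ln\sum_k e^{z_k} = m + \ln\sum_k e^{z_k - m}$, but it guarantees every exponent is at most zero.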

The through-line: products of probabilities become sums of log-probabilities, and that conversion is both algebraically convenient (sums differentiate term by term) and numerically essential (sums do not underflow).

NumPy: products vs sums in log-space

Let us watch the underflow happen and then fix it with the product rule. Run this:

log_space.py
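A minimal sketch of what such a script might look like, assuming the same setup as the text above ($N = 10{,}000$ examples with $p_i = 0.1$ each). The variable names are illustrative.

```python
import numpy as np

# Setup matching the text: N = 10,000 independent examples, each with p_i = 0.1.
N = 10_000
p = np.full(N, 0.1)

# Direct product: the true value 10^(-10000) is not representable, so it underflows.
likelihood = np.prod(p)

# Log-space: the product rule turns the product into a sum of logs.
log_likelihood = np.sum(np.log(p))

print("np.prod(p)        =", likelihood)
print("np.sum(np.log(p)) =", log_likelihood)
```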

The np.prod path collapses to 0.0 because the true value is astronomically small; np.sum(np.log(p)) computes the logarithm of that same quantity, a finite negative number you can actually optimize. In real code you would go one step further and use np.logaddexp or a log-sum-exp helper whenever you must combine log-space terms that came from a sum rather than a product.

Summary

  • Powers are repeated multiplication; the three exponent laws ($b^{m}b^{n}=b^{m+n}$, $b^{m}/b^{n}=b^{m-n}$, $(b^{m})^{n}=b^{mn}$) follow by counting factors, and roots are just fractional powers, $\sqrt[n]{b}=b^{1/n}$.
  • The exponential $e^{x}$ and natural log $\ln x$ are inverses; $e$ is "natural" because $\frac{d}{dx}e^{x}=e^{x}$, keeping gradients clean.
  • The log rules turn products into sums, quotients into differences, and powers into multipliers; change of base is $\log_{b}x = \ln x / \ln b$. Each rule is an exponent law seen through the inverse.
  • Exponential $\leftrightarrow$ log form: $y=b^{x} \iff x=\log_{b}y$.
  • In ML, likelihood is a product of probabilities that underflows; taking $\ln$ turns it into a sum of log-probabilities (cross-entropy / log-likelihood), which is both differentiable term-by-term and numerically safe. Softmax denominators use log-sum-exp; learning-rate decay uses $e^{-\lambda t}$.

Active recall

Answer from memory before checking the lesson:

  1. State the three laws of exponents and rewrite $\sqrt[3]{x^{2}}$ as a single power of $x$.
  2. What is $\log_{2} 32$? What is $\ln 1$, and what does $\ln x$ approach as $x \to 0^{+}$?
  3. Derive $\ln(xy) = \ln x + \ln y$ starting from an exponent law.
  4. A pipeline multiplies $50{,}000$ probabilities and the result prints as 0.0. What went wrong, and what one-line change fixes it?
  5. Convert $\log_{5} 20$ into an expression using only $\ln$.

Exercises

Level A: Recall & basic calculation

Level B: Conceptual understanding

Level C: Derivation & implementation

Level D: Research-thinking challenge