Part 2 · Functions and Graphs · Chapter 6 · 70 min

Important Functions in Machine Learning

Linear, quadratic, sigmoid, tanh, ReLU, softplus

Learning objectives

  • Recognize the shape, domain, range, and asymptotes of each activation
  • Explain what problem each nonlinearity solves in a network
  • Relate sigmoid, tanh, and softplus algebraically
  • Anticipate saturation and vanishing-gradient behavior from a graph

Why a network needs the right functions

A neuron, as we saw, computes a dot product plus a bias: $z = \mathbf{w}^\top\mathbf{x} + b$. That is a linear (strictly, affine) function of its input. Stack two linear layers and you get another linear function, because the composition of linear maps is still linear. No matter how many layers you pile up, a purely linear network can only draw a straight decision boundary. It cannot learn XOR, cannot bend a curve, cannot separate two interleaved spirals.
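The collapse of stacked linear layers can be checked numerically. The sketch below is illustrative only: the layer shapes, the random weights, and the names W1, b1, W2, b2 are assumptions, not values from this chapter.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the check is reproducible

# Two stacked linear layers with no activation in between (shapes are illustrative).
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

def two_layer(x):
    """Apply layer 1, then layer 2, with no nonlinearity in between."""
    return W2 @ (W1 @ x + b1) + b2

# The same map written as ONE linear layer: W = W2 W1, b = W2 b1 + b2.
W = W2 @ W1
b = W2 @ b1 + b2

x = rng.normal(size=3)
collapsed_matches = np.allclose(two_layer(x), W @ x + b)
print(collapsed_matches)
```

The two-layer network and the single merged layer agree on any input, which is why depth alone adds no expressive power without a nonlinearity.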

The fix is to insert a nonlinear function after each linear step. That single scalar function, applied element-wise to every neuron's pre-activation $z$, is what lets a deep network represent curved, non-linear relationships. These are the activation functions, and this chapter is a tour of the handful that matter: sigmoid, tanh, ReLU, Leaky ReLU, and softplus. For each one we care about the same short checklist: its shape, its domain and range, where it saturates, what problem it solves, and, most important for training, how its gradient behaves.

We start from two functions you already know, then build up.

Intuition: squashers vs. rectifiers

Two families cover almost everything in practice.

  • Squashers (sigmoid, tanh) take the whole real line and crush it into a bounded interval: $(0,1)$ for sigmoid, $(-1,1)$ for tanh. They are smooth S-curves. The price of that bounded output is saturation: far out in either tail the curve goes flat, so its slope (the gradient) collapses toward zero.
  • Rectifiers (ReLU and friends) leave positive inputs untouched and clamp negatives. They do not saturate on the positive side, so their gradient stays a healthy $1$ no matter how large the input, but they pay with a kink at the origin and a dead flat region for negative inputs.

Keep that trade-off in mind — bounded and smooth but saturating versus unbounded and non-saturating but kinked — and every choice below will make sense.

Interactive Lab: Function Explorer

Select each function in the explorer and watch two things: the curve itself, and the derivative it plots underneath. The activation is only half the story; the derivative is what the network trains on. Where the derivative is near zero, learning stalls.

Formal definitions

Recall the two baselines from the function chapter. A linear function $f(x) = ax + b$ has a constant slope $a$ and, for $a \neq 0$, range $\mathbb{R}$; a quadratic $f(x) = ax^2 + bx + c$ is the prototype convex bowl (for $a > 0$) with a single minimum. Activations are the nonlinear layer we add on top.
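For reference, the standard definitions of the five activations toured in this chapter can be written as follows, in notation that agrees with the Summary. The constraint $0 < \alpha < 1$ on the Leaky ReLU slope is the usual convention and is an assumption here; the chapter only calls $\alpha$ small.

```latex
\begin{align*}
\sigma(x) &= \frac{1}{1 + e^{-x}}, & \sigma &: \mathbb{R} \to (0, 1) \\
\tanh(x) &= \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}, & \tanh &: \mathbb{R} \to (-1, 1) \\
\mathrm{ReLU}(x) &= \max(0, x), & \mathrm{ReLU} &: \mathbb{R} \to [0, \infty) \\
\mathrm{LeakyReLU}_{\alpha}(x) &= \begin{cases} x, & x \ge 0 \\ \alpha x, & x < 0 \end{cases} & & 0 < \alpha < 1 \\
\zeta(x) &= \ln(1 + e^{x}), & \zeta &: \mathbb{R} \to (0, \infty)
\end{align*}
```

Sigmoid has horizontal asymptotes at $0$ and $1$, tanh at $-1$ and $1$, and softplus approaches $0$ on the left and the line $y = x$ on the right.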

Numerical example: reading a sigmoid off the number line

Plug three inputs into $\sigma$ to feel the saturation directly.
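One way to run this experiment is sketched below. The three inputs $-5$, $0$, and $5$ are an illustrative choice (an assumption, not necessarily the ones the lesson intends): one point deep in each tail and one at the centre.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative inputs (an assumption): left tail, centre, right tail.
xs = np.array([-5.0, 0.0, 5.0])
values = sigmoid(xs)
slopes = values * (1.0 - values)   # sigma'(x) = sigma(x) * (1 - sigma(x))

for x, v, s in zip(xs, values, slopes):
    print(f"x = {x:+.1f}   sigma(x) = {v:.4f}   sigma'(x) = {s:.4f}")
```

At the centre the output is $0.5$ and the slope is at its maximum of $0.25$. At $x = \pm 5$ the output is already within about $0.007$ of its asymptote and the slope has dropped below $0.01$, which is what saturation looks like in numbers.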

Derivations

Sigmoid and tanh are the same curve, rescaled

The two squashers are not independent functions — one is an affine rescaling of the other.
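The identity $\tanh(x) = 2\sigma(2x) - 1$ stated in the Summary follows in a few lines, starting from the sigmoid definition $\sigma(x) = 1/(1+e^{-x})$:

```latex
\begin{align*}
2\sigma(2x) - 1
  &= \frac{2}{1 + e^{-2x}} - 1 \\
  &= \frac{2 - (1 + e^{-2x})}{1 + e^{-2x}} \\
  &= \frac{1 - e^{-2x}}{1 + e^{-2x}} \\
  &= \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
     \qquad \text{(multiply numerator and denominator by } e^{x}\text{)} \\
  &= \tanh(x).
\end{align*}
```

Reading the identity from the inside out: the input is compressed horizontally by a factor of $2$, the output is stretched vertically by $2$, and the result is shifted down by $1$, which moves the range from $(0,1)$ to $(-1,1)$.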

Softplus is the integral of the sigmoid

The claim $\zeta'(x) = \sigma(x)$ is worth doing by hand because it explains the whole design of softplus.
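A short derivation, using only the chain rule and the definitions $\zeta(x) = \ln(1+e^{x})$ and $\sigma(x) = 1/(1+e^{-x})$:

```latex
\begin{align*}
\zeta'(x)
  &= \frac{d}{dx}\,\ln\!\left(1 + e^{x}\right) \\
  &= \frac{e^{x}}{1 + e^{x}} \\
  &= \frac{1}{e^{-x} + 1}
     \qquad \text{(divide numerator and denominator by } e^{x}\text{)} \\
  &= \sigma(x).
\end{align*}
```

Because $\zeta(x) \to 0$ as $x \to -\infty$, this also gives $\zeta(x) = \int_{-\infty}^{x} \sigma(t)\,dt$. The slope of softplus therefore rises smoothly from $0$ to $1$, where the slope of ReLU jumps from $0$ to $1$ at the origin.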

ML use case: nonlinearity, saturation, and why ReLU won

Nonlinearity is the point. A layer computes $\mathbf{a} = g(\mathbf{W}\mathbf{x} + \mathbf{b})$ where $g$ is the activation applied element-wise. Remove $g$ (or make it linear) and a hundred-layer network algebraically collapses to a single linear map. The activation is precisely what buys representational power: the ability to approximate curved functions and carve non-linear decision boundaries.

Saturation kills gradients. During backpropagation the gradient reaching an early weight contains a product of the derivatives of every activation along the path (the chain rule, next chapter). Sigmoid's derivative never exceeds $0.25$, and out in the tails it is essentially $0$. Multiply a dozen numbers each $\le 0.25$ together and the activation derivatives alone shrink the signal reaching the first layer by a factor of at least $0.25^{12} \approx 6 \times 10^{-8}$. This is the vanishing gradient problem that made deep sigmoid/tanh networks nearly untrainable.
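The size of that product is easy to compute. The sketch below looks only at the activation-derivative factors and ignores the weight matrices that also appear in the chain rule; the depth of 12 and the pre-activation value 4.0 are illustrative assumptions.

```python
import numpy as np

def sigmoid_grad(z):
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

depth = 12  # "a dozen" layers, as in the text

# Best case for sigmoid: every unit sits at z = 0, where the slope peaks at 0.25.
best_case = sigmoid_grad(np.zeros(depth))

# A mildly saturated case (illustrative): every unit sits at z = 4.
saturated = sigmoid_grad(np.full(depth, 4.0))

best_case_factor = np.prod(best_case)
saturated_factor = np.prod(saturated)
print(best_case_factor, saturated_factor)
```

Even the best case scales the gradient by $0.25^{12}$, and the saturated case is many orders of magnitude smaller still.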

ReLU is the default because it does not saturate on the positive side. For any $x > 0$ its derivative is exactly $1$, so the activation itself passes gradients through undiminished no matter how deep the stack. It is also trivially cheap: a single comparison. The costs are the kink at $0$ (a subgradient handles it) and dying units: a neuron whose pre-activation stays at $x < 0$ has gradient $0$ and may never recover. Leaky ReLU's small negative slope $\alpha$ and the smooth softplus both exist to soften that failure mode. In practice ReLU remains the first thing you reach for in a hidden layer, while sigmoid survives at the output of a binary classifier, where its $(0,1)$ range is read as a probability.
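The gradient behavior described above can be sketched in a few lines. The default slope alpha=0.01 and the choice of gradient $0$ at exactly $x = 0$ are assumptions made for illustration; any value in $[0, 1]$ is a valid subgradient of ReLU at the kink.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Subgradient choice at x == 0 is 0 here (an assumption).
    return (x > 0).astype(float)

def leaky_relu(x, alpha=0.01):
    # alpha is the small negative-side slope; 0.01 is an illustrative default.
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)

z = np.array([-3.0, -0.5, 0.0, 0.5, 100.0])
print(relu_grad(z))        # zero for non-positive inputs, one for positive inputs
print(leaky_relu_grad(z))  # alpha for non-positive inputs, one for positive inputs
```

A ReLU unit whose pre-activation is negative on every training example sees a gradient of exactly $0$ and cannot move; the Leaky ReLU version always lets a fraction alpha of the gradient through.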

NumPy implementation

Let us implement all four core activations in a vectorized way, so that one call evaluates the whole np.linspace grid at once, and then verify the tanh identity numerically. The one subtlety is softplus: the naive np.log(1 + np.exp(x)) overflows for large x (in float64, once x exceeds roughly 709), so we use the numerically stable np.logaddexp(0, x), which computes $\ln(e^0 + e^x) = \ln(1 + e^x)$ without ever forming a huge $e^x$ directly.

activations.py
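A minimal sketch of what such a file could contain is given below. The function names, the grid of 2001 points on $[-10, 10]$, the finite-difference step, and the tolerances are all assumptions chosen for illustration rather than the lesson's original listing.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid 1 / (1 + e^{-x}); adequate on this grid, not hardened for huge |x|."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def softplus(x):
    """Stable ln(1 + e^x), computed as ln(e^0 + e^x)."""
    return np.logaddexp(0.0, x)

x = np.linspace(-10.0, 10.0, 2001)   # grid choice is illustrative

# Sigmoid saturates in the tails and equals 0.5 at the centre.
assert sigmoid(x[0]) < 1e-4 and sigmoid(x[-1]) > 1 - 1e-4
assert np.isclose(sigmoid(0.0), 0.5)

# ReLU zeros the negatives and passes positives through unchanged.
assert np.all(relu(x[x < 0]) == 0.0)
assert np.array_equal(relu(x[x > 0]), x[x > 0])

# tanh is a rescaled sigmoid: tanh(x) = 2 * sigmoid(2x) - 1.
assert np.allclose(tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0)

# Softplus differentiates to the sigmoid (central finite difference).
h = 1e-5
softplus_slope = (softplus(x + h) - softplus(x - h)) / (2.0 * h)
assert np.allclose(softplus_slope, sigmoid(x), atol=1e-6)

# The stable form stays finite where the naive formula would overflow.
assert softplus(1000.0) == 1000.0
```

The derivative check uses a central difference because it needs nothing beyond the functions already defined; an automatic-differentiation library would serve equally well.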

Every assertion here is a fact from the sections above turned into a runnable check: sigmoid saturates in the tails, ReLU zeros the negatives, tanh is a rescaled sigmoid, and softplus differentiates to the sigmoid. When your math and your code agree on a grid of points, you can trust both.

Part II checkpoint — read equations like a researcher

You will meet these two equations constantly in papers. Work each one through the nine-step routine before revealing the solution.

Summary

  • A network needs nonlinear activations between linear layers; without them, any depth collapses to a single linear map and cannot model curved relationships.
  • Sigmoid $\sigma(x) = 1/(1+e^{-x})$ maps $\mathbb{R} \to (0,1)$, S-shaped, $\sigma(0)=0.5$; derivative $\sigma(1-\sigma) \le 0.25$, so it saturates and causes vanishing gradients.
  • Tanh maps to $(-1,1)$, is zero-centered, and equals a rescaled sigmoid: $\tanh(x) = 2\sigma(2x) - 1$. Also saturates.
  • ReLU $= \max(0,x)$: unbounded above, non-saturating for $x > 0$ (gradient $1$), a kink at $0$, and a dead zone for $x < 0$. It is the deep-learning default.
  • Leaky ReLU gives negatives a small slope $\alpha$ to avoid dying units; softplus $\zeta(x)=\ln(1+e^{x})$ is a smooth ReLU whose derivative is exactly $\sigma(x)$.
  • Implement activations vectorized; compute softplus with np.logaddexp(0, x) to avoid overflow.

Active recall

Answer from memory before checking the lesson:

  1. Why can't a network of only linear layers (no activation) learn a curved decision boundary?
  2. State the range of sigmoid, tanh, and ReLU. Which one is zero-centered?
  3. What is $\sigma(0)$, and what is the maximum value of $\sigma'(x)$? Where does it occur?
  4. Explain in one sentence why saturated sigmoids cause vanishing gradients in a deep network.
  5. What single advantage makes ReLU the default hidden activation, and what is its main failure mode?
  6. Prove that $\zeta'(x) = \sigma(x)$ for softplus $\zeta(x) = \ln(1+e^{x})$.

Exercises

Level A: Recall & basic calculation

Level B: Conceptual understanding

Level C: Derivation & implementation

Level D: Research-thinking challenge
