Important Functions in Machine Learning

Why a network needs the right functions

A neuron, as we saw, computes a dot product plus a bias: $z = \mathbf{w}^\top\mathbf{x} + b$. That is a linear (strictly, affine) function of its input. Stack two such layers and you get another function of the same form, because the composition of linear maps is still linear. No matter how many layers you pile up, a purely linear network can only draw a flat decision boundary: a line in two dimensions, a hyperplane in general. It cannot learn XOR, cannot bend a curve, cannot separate two interleaved spirals.
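The collapse of stacked linear layers can be checked numerically. The following is a minimal sketch, assuming two small layers with made-up sizes (3 inputs, 4 hidden units, 2 outputs) and random weights; the names `W1`, `b1`, `W2`, `b2`, `two_layer`, and `one_layer` are illustrative and not part of the lesson.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two stacked linear layers with hypothetical sizes 3 -> 4 -> 2.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

def two_layer(x):
    # No activation between the layers.
    return W2 @ (W1 @ x + b1) + b2

# The same map written as ONE layer: W = W2 W1, b = W2 b1 + b2.
W = W2 @ W1
b = W2 @ b1 + b2

def one_layer(x):
    return W @ x + b

x = rng.normal(size=3)
collapsed = np.allclose(two_layer(x), one_layer(x))
print("two layers equal one layer:", collapsed)
```

The single matrix `W` has the shape of a direct 3-to-2 layer, so the hidden layer added parameters but no expressive power.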

The fix is to insert a nonlinear function after each linear step. That single scalar function — applied element-wise to every neuron's pre-activation $z$ — is what lets a deep network represent curved, non-linear relationships. These are the activation functions, and this chapter is a tour of the handful that matter: sigmoid, tanh, ReLU, Leaky ReLU, and softplus. For each one we care about the same short checklist: its shape, its domain and range, where it saturates, what problem it solves, and — most important for training — how its gradient behaves.

We start from two functions you already know, then build up.

Intuition: squashers vs. rectifiers

Two families cover almost everything in practice.

Squashers (sigmoid, tanh) take the whole real line and crush it into a bounded interval — $(0,1)$ for sigmoid, $(-1,1)$ for tanh. They are smooth S-curves. The price of that bounded output is saturation: far out in either tail the curve goes flat, so its slope — the gradient — collapses toward zero.
Rectifiers (ReLU and friends) leave positive inputs untouched and clamp negatives. They do not saturate on the positive side, so their gradient stays a healthy $1$ no matter how large the input — but they pay with a kink at the origin and a dead flat region for negative inputs.

Keep that trade-off in mind — bounded and smooth but saturating versus unbounded and non-saturating but kinked — and every choice below will make sense.

Interactive Lab: Function Explorer

Select each function in the explorer and watch two things: the curve itself, and the derivative it plots underneath. The activation is only half the story; the derivative is what the network trains on. Where the derivative is near zero, learning stalls.

Formal definitions

Recall the two baselines from the function chapter. A linear function $f(x) = ax + b$ has a constant slope $a$ and range $\mathbb{R}$ (for $a \neq 0$); a quadratic $f(x) = ax^2 + bx + c$ is the prototype convex bowl (for $a > 0$) with a single minimum. Activations are the nonlinear layer we add on top.

Symbol	Meaning	Type	Shape	Role
$\sigma(x)$	Sigmoid / logistic	function	ℝ→(0,1)	activation
$\tanh(x)$	Hyperbolic tangent (zero-centered)	function	ℝ→(−1,1)	activation
$\mathrm{ReLU}(x)$	Rectified linear unit, max(0,x)	function	ℝ→[0,∞)	activation
$\alpha$	Leak slope for negative inputs	scalar	1	hyperparameter
$\zeta(x)$	Softplus, ln(1+eˣ)	function	ℝ→(0,∞)	activation
$z$	Pre-activation wᵀx + b (the input to σ)	scalar	1	variable
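For reference, the closed forms behind the symbols in the table can be written out as below. These are the standard textbook definitions; the name $\mathrm{LReLU}$ and the convention of assigning $x = 0$ to the positive branch follow the notation used in this chapter's exercises.

```latex
\begin{align*}
\sigma(x) &= \frac{1}{1 + e^{-x}}, \\
\tanh(x) &= \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}, \\
\mathrm{ReLU}(x) &= \max(0, x), \\
\mathrm{LReLU}(x) &=
  \begin{cases}
    x, & x \ge 0, \\
    \alpha x, & x < 0,
  \end{cases} \\
\zeta(x) &= \ln\!\left(1 + e^{x}\right).
\end{align*}
```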

Numerical example: reading a sigmoid off the number line

Plug three inputs into $\sigma$ to feel the saturation directly.

Worked Example — evaluating the sigmoid at three points

At the center, $x = 0$: $\sigma(0) = \frac{1}{1 + e^{0}} = \frac{1}{1 + 1} = \frac{1}{2} = 0.5.$

For a large positive input, $x = 10$: $\sigma(10) = \frac{1}{1 + e^{-10}} = \frac{1}{1 + 0.0000454} \approx 0.99995.$

For a large negative input, $x = -10$: $\sigma(-10) = \frac{1}{1 + e^{10}} = \frac{1}{1 + 22026} \approx 0.0000454.$

Now the gradients. Since $\sigma'(x) = \sigma(x)(1-\sigma(x))$: $\sigma'(0) = 0.5 \times 0.5 = 0.25, \qquad \sigma'(10) \approx 0.99995 \times 0.0000454 \approx 4.5\times10^{-5}.$ By symmetry, $\sigma'(-10)$ takes the same value. At the center the slope is a useful $0.25$; ten units out it has all but vanished. That collapse is the story of vanishing gradients.
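The same three evaluations can be reproduced in a few lines. This is a small sketch using only the standard library; the helper names `sigmoid` and `sigmoid_grad` are chosen here for illustration.

```python
import math

def sigmoid(x):
    # Textbook form; safe for the moderate inputs used in this example.
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (0.0, 10.0, -10.0):
    print(f"x = {x:6.1f}  sigmoid = {sigmoid(x):.7f}  slope = {sigmoid_grad(x):.2e}")
```

Printing the slope in scientific notation makes the drop from the center to the tails easy to read off.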

Derivations

Sigmoid and tanh are the same curve, rescaled

The two squashers are not independent functions — one is an affine rescaling of the other.

tanh(x) = 2σ(2x) − 1

Start from the definitions and expand $2\sigma(2x)$, multiplying numerator and denominator by $e^{x}$: $2\sigma(2x) = \frac{2}{1 + e^{-2x}} = \frac{2\,e^{x}}{e^{x} + e^{-x}}.$ Subtract $1 = \dfrac{e^{x} + e^{-x}}{e^{x} + e^{-x}}$: $2\sigma(2x) - 1 = \frac{2e^{x} - (e^{x} + e^{-x})}{e^{x} + e^{-x}} = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = \tanh(x).$ So tanh is just a sigmoid with its input scaled by $2$ and its output stretched from $(0,1)$ to $(-1,1)$, which makes it zero-centered. This is why tanh is often preferred over sigmoid for hidden layers: centering the outputs on $0$ keeps the next layer's inputs from all sharing one sign, which tends to speed up learning.

Softplus is the integral of the sigmoid

The claim $\zeta'(x) = \sigma(x)$ is worth doing by hand because it explains the whole design of softplus.

d/dx ln(1 + eˣ) = σ(x)

Let $\zeta(x) = \ln(1 + e^{x})$. By the chain rule, with $u = 1 + e^{x}$ and $\dfrac{d}{dx}\ln u = \dfrac{1}{u}\dfrac{du}{dx}$: $\zeta'(x) = \frac{1}{1 + e^{x}} \cdot \frac{d}{dx}(1 + e^{x}) = \frac{e^{x}}{1 + e^{x}}.$ Divide top and bottom by $e^{x}$: $\zeta'(x) = \frac{1}{e^{-x} + 1} = \sigma(x).$ So softplus is a smooth rectifier whose gradient rises smoothly from $0$ (deep negative inputs) to $1$ (deep positive inputs): exactly ReLU's step derivative, but rounded. Where ReLU's gradient jumps discontinuously at $0$, softplus passes through $\sigma(0) = 0.5$.
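To see what "ReLU, but rounded" means numerically, the sketch below compares softplus with ReLU on an integer grid. It leans on the identity $\zeta(x) = \max(0,x) + \ln(1 + e^{-|x|})$, which implies the gap between the two is largest at $x = 0$, where it equals $\ln 2$. The grid and the variable names are illustrative choices.

```python
import numpy as np

x = np.arange(-10, 11, dtype=float)       # integer grid, includes x = 0

relu = np.maximum(0.0, x)
softplus = np.logaddexp(0.0, x)           # stable ln(1 + e^x)
sigmoid = 1.0 / (1.0 + np.exp(-x))        # softplus gradient, per the derivation

gap = softplus - relu                     # equals ln(1 + e^{-|x|})
relu_step = (x > 0).astype(float)         # ReLU gradient away from the kink

print("largest gap:", gap.max(), "at x =", x[gap.argmax()])
print("softplus gradient at 0:", sigmoid[x == 0][0])
```

Away from the origin the two functions and their gradients agree closely; the difference is concentrated in a narrow band around the kink.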

ML use case: nonlinearity, saturation, and why ReLU won

Nonlinearity is the point. A layer computes $\mathbf{a} = g(\mathbf{W}\mathbf{x} + \mathbf{b})$ where $g$ is the activation applied element-wise. Remove $g$ (or make it linear) and a hundred-layer network algebraically collapses to a single linear map. The activation is precisely what buys representational power — the ability to approximate curved functions and carve non-linear decision boundaries.

Saturation kills gradients. During backpropagation the gradient reaching an early weight contains a product of the derivatives of every activation along the path (the chain rule, next chapter). Sigmoid's derivative never exceeds $0.25$, and out in the tails it is essentially $0$. Multiply a dozen numbers each $\le 0.25$ together and the result is at most $0.25^{12} \approx 6 \times 10^{-8}$, so the signal reaching the first layer is tiny. This is the vanishing gradient problem that made deep sigmoid/tanh networks so hard to train.
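The "dozen numbers" argument is easy to put into code. The sketch below multiplies only the activation derivatives along a 12-layer path and ignores the weight factors that also appear in a real backward pass, so it is an illustration of the bound rather than a full gradient computation. The saturated pre-activation value $z = 4$ is an arbitrary choice for the example.

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

depth = 12  # "a dozen" layers

# Best case: every pre-activation sits at z = 0, where the slope peaks at 0.25.
best_case = sigmoid_grad(np.zeros(depth)).prod()

# Hypothetical mildly saturated case: every pre-activation at z = 4.
saturated = sigmoid_grad(np.full(depth, 4.0)).prod()

print(f"best case : {best_case:.3e}")
print(f"saturated : {saturated:.3e}")
```

Even the best case is already below one part in ten million; any saturation makes it dramatically worse.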

ReLU is the default because it does not saturate on the positive side. For any $x > 0$ its derivative is exactly $1$, so the activation itself never shrinks the gradient, no matter how deep the stack. It is also trivially cheap: a single comparison. The costs are the kink at $0$ (a subgradient handles it) and dying units: a neuron whose pre-activation stays negative for every input has gradient $0$ and may never recover. Leaky ReLU's small negative slope $\alpha$ and the smooth softplus both exist to soften that failure mode. In practice ReLU remains the first thing you reach for in a hidden layer, while sigmoid survives at the output of a binary classifier, where its $(0,1)$ range is read as a probability.
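The dying-unit problem shows up directly in the derivatives. The sketch below writes out the gradients of ReLU and Leaky ReLU; the function names, the leak value `alpha=0.1`, and the sample pre-activations are illustrative. At exactly $x = 0$ any slope between the left and right values is a valid subgradient, and this sketch arbitrarily picks the left one.

```python
import numpy as np

def relu_grad(x):
    # Slope 1 for x > 0, slope 0 otherwise (including the kink at 0).
    return (x > 0).astype(float)

def leaky_relu_grad(x, alpha=0.1):
    # Slope 1 for x > 0, slope alpha otherwise (alpha = 0.1 is illustrative).
    return np.where(x > 0, 1.0, alpha)

z = np.array([-3.0, -0.5, 2.0, 50.0])    # hypothetical pre-activations
print("relu grad      :", relu_grad(z))
print("leaky relu grad:", leaky_relu_grad(z))
```

A unit with negative pre-activation passes no gradient under ReLU, while Leaky ReLU still lets a fraction `alpha` through, which gives the weights a chance to move back into the active region.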

NumPy implementation

Let us implement all four core activations in a vectorized way — one call evaluates the whole np.linspace grid at once — and then verify the tanh identity numerically. The one subtlety is softplus: the naive np.log(1 + np.exp(x)) overflows for large x, so we use the numerically stable np.logaddexp(0, x), which computes $\ln(e^0 + e^x) = \ln(1 + e^x)$ without ever forming $e^x$ directly.

activations.py

import numpy as np

def sigmoid(x):
  # 1 / (1 + e^-x), vectorized over the whole array at once
  return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
  return np.tanh(x)

def relu(x):
  # element-wise max against 0.0 (broadcast)
  return np.maximum(0.0, x)

def softplus(x):
  # STABLE ln(1 + e^x): logaddexp(0, x) = ln(e^0 + e^x), no overflow
  return np.logaddexp(0.0, x)

# Evaluate every activation on a grid spanning the saturated tails
x = np.linspace(-6.0, 6.0, 13)
print("x        =", np.round(x, 1))
print("sigmoid  =", np.round(sigmoid(x), 3))
print("relu     =", np.round(relu(x), 3))

# sigmoid saturates: near 0 on the left, near 1 on the right
assert sigmoid(x)[0] < 0.01 and sigmoid(x)[-1] > 0.99
# ReLU zeros the negative half exactly
assert np.all(relu(x[x < 0]) == 0.0)

# Identity check: tanh(x) == 2*sigmoid(2x) - 1  (see the derivation above)
assert np.allclose(tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0)

# Softplus derivative is the sigmoid: check with a central finite difference
h = 1e-6
num_deriv = (softplus(x + h) - softplus(x - h)) / (2 * h)
assert np.allclose(num_deriv, sigmoid(x), atol=1e-5)

# Naive softplus overflows; the stable one does not
big = 1000.0
assert np.isfinite(softplus(big))          # stays finite
print("softplus(1000) =", softplus(big))   # ~= 1000.0, not inf

print("ok")

Every assertion here is a fact from the sections above turned into a runnable check: sigmoid saturates in the tails, ReLU zeros the negatives, tanh is a rescaled sigmoid, and softplus differentiates to the sigmoid. When your math and your code agree on a grid of points, you can trust both.

Two traps: overflow and saturation

Naive softplus. For $x = 1000$, writing np.log(1 + np.exp(x)) computes np.exp(1000) first, which is inf, so the whole result is inf even though the true answer is $\approx 1000$. Always use np.logaddexp(0, x) (or the identity $\zeta(x) = \max(0,x) + \ln(1 + e^{-|x|})$). The same care applies to a naive sigmoid: in 1/(1+np.exp(-x)), np.exp(-x) overflows to inf for large negative x. The quotient still comes out as 0.0, but NumPy emits an overflow RuntimeWarning along the way, so prefer scipy.special.expit or split on the sign of x in production code.
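One way to "split on the sign of x" is sketched below. This is an illustrative implementation, not the one used by scipy.special.expit; the name `stable_sigmoid` is made up here, and the function is written for array inputs. For negative inputs it uses the algebraically equivalent form $e^{x}/(1+e^{x})$, so the exponential is never evaluated at a large positive argument.

```python
import numpy as np

def stable_sigmoid(x):
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    # x >= 0: e^{-x} <= 1, so the textbook form is safe.
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    # x < 0: rewrite as e^x / (1 + e^x); e^x < 1, so nothing overflows.
    ex = np.exp(x[~pos])
    out[~pos] = ex / (1.0 + ex)
    return out

x = np.array([-1000.0, -10.0, 0.0, 10.0, 1000.0])
with np.errstate(over="raise"):   # turn any overflow into an exception
    y = stable_sigmoid(x)
print(y)
```

Running the call under `np.errstate(over="raise")` is a simple way to confirm that no overflow occurs anywhere in the computation.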

Sigmoid saturation. Because $\sigma'(x) \le 0.25$ and decays to $0$ in the tails, deep networks built only from sigmoids train painfully slowly. If your loss plateaus immediately and early-layer gradients are near zero, saturated activations are a prime suspect — switch hidden layers to ReLU.

Part II checkpoint — read equations like a researcher

You will meet these two equations constantly in papers. Work each one through the nine-step routine before revealing the solution.

Research Paper Equation Practice

The logistic (sigmoid) prediction

The core of logistic regression and of a single sigmoid output neuron: a linear score is squashed into a probability.

$\hat{y} = \sigma\!\left(\mathbf{w}^\top \mathbf{x} + b\right)$

Work through these steps:

Identify every symbol.
State the type of every object (scalar, vector, matrix, index, set, function).
State the dimensions / shapes.
Rewrite the equation in plain English.
Expand it for a tiny concrete example.
Identify the assumptions.
Convert it to pseudocode.
Implement it in NumPy.
Explain its machine-learning purpose.

Research Paper Equation Practice

Softplus activation

A smooth, everywhere-differentiable stand-in for ReLU whose derivative is the sigmoid.

$\zeta(x) = \ln\!\left(1 + e^{x}\right)$

Work through these steps:

Identify every symbol.
State the type of every object (scalar, vector, matrix, index, set, function).
State the dimensions / shapes.
Rewrite the equation in plain English.
Expand it for a tiny concrete example.
Identify the assumptions.
Convert it to pseudocode.
Implement it in NumPy.
Explain its machine-learning purpose.

Summary

A network needs nonlinear activations between linear layers; without them, any depth collapses to a single linear map and cannot model curved relationships.
Sigmoid $\sigma(x) = 1/(1+e^{-x})$ maps $\mathbb{R} \to (0,1)$, is S-shaped, and has $\sigma(0)=0.5$; its derivative $\sigma(1-\sigma) \le 0.25$, so it saturates and causes vanishing gradients.
Tanh maps to $(-1,1)$, is zero-centered, and equals a rescaled sigmoid: $\tanh(x) = 2\sigma(2x) - 1$. It also saturates.
ReLU $=\max(0,x)$ is the deep-learning default: unbounded above, non-saturating for $x > 0$ (gradient $1$), with a kink at $0$ and a dead zone for $x < 0$.
Leaky ReLU gives negatives a small slope $\alpha$ to avoid dying units; softplus $\zeta(x)=\ln(1+e^{x})$ is a smooth ReLU whose derivative is exactly $\sigma(x)$.
Implement activations vectorized; compute softplus with np.logaddexp(0, x) to avoid overflow.

Active recall

Answer from memory before checking the lesson:

Why can't a network of only linear layers (no activation) learn a curved decision boundary?
State the range of sigmoid, tanh, and ReLU. Which one is zero-centered?
What is $\sigma(0)$, and what is the maximum value of $\sigma'(x)$? Where does it occur?
Explain in one sentence why saturated sigmoids cause vanishing gradients in a deep network.
What single advantage makes ReLU the default hidden activation, and what is its main failure mode?
Prove that $\zeta'(x) = \sigma(x)$ for softplus $\zeta(x) = \ln(1+e^{x})$.

Exercises

Level A: Recall & basic calculation

Level A · Hand calculation · ch06-A1

Sigmoid at zero

Compute $\sigma(0)$ for the sigmoid $\sigma(x) = \dfrac{1}{1 + e^{-x}}$.

Level A · Hand calculation · ch06-A2

ReLU of a negative input

Compute $\mathrm{ReLU}(-3) = \max(0, -3)$.

Level A · Hand calculation · ch06-A3

ReLU of a positive input

Compute $\mathrm{ReLU}(5) = \max(0, 5)$.

Level A · Equation interpretation · ch06-A4

Range of tanh

What is the range of $\tanh(x)$?

Level A · Hand calculation · ch06-A5

Leaky ReLU on a negative input

Using the course's Leaky ReLU with leak $\alpha = 0.1$, so $\mathrm{LReLU}(x) = x$ for $x \ge 0$ and $\alpha x$ for $x < 0$, compute $\mathrm{LReLU}(-4)$.

Level A · Equation interpretation · ch06-A6

Maximum slope of the sigmoid

The sigmoid derivative is $\sigma'(x) = \sigma(x)(1 - \sigma(x))$. What is its maximum value?

Level B: Conceptual understanding

Level B · Equation interpretation · ch06-B1

Which activation is zero-centered?

Which of these hidden-layer activations is zero-centered (outputs symmetric about $0$)?

Level B · Graph interpretation · ch06-B2

Match the graph to the function

A plotted curve is $0$ for all negative inputs, then rises as a straight line of slope $1$ for positive inputs, with a sharp corner at the origin. Which function is it?

Level B · Shape reasoning · ch06-B3

Range of softplus

What is the range of softplus $\zeta(x) = \ln(1 + e^{x})$?

Level B · ML application · ch06-B4

Saturation and vanishing gradients

In a deep network of sigmoid layers, explain why pushing many neurons into their saturated regions (large $|x|$) makes the early layers train slowly. Reference the chain rule and the size of $\sigma'(x)$.

Level B · ML application · ch06-B5

Sigmoid output, ReLU hidden

A binary classifier commonly uses ReLU in its hidden layers but a sigmoid on the final output neuron. Give the reason for each choice in one sentence.

Level C: Derivation & implementation

Level C · NumPy implementation · ch06-C1

Implement ReLU and Leaky ReLU

Implement vectorized relu(x) and leaky_relu(x, alpha=0.1) for a 1-D NumPy array. Verify on x = np.array([-2.0, 0.0, 3.0]) that ReLU gives [0, 0, 3] and Leaky ReLU gives [-0.2, 0, 3], then print ok.

Level C · NumPy implementation · ch06-C2

Numerically stable softplus

Implement softplus(x) that does not overflow for large inputs, and confirm softplus(1000.0) is finite (and $\approx 1000$) while the naive np.log(1 + np.exp(1000.0)) is inf. Also check numerically that the derivative of softplus equals the sigmoid. Print ok.

Level C · Derivation · ch06-C3

Derive the sigmoid's derivative

Show that $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$ for $\sigma(x) = (1 + e^{-x})^{-1}$.

Level C · Numerical experiment · ch06-C4

Simulate a vanishing gradient

Empirically show gradient vanishing: multiply together the sigmoid derivatives $\sigma'(z)$ across a stack of layers whose pre-activations are all saturated (e.g. $z = 6$). Compare the product for a 20-layer sigmoid stack against a 20-layer ReLU stack (per-layer derivative $1$ for $z > 0$). Print both products, assert the sigmoid product is far smaller, and print ok.

Level D: Research-thinking challenge

Level D · Paper-reading practice · ch06-D1

Why did ReLU help deep networks?

Before ~2011, deep networks used sigmoid/tanh and were notoriously hard to train past a few layers; switching to ReLU was a turning point. Explain the main mechanism by which ReLU improved trainability of deep nets, name at least one additional practical benefit, and state one drawback ReLU introduced (with a named remedy).

Level D · Paper-reading practice · ch06-D2

Choosing a positivity-preserving output activation

You are designing a network whose scalar output must be strictly positive (e.g. it predicts the variance $v$ of a Gaussian, or a rate parameter). Compare using $\mathrm{ReLU}$, $\exp$, and softplus $\zeta$ as the final activation. Which would you choose and why? Address the range, gradient behavior, and numerical stability of each.