Important Functions in Machine Learning
Linear, quadratic, sigmoid, tanh, ReLU, softplus
Prerequisites
Learning objectives
- Recognize the shape, domain, range, and asymptotes of each activation
- Explain what problem each nonlinearity solves in a network
- Relate sigmoid, tanh, and softplus algebraically
- Anticipate saturation and vanishing-gradient behavior from a graph
Why a network needs the right functions
A neuron, as we saw, computes a dot product plus a bias: $z = w^\top x + b$. That is a linear function of its input. Stack two linear layers and you get… another linear function, because the composition of linear maps is still linear. No matter how many layers you pile up, a purely linear network can only draw a straight decision boundary. It cannot learn XOR, cannot bend a curve, cannot separate two interleaved spirals.
The fix is to insert a nonlinear function after each linear step. That single scalar function — applied element-wise to every neuron's pre-activation — is what lets a deep network represent curved, non-linear relationships. These are the activation functions, and this chapter is a tour of the handful that matter: sigmoid, tanh, ReLU, Leaky ReLU, and softplus. For each one we care about the same short checklist: its shape, its domain and range, where it saturates, what problem it solves, and — most important for training — how its gradient behaves.
We start from two functions you already know, then build up.
Intuition: squashers vs. rectifiers
Two families cover almost everything in practice.
- Squashers (sigmoid, tanh) take the whole real line and crush it into a bounded interval: $(0, 1)$ for sigmoid, $(-1, 1)$ for tanh. They are smooth S-curves. The price of that bounded output is saturation: far out in either tail the curve goes flat, so its slope (the gradient) collapses toward zero.
- Rectifiers (ReLU and friends) leave positive inputs untouched and clamp negatives. They do not saturate on the positive side, so their gradient stays a healthy $1$ no matter how large the input, but they pay with a kink at the origin and a dead flat region for negative inputs.
Keep that trade-off in mind — bounded and smooth but saturating versus unbounded and non-saturating but kinked — and every choice below will make sense.
Select each function in the explorer and watch two things: the curve itself, and the derivative it plots underneath. The activation is only half the story; the derivative is what the network trains on. Where the derivative is near zero, learning stalls.
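For readers without the explorer at hand, the same comparison can be sketched in a few lines of NumPy. The function names and the sample points below are illustrative choices, not taken from the course code, and the ReLU derivative at exactly zero is set to 0 by convention.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

def d_tanh(x):
    # tanh'(x) = 1 - tanh(x)^2
    return 1.0 - np.tanh(x) ** 2

def d_relu(x):
    # 1 for x > 0, 0 for x < 0; 0 is assumed at the kink x == 0
    return (x > 0).astype(float)

# Illustrative sample points spanning both tails and the middle
xs = np.array([-10.0, -2.0, 0.5, 2.0, 10.0])
for name, d in [("sigmoid", d_sigmoid), ("tanh", d_tanh), ("relu", d_relu)]:
    print(f"{name:8s}", np.round(d(xs), 5))
```

The squashers' derivatives shrink toward zero at both ends of the grid, while the ReLU derivative stays at 1 for every positive input.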
Formal definitions
Recall the two baselines from the function chapter. A linear function $f(x) = ax + b$ has a constant slope $a$ and, for $a \neq 0$, range $\mathbb{R}$; a quadratic $f(x) = ax^2 + bx + c$ is the prototype convex bowl (for $a > 0$) with a single minimum. Activations are the nonlinear layer we add on top.
| Symbol | Meaning | Type | Shape | Role |
|---|---|---|---|---|
| $\sigma(x)$ | Sigmoid / logistic | function | ℝ→(0,1) | activation |
| $\tanh(x)$ | Hyperbolic tangent (zero-centered) | function | ℝ→(−1,1) | activation |
| $\mathrm{ReLU}(x)$ | Rectified linear unit, max(0,x) | function | ℝ→[0,∞) | activation |
| $\alpha$ | Leak slope for negative inputs | scalar | 1 | hyperparameter |
| $\mathrm{softplus}(x)$ | Softplus, ln(1+eˣ) | function | ℝ→(0,∞) | activation |
| $z$ | Pre-activation wᵀx + b (the input to σ) | scalar | 1 | variable |
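Written out in full, the standard textbook definitions of these activations are as follows. The notation (in particular $\mathrm{LeakyReLU}_{\alpha}$ and the choice to put $x = 0$ on the leaky branch) is an assumption; the course may use slightly different symbols or conventions.

```latex
\begin{aligned}
\sigma(x) &= \frac{1}{1 + e^{-x}}, \\
\tanh(x) &= \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}, \\
\mathrm{ReLU}(x) &= \max(0, x), \\
\mathrm{LeakyReLU}_{\alpha}(x) &=
  \begin{cases} x, & x > 0 \\ \alpha x, & x \le 0 \end{cases}, \\
\mathrm{softplus}(x) &= \ln\left(1 + e^{x}\right).
\end{aligned}
```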
Numerical example: reading a sigmoid off the number line
Plug three inputs into $\sigma(z) = \frac{1}{1 + e^{-z}}$ to feel the saturation directly.
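A minimal sketch of such a calculation, assuming the illustrative inputs $z = -5, 0, 5$ (these specific numbers are an assumption, not necessarily the ones used in the course):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative inputs: one deep in each tail, one at the center
for z in (-5.0, 0.0, 5.0):
    s = sigmoid(z)
    slope = s * (1.0 - s)  # sigmoid derivative at z
    print(f"z = {z:+.1f}  sigmoid = {s:.4f}  slope = {slope:.4f}")
```

At $z = 0$ the output is $0.5$ and the slope is $0.25$. At $z = \pm 5$ the output is already within about $0.007$ of $0$ or $1$, and the slope has dropped below $0.007$.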
Derivations
Sigmoid and tanh are the same curve, rescaled
The two squashers are not independent functions — one is an affine rescaling of the other.
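One way to see this, starting from the standard definition $\sigma(x) = 1/(1 + e^{-x})$, is the short calculation below. It is a sketch of the usual argument, not necessarily the exact derivation used in the course.

```latex
\begin{aligned}
2\,\sigma(2x) - 1
  &= \frac{2}{1 + e^{-2x}} - 1
   = \frac{1 - e^{-2x}}{1 + e^{-2x}} \\
  &= \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
     \qquad \text{(multiply numerator and denominator by } e^{x}\text{)} \\
  &= \tanh(x).
\end{aligned}
```

So tanh is the sigmoid with its input scaled by 2, its output stretched by 2, and then shifted down by 1.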
Softplus is the integral of the sigmoid
The claim is worth doing by hand because it explains the whole design of softplus.
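A sketch of the calculation, assuming the standard definitions $\mathrm{softplus}(x) = \ln(1 + e^{x})$ and $\sigma(x) = 1/(1 + e^{-x})$:

```latex
\begin{aligned}
\frac{d}{dx}\,\ln\left(1 + e^{x}\right)
  &= \frac{e^{x}}{1 + e^{x}}
   = \frac{1}{1 + e^{-x}}
   = \sigma(x), \\
\int_{-\infty}^{x} \sigma(t)\,dt
  &= \ln\left(1 + e^{x}\right) - \lim_{t \to -\infty} \ln\left(1 + e^{t}\right)
   = \ln\left(1 + e^{x}\right).
\end{aligned}
```

Because $\sigma$ is a smooth step from $0$ to $1$, integrating it gives a smooth ramp from slope $0$ to slope $1$, which is exactly a softened ReLU.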
ML use case: nonlinearity, saturation, and why ReLU won
Nonlinearity is the point. A layer computes $h = \phi(Wx + b)$, where $\phi$ is the activation applied element-wise. Remove $\phi$ (or make it linear) and a hundred-layer network algebraically collapses to a single linear map. The activation is precisely what buys representational power: the ability to approximate curved functions and carve non-linear decision boundaries.
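The collapse can be checked numerically. The sketch below uses hypothetical layer sizes and random weights; the names `W1`, `b1`, `W2`, `b2` are illustrative and not taken from the course.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 3 inputs -> 4 hidden units -> 2 outputs
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Two stacked layers with no activation in between
two_layers = W2 @ (W1 @ x + b1) + b2

# The same map written as a single linear layer
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))
```

The algebra behind the check is $W_2(W_1 x + b_1) + b_2 = (W_2 W_1)x + (W_2 b_1 + b_2)$, which holds for any number of stacked layers.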
Saturation kills gradients. During backpropagation the update to an early weight contains the product of the derivatives of every activation along the path (the chain rule, next chapter). Sigmoid's derivative never exceeds $\tfrac{1}{4}$, and out in the tails it is essentially $0$. Multiply a dozen numbers, each at most $0.25$, together and the signal reaching the first layer is at most $0.25^{12} \approx 6 \times 10^{-8}$. This is the vanishing gradient problem that made deep sigmoid/tanh networks nearly untrainable.
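To put numbers on this, the sketch below computes the best-case bound on the product of sigmoid derivatives for a few depths. It deliberately ignores the weight factors that also appear in the chain rule, and the helper name `sigmoid_chain_bound` is hypothetical.

```python
def sigmoid_chain_bound(n_layers, max_slope=0.25):
    """Upper bound on a product of n sigmoid derivatives (each <= 0.25)."""
    return max_slope ** n_layers

for n in (1, 4, 12, 20):
    print(f"{n:2d} layers: product <= {sigmoid_chain_bound(n):.3e}")
```

This is the bound when every unit sits at its steepest point; saturated units make the product far smaller still.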
ReLU is the default because it does not saturate on the positive side. For any $x > 0$ its derivative is exactly $1$, so gradients pass through the activation undiminished no matter how deep the stack. It is also trivially cheap: a single comparison. The costs are the kink at $x = 0$ (a subgradient handles it) and dying units: a neuron stuck at $x < 0$ has gradient $0$ and may never recover. Leaky ReLU's small negative slope and the smooth softplus both exist to soften that failure mode. In practice ReLU remains the first thing you reach for in a hidden layer, while sigmoid survives at the output of a binary classifier, where its $(0, 1)$ range is read as a probability.
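The dying-unit problem is easiest to see in the derivatives. The following sketch assumes a leak of $\alpha = 0.1$ and takes the subgradient at $x = 0$ from the negative branch; both are illustrative conventions, and the function names are hypothetical.

```python
import numpy as np

def relu_grad(x):
    # 1 for x > 0, otherwise 0 (assumed convention at x == 0)
    return np.where(x > 0, 1.0, 0.0)

def leaky_relu_grad(x, alpha=0.1):
    # 1 for x > 0, otherwise the leak slope alpha
    return np.where(x > 0, 1.0, alpha)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu_grad(z))        # zero gradient wherever z <= 0
print(leaky_relu_grad(z))  # small but nonzero gradient wherever z <= 0
```

A unit whose pre-activation is negative for every training example gets no gradient at all under ReLU, whereas under Leaky ReLU it still receives a gradient scaled by $\alpha$ and can move back into the active region.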
NumPy implementation
Let us implement all four core activations in a vectorized way (one call evaluates the whole np.linspace grid at once) and then verify the tanh identity numerically. The one subtlety is softplus: the naive np.log(1 + np.exp(x)) overflows for large x, so we use the numerically stable np.logaddexp(0, x), which computes $\ln(e^{0} + e^{x})$ without ever forming $e^{x}$ directly.
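The course's original listing is not reproduced here, so the block below is a minimal sketch of what such an implementation could look like. The grid bounds, tolerances, and the finite-difference step are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def softplus(x):
    # ln(1 + e^x) = ln(e^0 + e^x), evaluated without overflow
    return np.logaddexp(0.0, x)

# One vectorized call per activation evaluates the whole grid
x = np.linspace(-10.0, 10.0, 2001)

# Sigmoid saturates in the tails
assert sigmoid(x[0]) < 1e-4 and sigmoid(x[-1]) > 1 - 1e-4

# ReLU zeros the negatives and passes positives through unchanged
assert np.all(relu(x[x < 0]) == 0.0)
assert np.allclose(relu(x[x > 0]), x[x > 0])

# tanh is a rescaled sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
assert np.allclose(tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0)

# Softplus differentiates to the sigmoid (central finite difference)
h = 1e-5
num_grad = (softplus(x + h) - softplus(x - h)) / (2.0 * h)
assert np.allclose(num_grad, sigmoid(x), atol=1e-6)

# The stable form stays finite where the naive formula would overflow
assert np.isfinite(softplus(1000.0))

print("all checks passed")
```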
Every assertion here is a fact from the sections above turned into a runnable check: sigmoid saturates in the tails, ReLU zeros the negatives, tanh is a rescaled sigmoid, and softplus differentiates to the sigmoid. When your math and your code agree on a grid of points, you can trust both.
Part II checkpoint — read equations like a researcher
You will meet these two equations constantly in papers. Work each one through the nine-step routine before revealing the solution.
The logistic (sigmoid) prediction
The core of logistic regression and of a single sigmoid output neuron: a linear score is squashed into a probability, $\hat{y} = \sigma(w^\top x + b)$.
Work through these steps:
- Identify every symbol.
- State the type of every object (scalar, vector, matrix, index, set, function).
- State the dimensions / shapes.
- Rewrite the equation in plain English.
- Expand it for a tiny concrete example.
- Identify the assumptions.
- Convert it to pseudocode.
- Implement it in NumPy.
- Explain its machine-learning purpose.
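As a reference point for the "tiny concrete example" and "Implement it in NumPy" steps, here is one possible sketch. It assumes the prediction has the standard form $\hat{y} = \sigma(w^\top x + b)$; the function name `predict_proba` and all numbers are hypothetical and chosen only for illustration.

```python
import numpy as np

def predict_proba(w, x, b):
    """Return sigmoid(w . x + b), a value in (0, 1)."""
    z = np.dot(w, x) + b          # linear score (scalar)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical two-feature example
w = np.array([2.0, -1.0])   # weights, shape (2,)
x = np.array([1.0, 3.0])    # input features, shape (2,)
b = 0.5                     # bias, scalar

p = predict_proba(w, x, b)
print(round(float(p), 4))
```

Here the score is $z = 2 \cdot 1 - 1 \cdot 3 + 0.5 = -0.5$, so the predicted probability is $\sigma(-0.5) \approx 0.378$, slightly below an even chance.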
Softplus activation
A smooth, everywhere-differentiable stand-in for ReLU whose derivative is the sigmoid: $\mathrm{softplus}(x) = \ln(1 + e^{x})$.
Work through these steps:
- Identify every symbol.
- State the type of every object (scalar, vector, matrix, index, set, function).
- State the dimensions / shapes.
- Rewrite the equation in plain English.
- Expand it for a tiny concrete example.
- Identify the assumptions.
- Convert it to pseudocode.
- Implement it in NumPy.
- Explain its machine-learning purpose.
Summary
- A network needs nonlinear activations between linear layers; without them, any depth collapses to a single linear map and cannot model curved relationships.
- Sigmoid maps $\mathbb{R} \to (0, 1)$, S-shaped, $\sigma(0) = 0.5$; derivative $\sigma'(x) = \sigma(x)(1 - \sigma(x)) \le \tfrac{1}{4}$, so it saturates and causes vanishing gradients.
- Tanh maps $\mathbb{R}$ to $(-1, 1)$, is zero-centered, and equals a rescaled sigmoid: $\tanh(x) = 2\sigma(2x) - 1$. Also saturates.
- ReLU $\max(0, x)$: unbounded above, non-saturating for $x > 0$ (gradient $1$), a kink at $x = 0$, and a dead zone for $x < 0$. It is the deep-learning default.
- Leaky ReLU gives negatives a small slope $\alpha$ to avoid dying units; softplus is a smooth ReLU whose derivative is exactly $\sigma(x)$.
- Implement activations vectorized; compute softplus with np.logaddexp(0, x) to avoid overflow.
Active recall
Answer from memory before checking the lesson:
- Why can't a network of only linear layers (no activation) learn a curved decision boundary?
- State the range of sigmoid, tanh, and ReLU. Which one is zero-centered?
- What is $\sigma(0)$, and what is the maximum value of $\sigma'(x)$? Where does it occur?
- Explain in one sentence why saturated sigmoids cause vanishing gradients in a deep network.
- What single advantage makes ReLU the default hidden activation, and what is its main failure mode?
- Prove that $f'(x) = \sigma(x)$ for softplus $f(x) = \ln(1 + e^{x})$.
Exercises
Level A: Recall & basic calculation
Sigmoid at zero
Compute $\sigma(0)$ for the sigmoid $\sigma(x) = \frac{1}{1 + e^{-x}}$.
ReLU of a negative input
Compute .
ReLU of a positive input
Compute .
Range of tanh
What is the range of $\tanh(x)$?
Leaky ReLU on a negative input
Using the course's Leaky ReLU with leak , so for and for , compute .
Maximum slope of the sigmoid
The sigmoid derivative is $\sigma'(x) = \sigma(x)(1 - \sigma(x))$. What is its maximum value?
Level B: Conceptual understanding
Which activation is zero-centered?
Which of these hidden-layer activations is zero-centered (outputs symmetric about $0$)?
Match the graph to the function
A plotted curve is $0$ for all negative inputs, then rises as a straight line of slope $1$ for positive inputs, with a sharp corner at the origin. Which function is it?
Range of softplus
What is the range of softplus $\ln(1 + e^{x})$?
Saturation and vanishing gradients
In a deep network of sigmoid layers, explain why pushing many neurons into their saturated regions (large $|z|$) makes the early layers train slowly. Reference the chain rule and the size of $\sigma'(z)$.
Sigmoid output, ReLU hidden
A binary classifier commonly uses ReLU in its hidden layers but a sigmoid on the final output neuron. Give the reason for each choice in one sentence.
Level C: Derivation & implementation
Implement ReLU and Leaky ReLU
Implement vectorized relu(x) and leaky_relu(x, alpha=0.1) for a 1-D NumPy array. Verify on x = np.array([-2.0, 0.0, 3.0]) that ReLU gives [0, 0, 3] and Leaky ReLU gives [-0.2, 0, 3], then print ok.
Numerically stable softplus
Implement softplus(x) that does not overflow for large inputs, and confirm softplus(1000.0) is finite (and $\approx 1000$) while the naive np.log(1 + np.exp(1000.0)) is inf. Also check numerically that the derivative of softplus equals the sigmoid. Print ok.
Derive the sigmoid's derivative
Show that $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$ for $\sigma(x) = \frac{1}{1 + e^{-x}}$.
Simulate a vanishing gradient
Empirically show gradient vanishing: multiply together the sigmoid derivatives across a stack of layers whose pre-activations are all saturated (large $|z|$). Compare the product for a 20-layer sigmoid stack against a 20-layer ReLU stack (per-layer derivative $1$ for $x > 0$). Print both products, assert the sigmoid product is far smaller, and print ok.
Level D: Research-thinking challenge
Why did ReLU help deep networks?
Before ~2011, deep networks used sigmoid/tanh and were notoriously hard to train past a few layers; switching to ReLU was a turning point. Explain the main mechanism by which ReLU improved trainability of deep nets, name at least one additional practical benefit, and state one drawback ReLU introduced (with a named remedy).
Choosing a positivity-preserving output activation
You are designing a network whose scalar output must be strictly positive (e.g. it predicts the variance of a Gaussian, or a rate parameter). Compare using , , and softplus as the final activation. Which would you choose and why? Address the range, gradient behavior, and numerical stability of each.