Functions

Why functions are the whole game

Strip away the jargon and a machine-learning model is one thing: a function. It takes an input — an image, a sentence, a row of features — and returns an output — a label, a probability, a next word. Training is the search for a good function inside a huge family of candidates. Even the training objective, the loss, is itself a function: it takes the candidate's parameters and returns a single number saying how badly it does.

So before we can talk about learning anything, we need to be fluent with functions — not the vague "plug in a number" version from school, but the precise object mathematicians use: a rule that assigns to every input exactly one output. That precision is what lets us compose functions into deep networks, invert them when we need to undo a transform, and reason about which inputs are even allowed. This chapter builds that fluency, seen three ways at once:

Formally, as a mapping $f: X \to Y$ from a domain to a codomain.
Graphically, as a curve you can read off an axis.
Computationally, as a Python function you can evaluate on a grid.

Intuition: a function is a reliable machine

Picture a machine with an input slot and an output slot. Feed it a value $x$ and it hands back a value $f(x)$. The one rule that makes it a function rather than just a "process" is determinism: the same input always yields the same output. Put in $2$ and get $4$; put in $2$ again and you must get $4$ again, forever. A machine that sometimes returns $4$ and sometimes $5$ for the same input is not a function.

That single rule has teeth. It forbids one input mapping to two outputs, which is exactly the "vertical line test" you may remember: no vertical line may cross the graph twice. It says nothing, though, about two inputs sharing one output — that is allowed, and whether it happens is precisely the question of invertibility we return to later.
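For a finite list of (input, output) pairs, the single-valued rule can be checked mechanically. The sketch below is illustrative only: the helper name `is_function` and the sample pairs are assumptions introduced here, and it tests only the "exactly one output" clause, not whether every input in a domain is covered.

```python
def is_function(pairs):
    """Return True if no input appears with two different outputs."""
    seen = {}
    for x, y in pairs:
        if x in seen and seen[x] != y:
            return False   # same input, two outputs: fails the vertical line test
        seen[x] = y
    return True

# Two inputs sharing one output is allowed (this is x -> x^2 on five points).
squares = [(-2, 4), (-1, 1), (0, 0), (1, 1), (2, 4)]
# One input with two outputs is not allowed (points on a circle).
circle = [(0, 1), (0, -1), (1, 0)]

print(is_function(squares))  # True
print(is_function(circle))   # False
```

Note that `squares` passes even though $-2$ and $2$ share the output $4$; that sharing matters for invertibility, not for being a function.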

Interactive Lab: Function Explorer

Drag the input slider above and watch the output move. Notice the graph is just a record of every (input, output) pair the machine can produce: the height of the curve at horizontal position $x$ is the value $f(x)$.

Formal definitions

A function $f: X \to Y$ is a rule that assigns to each element of $X$ exactly one element of $Y$.

Two clauses do all the work. "Each element of $X$" means the function must be defined on every input in the domain, with no gaps. "Exactly one element of $Y$" means it must be single-valued: no input produces two outputs. Together they are the determinism from the intuition, stated set-theoretically.

Symbol	Meaning	Type	Shape	Role
$f: X \to Y$	A function from domain X to codomain Y	mapping	—	definition
$x$	An input (element of the domain)	scalar	1	variable
$f(x)$	The output at x (the value)	scalar	1	variable
$X$	Domain — set of allowed inputs	set	—	fixed
$Y$	Codomain — set of possible outputs	set	—	fixed
$\operatorname{range}(f)$	Set of outputs actually attained	set	—	derived

The domain is not decoration — it is part of the function's identity. The rule $x \mapsto 1/x$ on $X = \mathbb{R}$ is not a function (it is undefined at $0$ ); the same rule on $X = \mathbb{R}\setminus\{0\}$ is one. Changing the domain changes the object.
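One way to make the domain explicit in code is to reject inputs that fall outside it. This is a minimal sketch, assuming a hypothetical helper named `reciprocal`; raising `ValueError` is a design choice made for illustration, not something the lesson prescribes.

```python
def reciprocal(x):
    """The rule x -> 1/x, with domain: all real numbers except 0."""
    if x == 0:
        raise ValueError("0 is not in the domain of reciprocal")
    return 1.0 / x

print(reciprocal(4.0))   # 0.25

try:
    reciprocal(0.0)
except ValueError as err:
    print("rejected:", err)
```

The rule `1.0 / x` is the same in both cases; what changes is the set of inputs the function agrees to accept.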

Composition

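Feeding one machine's output into another's input composes them: the composition $g \circ f$ is defined by $(g \circ f)(x) = g(f(x))$, so $f$ runs first and $g$ runs on its result. This requires every output of $f$ to be an allowed input of $g$.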

The right-to-left reading trips everyone up once. In $g \circ f$ the function written last runs first, because it is the one sitting next to the input $x$ . Order matters: in general

$$g \circ f \;\neq\; f \circ g \tag{5.1}$$

Composition is the reason a deep network is "deep": a two-layer network is $f_2 \circ f_1$ , an $L$ -layer network is $f_L \circ \cdots \circ f_2 \circ f_1$ , and the chain rule we meet later is exactly the rule for differentiating such a stack.
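The stack $f_L \circ \cdots \circ f_2 \circ f_1$ can be sketched as a small helper that chains any number of functions. The names `compose`, `f1`, `f2`, and `f3` are hypothetical, chosen only for this illustration; the toy "layers" are plain arithmetic, not real network layers.

```python
from functools import reduce

def compose(*funcs):
    """compose(f_L, ..., f_2, f_1)(x) == f_L(...f_2(f_1(x)))."""
    def composed(x):
        # Apply the rightmost function first, matching the right-to-left reading.
        return reduce(lambda value, fn: fn(value), reversed(funcs), x)
    return composed

def f1(x):
    return x + 1.0

def f2(x):
    return 2.0 * x

def f3(x):
    return x ** 2

net = compose(f3, f2, f1)          # f3 o f2 o f1
print(net(3.0))                    # ((3 + 1) * 2) ** 2 = 64.0

reversed_net = compose(f1, f2, f3)  # f1 o f2 o f3
print(reversed_net(3.0))           # (3 ** 2) * 2 + 1 = 19.0
```

The argument order mirrors the written notation, so the last function listed is the first one applied.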

Inverse functions

The inverse of $f$, when it exists, is the function $f^{-1}$ that undoes it: $f^{-1}(f(x)) = x$ for every $x$ in the domain.

One-to-one is the crux. If two different inputs $x_1 \neq x_2$ share an output $f(x_1) = f(x_2) = y$, then $f^{-1}(y)$ cannot decide between them, so no inverse can exist. Graphically this is the horizontal line test: $f$ is invertible on its range if and only if no horizontal line meets the graph more than once. Note that $f^{-1}$ means the inverse function, not the reciprocal $1/f$, a genuinely unfortunate collision of notation.
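A rough numerical version of the horizontal line test is to look for repeated outputs on a grid. This is a sketch under stated assumptions: the helper name `looks_one_to_one` is hypothetical, and a finite grid can only expose a collision; finding none does not prove the function is one-to-one.

```python
import numpy as np

def looks_one_to_one(func, xs):
    """True if no two distinct grid inputs give (numerically) the same output."""
    ys = np.sort(func(xs))
    # After sorting, any repeated output shows up as a zero gap between neighbors.
    return not np.any(np.isclose(np.diff(ys), 0.0))

xs = np.linspace(-2.0, 2.0, 9)   # distinct inputs, symmetric about 0

print(looks_one_to_one(lambda x: 2.0 * x + 1.0, xs))  # True
print(looks_one_to_one(lambda x: x ** 2, xs))         # False: x and -x collide
```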

A numerical example

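Let $f(x) = 2x + 1$ and $g(x) = x^2$, both on domain $\mathbb{R}$. Evaluate each composition at $x = 3$. For $g \circ f$, apply $f$ first: $f(3) = 7$, then $g(7) = 49$. For $f \circ g$, apply $g$ first: $g(3) = 9$, then $f(9) = 19$. The two orders give $49$ and $19$.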

Worked composition: the two orders as formulas

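Rather than plug in one point, compose symbolically to see why the orders differ as functions, not just at $x = 3$. Substituting, $(g \circ f)(x) = (2x + 1)^2 = 4x^2 + 4x + 1$, while $(f \circ g)(x) = 2x^2 + 1$. Their difference is $2x^2 + 4x = 2x(x + 2)$, which vanishes only at $x = 0$ and $x = -2$; at every other input the two orders disagree.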
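For $f(x) = 2x + 1$ and $g(x) = x^2$, the two compositions are $(2x + 1)^2$ and $2x^2 + 1$, so they agree exactly where their difference $2x^2 + 4x$ is zero. The sketch below finds those inputs with `np.roots`; the variable names are illustrative assumptions.

```python
import numpy as np

def f(x):
    return 2.0 * x + 1.0

def g(x):
    return x ** 2

# (g o f)(x) - (f o g)(x) = (4x^2 + 4x + 1) - (2x^2 + 1) = 2x^2 + 4x
difference_coeffs = [2.0, 4.0, 0.0]          # highest power first
crossings = np.sort(np.roots(difference_coeffs))
print("orders agree at x =", crossings)

# At the crossings the two orders match; at x = 3 they do not.
for x in crossings:
    assert np.isclose(g(f(x)), f(g(x)))
assert not np.isclose(g(f(3.0)), f(g(3.0)))
```

Two agreement points out of infinitely many inputs is why $g \circ f$ and $f \circ g$ are different functions even though they are not different at every single input.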

We can also invert $f$. To undo $y = 2x + 1$, solve for $x$: subtract $1$, divide by $2$, giving $f^{-1}(y) = (y - 1)/2$. Check: $f^{-1}(f(x)) = (2x + 1 - 1)/2 = x$. Because $f$ is a line with nonzero slope it is one-to-one, so the inverse exists. By contrast $g(x) = x^2$ on all of $\mathbb{R}$ is not invertible: $g(2) = g(-2) = 4$, so $g^{-1}(4)$ is ambiguous. Restrict the domain to $x \ge 0$ and it becomes one-to-one, with inverse $\sqrt{y}$: the restriction is what lets the nonnegative square root serve as a genuine inverse.
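The effect of the domain restriction shows up in a round-trip check: applying $\sqrt{\cdot}$ after squaring recovers the input only for $x \ge 0$. A minimal sketch, with `g_inv` and the two grids as illustrative names:

```python
import numpy as np

def g(x):
    return x ** 2

def g_inv(y):
    return np.sqrt(y)   # inverse of g only on the restricted domain x >= 0

xs_nonneg = np.linspace(0.0, 3.0, 7)    # restricted domain
xs_all = np.linspace(-3.0, 3.0, 7)      # includes negative inputs

print(np.allclose(g_inv(g(xs_nonneg)), xs_nonneg))  # True
print(np.allclose(g_inv(g(xs_all)), xs_all))        # False
print(g_inv(g(-2.0)))                               # 2.0, not -2.0
```

On the full real line the round trip returns $|x|$, which is exactly the ambiguity between $2$ and $-2$ described above.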

ML use case: a model and a loss are just functions

Two functions sit at the heart of every supervised learner, and keeping their inputs straight is half the battle.

The model is a function of the input, with the parameters held fixed:

$$\hat{y} = f(x; \theta) \tag{5.2}$$

Here $x$ is the data point and $\theta$ (the weights and biases) is a knob-setting you carry along. The semicolon is doing real work: it separates the input $x$ from the parameters $\theta$ . At prediction time $\theta$ is frozen and $x$ varies — the model is a function of $x$ .

The loss flips which argument varies. It measures how wrong the predictions are over a fixed dataset, as a function of the parameters:

$$L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i; \theta),\, y_i\big) \tag{5.3}$$

Now the data $(x_i, y_i)$ is frozen and $\theta$ varies — the loss is a function of $\theta$ . Training means finding the $\theta$ that minimizes $L$ . This input-swap is the single most important reframing in the course: the same expression $f(x_i; \theta)$ is read as a function of $x$ when predicting and as a function of $\theta$ when learning.

And $L$ is a composition: apply the model $f(\cdot\,; \theta)$ , then the per-example loss $\ell$ , then average. That layered structure is exactly what the chain rule will let us differentiate, so that we can compute $\partial L / \partial \theta$ and descend. Every idea in this chapter — mapping, domain, composition, invertibility — resurfaces the moment we start training.
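The input swap between (5.2) and (5.3) can be sketched with a one-feature linear model. Everything specific here is an assumption made for illustration: the model form $wx + b$ with $\theta = (w, b)$, the squared-error choice for $\ell$, and the tiny synthetic dataset.

```python
import numpy as np

def model(x, theta):
    """f(x; theta): a line, with theta = (w, b) carried along as parameters."""
    w, b = theta
    return w * x + b

def squared_error(y_hat, y):
    """Per-example loss l(y_hat, y); squared error is an assumed choice."""
    return (y_hat - y) ** 2

# A fixed toy dataset, generated from w = 2, b = 1.
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = 2.0 * xs + 1.0

def loss(theta):
    """L(theta): the data is frozen, only the parameters vary."""
    return np.mean(squared_error(model(xs, theta), ys))

# Prediction time: theta frozen, x varies.
theta_fixed = (2.0, 1.0)
print(model(5.0, theta_fixed))   # 11.0

# Training time: data frozen, theta varies.
print(loss((2.0, 1.0)))          # 0.0
print(loss((1.0, 0.0)))          # 7.5
```

`loss` takes only `theta` because the dataset is captured from the enclosing scope, which mirrors why $L$ is written as a function of $\theta$ alone. Its body is also visibly a composition: model, then per-example loss, then average.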

NumPy implementation

Let us make composition concrete. We implement $f$ and $g$ as ordinary Python functions, evaluate both compositions on a grid built with np.linspace, and confirm numerically that $f(g(x)) \neq g(f(x))$ in general. Run it:

composition.py

import numpy as np

np.random.seed(0)  # reproducibility (unused here, but a good habit)

# Two simple functions, written as plain Python that works on arrays.
def f(x):
  return 2.0 * x + 1.0       # a line: domain and range are all of R

def g(x):
  return x ** 2             # a parabola: range is [0, inf)

# The two composition orders, as functions themselves.
def g_after_f(x):
  return g(f(x))            # (g o f)(x) = (2x+1)^2

def f_after_g(x):
  return f(g(x))            # (f o g)(x) = 2x^2 + 1

# Evaluate on a grid of 9 points from -2 to 2.
xs = np.linspace(-2.0, 2.0, 9)
gf = g_after_f(xs)
fg = f_after_g(xs)

print("x    :", xs)
print("g(f) :", gf)
print("f(g) :", fg)

# They agree only at the crossing points x = 0 and x = -2, not everywhere.
same_everywhere = np.allclose(gf, fg)
print("equal everywhere:", same_everywhere)   # False
assert not same_everywhere, "composition order should matter"

# Spot-check the point x = 3 from the lesson.
assert np.isclose(g_after_f(3.0), 49.0)
assert np.isclose(f_after_g(3.0), 19.0)

# Invert f: solving y = 2x + 1 gives x = (y - 1)/2.
def f_inv(y):
  return (y - 1.0) / 2.0

assert np.allclose(f_inv(f(xs)), xs)   # f_inv undoes f on every grid point
print("ok")

The grid pattern — build inputs with np.linspace, push them through a function, read off the outputs — is how we will visualize every function from here on, including loss curves. Because f and g are written with array-friendly operations (*, +, **), they evaluate on the whole grid at once with no loop: the same vectorized thinking from the previous chapter.

Summary

A function $f: X \to Y$ assigns to each input in the domain $X$ exactly one output; the range is the set of outputs actually produced, a subset of the codomain $Y$ .
The domain is part of the function's identity: $1/x$ is undefined at $0$ , and restricting a domain can turn a non-function or a non-invertible map into a valid, invertible one.
Composition $(g \circ f)(x) = g(f(x))$ chains functions and is read right-to-left; in general $g \circ f \neq f \circ g$ , so order changes the result.
A function is invertible exactly when it is one-to-one; then $f^{-1}$ undoes it. $g(x)=x^2$ on $\mathbb{R}$ fails (it maps $2$ and $-2$ to $4$ ) until the domain is restricted.
In ML, the model $f(x; \theta)$ is a function of the input $x$ ; the loss $L(\theta)$ is a function of the parameters $\theta$ . The loss is a composition, which previews the chain rule and layered networks.

Active recall

Answer from memory before checking the lesson:

State the two conditions a rule must satisfy to be a function $f: X \to Y$ .
What is the difference between the codomain and the range?
Evaluate $(g \circ f)(2)$ and $(f \circ g)(2)$ for $f(x) = x + 3$ and $g(x) = 2x$ . Are they equal?
Why is $g(x) = x^2$ on all of $\mathbb{R}$ not invertible, and how do you fix it?
In $f(x; \theta)$ versus $L(\theta)$ , which argument varies at prediction time and which varies at training time?

Exercises

Level A: Recall & basic calculation

Level A · Hand calculation · ch05-A1

Evaluate a function

Let $f(x) = 3x - 4$ . Compute $f(5)$ .

Level A · Hand calculation · ch05-A2

Compose at a point

Let $f(x) = x + 3$ and $g(x) = 2x$ . Compute $(g \circ f)(4)$ .

Level A · Hand calculation · ch05-A3

Order matters

For $f(x) = x + 3$ and $g(x) = 2x$ , compute $(f \circ g)(4)$ . (Compare with A2, where $(g \circ f)(4) = 14$ .)

Level A · Hand calculation · ch05-A4

Find an inverse value

The function $f(x) = 2x + 1$ has inverse $f^{-1}(y) = \dfrac{y - 1}{2}$ . Compute $f^{-1}(9)$ .

Level A · Equation interpretation · ch05-A5

Domain of a reciprocal

For $f(x) = \dfrac{1}{x - 2}$ over the real numbers, which single value of $x$ must be excluded from the domain?

Level A · Equation interpretation · ch05-A6

Range of a square

For $g(x) = x^2$ with domain all real numbers, what is the smallest value in the range?

Level B: Conceptual understanding

Level B · Graph interpretation · ch05-B1

Is it a function?

A relation is graphed in the plane. Which single test decides whether it defines $y$ as a function of $x$ ?

Level B · Equation interpretation · ch05-B2

Codomain versus range

Explain in one or two sentences the difference between the codomain and the range of a function $f: X \to Y$ , using $f(x) = x^2$ with $X = Y = \mathbb{R}$ as an example.

Level B · Equation interpretation · ch05-B3

Which function is invertible?

Each function below has domain all of $\mathbb{R}$ . Which one is invertible (one-to-one)?

Level B · ML application · ch05-B4

Inputs of a model and a loss

A model is written $\hat{y} = f(x; \theta)$ and its loss $L(\theta)$ . In one or two sentences, say which quantity is held fixed and which varies (a) at prediction time and (b) at training time, and why $L$ is written as a function of $\theta$ alone.

Level C: Derivation & implementation

Level C · Derivation · ch05-C1

Compose two functions symbolically

Let $f(x) = 3x - 1$ and $g(x) = x^2 + 2$ . Derive closed-form expressions for $(g \circ f)(x)$ and $(f \circ g)(x)$ , and confirm they are different functions.

Level C · Derivation · ch05-C2

Derive an inverse

The function $f(x) = \dfrac{x - 4}{3}$ is one-to-one on $\mathbb{R}$ . Derive its inverse $f^{-1}(y)$ and verify $f^{-1}(f(x)) = x$ .

Level C · NumPy implementation · ch05-C3

Composition on a grid in NumPy

Implement $f(x) = x + 1$ and $g(x) = x^2$ as Python functions. Using np.linspace(-3, 3, 7), evaluate $g \circ f$ and $f \circ g$ on the grid, confirm with np.allclose that they are not equal everywhere, and print ok.

Level D: Research-thinking challenge

Level D · Paper-reading practice · ch05-D1

Why deep networks are compositions

A network layer is a function $f_\ell$ , and an $L$ -layer network is the composition $f_L \circ \cdots \circ f_1$ . (a) Explain why stacking only linear layers $f_\ell(x) = W_\ell x$ gains no expressive power over a single linear layer. (b) Explain what a nonlinear activation inserted between layers changes. (c) Connect the composition structure to why the chain rule is central to training.