Part 2 · Functions and Graphs · Chapter 5 · 60 min

Functions

Mappings, domain, range, composition, and inverses

Learning objectives

  • State the formal definition of a function and read function notation
  • Determine domain and range, and interpret graphs
  • Compose functions and reason about invertibility
  • See a model f(x; θ) and a loss L(θ) as functions

Why functions are the whole game

Strip away the jargon and a machine-learning model is one thing: a function. It takes an input — an image, a sentence, a row of features — and returns an output — a label, a probability, a next word. Training is the search for a good function inside a huge family of candidates. Even the training objective, the loss, is itself a function: it takes the candidate's parameters and returns a single number saying how badly it does.

So before we can talk about learning anything, we need to be fluent with functions — not the vague "plug in a number" version from school, but the precise object mathematicians use: a rule that assigns to every input exactly one output. That precision is what lets us compose functions into deep networks, invert them when we need to undo a transform, and reason about which inputs are even allowed. This chapter builds that fluency, seen three ways at once:

  • Formally, as a mapping $f: X \to Y$ from a domain to a codomain.
  • Graphically, as a curve you can read off an axis.
  • Computationally, as a Python function you can evaluate on a grid.

Intuition: a function is a reliable machine

Picture a machine with an input slot and an output slot. Feed it a value $x$ and it hands back a value $f(x)$. The one rule that makes it a function rather than just a "process" is determinism: the same input always yields the same output. Put in $2$ and get $4$; put in $2$ again and you must get $4$ again, forever. A machine that sometimes returns $4$ and sometimes $5$ for the same input is not a function.

That single rule has teeth. It forbids one input mapping to two outputs, which is exactly the "vertical line test" you may remember: no vertical line may cross the graph twice. It says nothing, though, about two inputs sharing one output — that is allowed, and whether it happens is precisely the question of invertibility we return to later.
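
The two situations can be told apart mechanically. The sketch below treats a finite relation as a list of (input, output) pairs; the helper names `is_function` and `shares_outputs` and the sample data are illustrative assumptions, not part of the lesson.

```python
def is_function(pairs):
    """Return True if no input appears with two different outputs."""
    seen = {}
    for x, y in pairs:
        if x in seen and seen[x] != y:
            return False  # same input, two outputs: fails the vertical line test
        seen[x] = y
    return True


def shares_outputs(pairs):
    """Return True if two different inputs map to the same output."""
    first_input = {}
    for x, y in pairs:
        if y in first_input and first_input[y] != x:
            return True  # allowed for a function, but it blocks invertibility
        first_input[y] = x
    return False


squares = [(-2, 4), (-1, 1), (0, 0), (1, 1), (2, 4)]  # samples of x -> x**2
flaky = [(2, 4), (2, 5)]                              # input 2 gives 4 and 5

print(is_function(squares))     # True
print(shares_outputs(squares))  # True
print(is_function(flaky))       # False
```

The squaring samples pass the function check even though inputs share outputs, which is exactly the asymmetry described above.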

Interactive lab: Function Explorer

Drag the input slider above and watch the output move. Notice the graph is just a record of every (input, output) pair the machine can produce: the height of the curve at horizontal position $x$ is the value $f(x)$.

Formal definitions

A function $f: X \to Y$ assigns to each element of $X$ exactly one element of $Y$. Two clauses do all the work. "Each element of $X$" means the function must be defined on every input in the domain, with no gaps. "Exactly one element of $Y$" means it must be single-valued: no input produces two outputs. Together they are the determinism from the intuition, stated set-theoretically.
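
In symbols, the two clauses can be combined into a single quantified statement. This is the standard textbook form, and the name $\operatorname{range}(f)$ is a notational assumption for the set of outputs actually produced:

```latex
\[
\forall x \in X \;\; \exists!\, y \in Y \;:\; y = f(x)
\]
\[
\operatorname{range}(f) = \{\, f(x) : x \in X \,\} \subseteq Y
\]
```

The quantifier $\forall$ carries the "no gaps" clause and $\exists!$ (exists exactly one) carries the "single-valued" clause.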

The domain is not decoration; it is part of the function's identity. The rule $x \mapsto 1/x$ on $X = \mathbb{R}$ is not a function (it is undefined at $0$); the same rule on $X = \mathbb{R}\setminus\{0\}$ is one. Changing the domain changes the object.
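
One way to make the domain explicit in code is to reject inputs that lie outside it. This is a minimal sketch; the function name `reciprocal` and the choice to raise `ValueError` are illustrative assumptions.

```python
def reciprocal(x):
    """The rule x -> 1/x, declared on the nonzero real numbers only."""
    if x == 0:
        # 0 is not in the domain, so the function refuses it outright
        raise ValueError("0 is outside the domain of reciprocal")
    return 1 / x


print(reciprocal(4))     # 0.25
print(reciprocal(-0.5))  # -2.0
```

Calling `reciprocal(0)` raises an error instead of returning a value: the rule is the same, but the stated domain decides which inputs are allowed.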

Composition

Feeding one machine's output into another's input composes them: $(g \circ f)(x) = g(f(x))$.

The right-to-left reading trips everyone up once. In $g \circ f$ the function written last runs first, because it is the one sitting next to the input $x$. Order matters: in general $g \circ f \neq f \circ g$.

Composition is the reason a deep network is "deep": a two-layer network is $f_2 \circ f_1$, an $L$-layer network is $f_L \circ \cdots \circ f_2 \circ f_1$, and the chain rule we meet later is exactly the rule for differentiating such a stack.
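
A stack of layers can be sketched as a fold over a list of functions. The helper `compose` and the toy layers `f1`, `f2`, `f3` below are hypothetical stand-ins chosen for illustration; real network layers would be parameterized.

```python
from functools import reduce


def compose(*layers):
    """Chain layers given in running order: the first argument runs first."""
    def composed(x):
        # Feed x through each layer in turn, passing each output to the next
        return reduce(lambda value, layer: layer(value), layers, x)
    return composed


def f1(x):
    return 2 * x + 1


def f2(x):
    return x ** 2


def f3(x):
    return x - 10


network = compose(f1, f2, f3)  # mathematically f3 after f2 after f1
print(network(3))              # f1(3) = 7, f2(7) = 49, f3(49) = 39
```

Note that `compose` takes its arguments in running order, which is the reverse of the written order $f_3 \circ f_2 \circ f_1$.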

Inverse functions

One-to-one is the crux. If two different inputs $x_1 \neq x_2$ share an output $f(x_1) = f(x_2) = y$, then $f^{-1}(y)$ cannot decide between them, so no inverse can exist. Graphically this is the horizontal line test: $f$ is invertible iff no horizontal line meets the graph more than once. Note $f^{-1}$ means the inverse function, not the reciprocal $1/f$, a genuinely unfortunate collision of notation.
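
Stated symbolically, and assuming the usual convention that the inverse is defined on the outputs $f$ actually produces, the one-to-one condition and the "undoing" property read:

```latex
\[
f(x_1) = f(x_2) \;\Longrightarrow\; x_1 = x_2
\qquad \text{for all } x_1, x_2 \in X
\]
\[
f^{-1}(f(x)) = x \ \text{ for all } x \in X,
\qquad
f(f^{-1}(y)) = y \ \text{ for all } y \text{ in the range of } f
\]
```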

A numerical example

Let $f(x) = 2x + 1$ and $g(x) = x^2$, both on domain $\mathbb{R}$. Evaluate each composition at $x = 3$: $(g \circ f)(3) = g(f(3)) = g(7) = 49$, while $(f \circ g)(3) = f(g(3)) = f(9) = 19$.

Worked composition: the two orders as formulas

Rather than plug in one point, compose symbolically to see that the two orders give different functions, not just different values at $x = 3$.
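
With $f(x) = 2x + 1$ and $g(x) = x^2$ as defined in the numerical example, the two compositions expand as follows:

```latex
\begin{align*}
(g \circ f)(x) &= g(2x + 1) = (2x + 1)^2 = 4x^2 + 4x + 1 \\
(f \circ g)(x) &= f(x^2) = 2x^2 + 1 \\
(g \circ f)(x) - (f \circ g)(x) &= 2x^2 + 4x = 2x(x + 2)
\end{align*}
```

The difference vanishes only at $x = 0$ and $x = -2$, so the two orders agree at exactly those two inputs and disagree at every other one. At $x = 3$ the gap is $2 \cdot 3 \cdot 5 = 30$.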

We can also invert $f$. To undo $y = 2x + 1$, solve for $x$: subtract $1$, divide by $2$, giving $f^{-1}(y) = (y - 1)/2$. Check: $f^{-1}(f(x)) = (2x + 1 - 1)/2 = x$. Because $f$ is a line with nonzero slope it is one-to-one, so the inverse exists. By contrast $g(x) = x^2$ on all of $\mathbb{R}$ is not invertible: $g(2) = g(-2) = 4$, so $g^{-1}(4)$ is ambiguous. Restrict the domain to $x \ge 0$ and it becomes one-to-one, with inverse $\sqrt{y}$; the domain restriction is what makes the square root a function at all.
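
The same checks can be run numerically. This is a small sketch assuming NumPy is available; the name `f_inv` and the grid choices are illustrative.

```python
import numpy as np


def f(x):
    return 2 * x + 1


def f_inv(y):
    return (y - 1) / 2


def g(x):
    return x ** 2


# f_inv undoes f on a grid of inputs
xs = np.linspace(-3.0, 3.0, 13)
round_trip_ok = np.allclose(f_inv(f(xs)), xs)
print(round_trip_ok)  # True

# g collides on all of R: two different inputs, one output
print(g(2.0) == g(-2.0))  # True

# On the restricted domain x >= 0, the square root undoes g
xs_nonneg = np.linspace(0.0, 3.0, 7)
sqrt_ok = np.allclose(np.sqrt(g(xs_nonneg)), xs_nonneg)
print(sqrt_ok)  # True

# Outside the restricted domain the round trip fails: we get 2.0, not -2.0
print(np.sqrt(g(-2.0)))  # 2.0
```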

ML use case: a model and a loss are just functions

Two functions sit at the heart of every supervised learner, and keeping their inputs straight is half the battle.

The model is a function of the input, with the parameters held fixed: $x \mapsto f(x; \theta)$.

Here $x$ is the data point and $\theta$ (the weights and biases) is a knob-setting you carry along. The semicolon is doing real work: it separates the input $x$ from the parameters $\theta$. At prediction time $\theta$ is frozen and $x$ varies, so the model is a function of $x$.

The loss flips which argument varies. It measures how wrong the predictions are over a fixed dataset, as a function of the parameters:
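
Written out as an average of a per-example loss $\ell$ over the dataset, the loss takes the following form, where $n$ (a symbol assumed here) is the number of examples:

```latex
\[
L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell\bigl( f(x_i; \theta),\; y_i \bigr)
\]
```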

Now the data $(x_i, y_i)$ is frozen and $\theta$ varies, so the loss is a function of $\theta$. Training means finding the $\theta$ that minimizes $L$. This input-swap is the single most important reframing in the course: the same expression $f(x_i; \theta)$ is read as a function of $x$ when predicting and as a function of $\theta$ when learning.

And $L$ is a composition: apply the model $f(\cdot\,; \theta)$, then the per-example loss $\ell$, then average. That layered structure is exactly what the chain rule will let us differentiate, so that we can compute $\partial L / \partial \theta$ and descend. Every idea in this chapter (mapping, domain, composition, invertibility) resurfaces the moment we start training.
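
The input-swap is easy to see in code. The sketch below assumes a hypothetical one-feature linear model and a squared-error per-example loss; the lesson specifies neither, so both are chosen purely for illustration, as is the toy dataset.

```python
import numpy as np


def model(x, theta):
    """Hypothetical linear model f(x; theta) = theta[0] + theta[1] * x."""
    return theta[0] + theta[1] * x


def per_example_loss(prediction, target):
    """Squared error, an assumed choice for the per-example loss."""
    return (prediction - target) ** 2


def loss(theta, xs, ys):
    """L(theta): apply the model, then the per-example loss, then average."""
    return np.mean(per_example_loss(model(xs, theta), ys))


# Toy dataset generated from y = 2x + 1
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = 2.0 * xs + 1.0

# Prediction time: theta is frozen, x varies
theta_fixed = np.array([1.0, 2.0])
print(model(xs, theta_fixed))  # [1. 3. 5. 7.]

# Training time: the data is frozen, theta varies (here only the slope)
slopes = np.linspace(0.0, 4.0, 9)
losses = np.array([loss(np.array([1.0, s]), xs, ys) for s in slopes])
best_slope = slopes[np.argmin(losses)]
print(best_slope)  # 2.0
```

The call `model(xs, theta_fixed)` and the list of `loss(...)` values use the same underlying expression; only which argument is swept changes.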

NumPy implementation

Let us make composition concrete. We implement $f$ and $g$ as ordinary Python functions, evaluate both compositions on a grid built with np.linspace, and confirm numerically that $f(g(x)) \neq g(f(x))$ in general. Run it:

composition.py
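
A minimal sketch of such a script, assuming the $f$ and $g$ from the numerical example; the grid bounds, point count, and variable names are illustrative choices rather than the original listing:

```python
import numpy as np


def f(x):
    return 2 * x + 1


def g(x):
    return x ** 2


# Grid of inputs: -3, -2, ..., 3
x = np.linspace(-3.0, 3.0, 7)

# Both compositions, evaluated on the whole grid at once
g_of_f = g(f(x))  # f runs first: (2x + 1)^2
f_of_g = f(g(x))  # g runs first: 2x^2 + 1

for xi, a, b in zip(x, g_of_f, f_of_g):
    print(f"x={xi:+.1f}  g(f(x))={a:6.1f}  f(g(x))={b:6.1f}")

# The two arrays are not equal as a whole, so the order of composition matters
same_everywhere = np.allclose(g_of_f, f_of_g)
print(same_everywhere)  # False
```

On this grid the two columns coincide only at $x = -2$ and $x = 0$; every other row differs.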

The grid pattern — build inputs with np.linspace, push them through a function, read off the outputs — is how we will visualize every function from here on, including loss curves. Because f and g are written with array-friendly operations (*, +, **), they evaluate on the whole grid at once with no loop: the same vectorized thinking from the previous chapter.

Summary

  • A function $f: X \to Y$ assigns to each input in the domain $X$ exactly one output; the range is the set of outputs actually produced, a subset of the codomain $Y$.
  • The domain is part of the function's identity: $1/x$ is undefined at $0$, and restricting a domain can turn a non-function or a non-invertible map into a valid, invertible one.
  • Composition $(g \circ f)(x) = g(f(x))$ chains functions and is read right-to-left; in general $g \circ f \neq f \circ g$, so order changes the result.
  • A function is invertible exactly when it is one-to-one; then $f^{-1}$ undoes it. $g(x) = x^2$ on $\mathbb{R}$ fails (it maps $2$ and $-2$ to $4$) until the domain is restricted.
  • In ML, the model $f(x; \theta)$ is a function of the input $x$; the loss $L(\theta)$ is a function of the parameters $\theta$. The loss is a composition, which previews the chain rule and layered networks.

Active recall

Answer from memory before checking the lesson:

  1. State the two conditions a rule must satisfy to be a function $f: X \to Y$.
  2. What is the difference between the codomain and the range?
  3. Evaluate $(g \circ f)(2)$ and $(f \circ g)(2)$ for $f(x) = x + 3$ and $g(x) = 2x$. Are they equal?
  4. Why is $g(x) = x^2$ on all of $\mathbb{R}$ not invertible, and how do you fix it?
  5. In $f(x; \theta)$ versus $L(\theta)$, which argument varies at prediction time and which varies at training time?

Exercises

Level A · Recall & basic calculation

Level B · Conceptual understanding

Level C · Derivation & implementation

Level D · Research-thinking challenge