Norms, Distances, and Similarity

Why we need to measure vectors

The last chapter turned every object in machine learning — a word, an image, a user — into a vector. But a list of numbers on its own is inert. The moment a model does anything useful it is measuring: how big is this weight vector, how far is this point from that cluster centroid, how similar is this query embedding to that document embedding. Regularization penalizes size. Nearest-neighbor search ranks by distance. Retrieval ranks by similarity. All three are the same question asked with different rulers.

This chapter installs those rulers. A norm measures the size of a single vector; a distance measures how far apart two vectors are; cosine similarity measures whether two vectors point the same way. They are not interchangeable — choosing L1 versus L2, or distance versus cosine, changes what your model rewards. By the end you will know the three workhorse norms, the one family that contains them, and exactly which one shows up in ridge regression, Lasso, kNN, and embedding search.

Intuition: three ways to walk to the corner store

Stand at the origin and walk to the point $\mathbf{x} = (3, 4)$. How far did you go? The honest answer is "it depends on how you're allowed to move."

If you can fly straight to it, you travel $\sqrt{3^2 + 4^2} = 5$. That is the L2 (Euclidean) length, the ordinary "as the crow flies" distance.
If you must follow a city grid, walking only along axes, you cover $|3| + |4| = 7$ blocks. That is the L1 (Manhattan / taxicab) length.
If someone asks only for the single largest move along any one axis, the answer is $\max(|3|, |4|) = 4$. That is the L∞ (Chebyshev / max) length.

Same vector, three legitimate sizes. Each norm weights the components differently: L2 squares each component, so for a fixed L1 total it is smaller when magnitude is spread across components (compare $(3, 4)$, with L2 length 5, to $(7, 0)$, with L2 length 7); L1 counts total absolute displacement; and L∞ cares only about the single worst component. Those three biases are exactly why they end up in different corners of ML.

Interactive Lab: Vector Playground

Drag $\mathbf{a}$ in the lab and watch its length change. The readout shows the Euclidean (L2) length; hold the geometry in mind as we generalize it below.

Formal definitions

A norm is a function $\lVert\cdot\rVert : \mathbb{R}^n \to \mathbb{R}$ that assigns a non-negative "size" to a vector, is zero only for the zero vector, scales as $\lVert c\mathbf{x}\rVert = |c|\,\lVert\mathbf{x}\rVert$, and obeys the triangle inequality $\lVert\mathbf{a}+\mathbf{b}\rVert \le \lVert\mathbf{a}\rVert + \lVert\mathbf{b}\rVert$. The three norms you will meet daily are the L1 norm $\lVert\mathbf{x}\rVert_1 = \sum_i |x_i|$, the L2 norm $\lVert\mathbf{x}\rVert_2 = \sqrt{\sum_i x_i^2}$, and the L∞ norm $\lVert\mathbf{x}\rVert_\infty = \max_i |x_i|$.
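The defining properties can be spot-checked numerically. The sketch below is only a sanity check on randomly drawn vectors, not a proof; the helper name `check_norm_axioms` and the sample values are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=5)
b = rng.normal(size=5)
c = -2.5  # a negative scalar, so |c| actually matters


def check_norm_axioms(a, b, c, p):
    """Spot-check the norm properties for one order p on the given vectors."""
    def norm(v):
        return np.linalg.norm(v, ord=p)

    # Non-negative, and zero for the zero vector.
    non_negative = norm(a) >= 0 and np.isclose(norm(np.zeros_like(a)), 0.0)
    # Absolute homogeneity: ||c x|| = |c| ||x||.
    homogeneous = np.isclose(norm(c * a), abs(c) * norm(a))
    # Triangle inequality, with a tiny tolerance for rounding.
    triangle = norm(a + b) <= norm(a) + norm(b) + 1e-12
    return bool(non_negative and homogeneous and triangle)


results = {p: check_norm_axioms(a, b, c, p) for p in (1, 2, np.inf)}
print(results)
```

Each entry should come out `True`. A single failure would mean the function is not a norm, while any number of passes only makes the claim plausible.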

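These three are special cases of one family, the Lp norm $\lVert\mathbf{x}\rVert_p = \left(\sum_i |x_i|^p\right)^{1/p}$ for $p \ge 1$. Setting $p = 1$ or $p = 2$ gives the L1 and L2 norms directly, and the limit $p \to \infty$ recovers the max norm.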

Symbol	Meaning	Type	Shape	Role
$\lVert \mathbf{x} \rVert_p$	Lp norm of a vector	scalar	1	variable
$\lVert \mathbf{x} \rVert_1$	L1 norm: sum of absolute values	scalar	1	variable
$\lVert \mathbf{x} \rVert_2$	L2 norm: Euclidean length	scalar	1	variable
$\lVert \mathbf{x} \rVert_\infty$	L∞ norm: largest absolute component	scalar	1	variable
$d(\mathbf{a}, \mathbf{b})$	Euclidean distance between two vectors	scalar	1	variable
$\cos\theta$	Cosine similarity of two vectors, in [-1, 1]	scalar	1	variable
$p$	Order of the norm (p ≥ 1; ∞ allowed)	scalar	1	fixed
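To see the Lp family in action, the sketch below evaluates `np.linalg.norm` on $\mathbf{x} = (3, 4)$ for a handful of orders. The particular orders are arbitrary; the point is that the value never increases with $p$ and settles on the largest absolute component.

```python
import numpy as np

x = np.array([3.0, 4.0])
orders = [1, 2, 4, 16, 64, np.inf]

# One norm value per order p.
values = [float(np.linalg.norm(x, ord=p)) for p in orders]

for p, v in zip(orders, values):
    print(f"p = {p}: {v:.4f}")
```

The sequence starts at 7 (L1), passes through 5 (L2), and approaches 4 (L∞) from above.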

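Distance is not a new idea: it is a norm applied to a difference. The Euclidean distance between two vectors is $d(\mathbf{a}, \mathbf{b}) = \lVert\mathbf{a} - \mathbf{b}\rVert_2 = \sqrt{\sum_i (a_i - b_i)^2}$.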
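As a small illustration, the same difference vector can be measured with any of the three norms, each giving its own notion of distance. The vectors here are made up for the example.

```python
import numpy as np

a = np.array([2.0, 1.0])
b = np.array([5.0, 5.0])
diff = a - b  # (-3, -4)

d_l2 = float(np.linalg.norm(diff, ord=2))         # straight-line distance
d_l1 = float(np.linalg.norm(diff, ord=1))         # city-block distance
d_linf = float(np.linalg.norm(diff, ord=np.inf))  # largest per-axis gap

# The tempting mistake: difference of norms instead of norm of the difference.
wrong = float(np.linalg.norm(a) - np.linalg.norm(b))

print("L2:", d_l2, "L1:", d_l1, "Linf:", d_linf)
print("difference of norms:", round(wrong, 4))
```

The difference of norms comes out negative here, which no distance can be, so it is not merely inaccurate but the wrong kind of quantity.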

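Finally, the measure of direction rather than size: for nonzero vectors, cosine similarity is $\cos\theta = \dfrac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert_2\,\lVert\mathbf{b}\rVert_2}$, where $\theta$ is the angle between $\mathbf{a}$ and $\mathbf{b}$.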

Two vectors are orthogonal exactly when $\mathbf{a}\cdot\mathbf{b} = 0$, equivalently $\cos\theta = 0$, equivalently $\theta = 90°$. Orthogonality is the geometric statement "these two directions share nothing."

A numerical example
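One way to work through all the definitions with concrete numbers is sketched below. The vectors $\mathbf{a} = (3, 4)$, $\mathbf{b} = (4, 3)$, and $\mathbf{c} = (-4, 3)$ are illustrative choices made here, picked so that every step can be checked by hand.

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])
c = np.array([-4.0, 3.0])

# Sizes of a under each norm.
a_l1 = float(np.sum(np.abs(a)))        # |3| + |4| = 7
a_l2 = float(np.sqrt(np.sum(a ** 2)))  # sqrt(9 + 16) = 5
a_linf = float(np.max(np.abs(a)))      # max(3, 4) = 4

# Euclidean distance: L2 norm of a - b = (-1, 1), i.e. sqrt(2).
dist = float(np.linalg.norm(a - b))

# Cosine similarity: a.b = 12 + 12 = 24, both L2 norms are 5, so 24 / 25.
cos_ab = float((a @ b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Orthogonality: a.c = -12 + 12 = 0, so a and c are perpendicular.
dot_ac = float(a @ c)

print(a_l1, a_l2, a_linf)
print(round(dist, 4), round(cos_ab, 4), dot_ac)
```

Note that $\mathbf{a}$ and $\mathbf{b}$ have the same length yet are not the same point: their distance is $\sqrt{2}$ and their cosine is 0.96, close to but not exactly aligned.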

Where cosine comes from

Cosine similarity is not an arbitrary formula — it falls straight out of the geometric form of the dot product from the previous chapter.

Cosine similarity from the dot product

The dot product has two equal expressions: the algebraic sum $\mathbf{a}\cdot\mathbf{b} = \sum_i a_i b_i$, and the geometric form $\mathbf{a}\cdot\mathbf{b} = \lVert\mathbf{a}\rVert_2\,\lVert\mathbf{b}\rVert_2\cos\theta$, where $\theta$ is the angle between the vectors. Assuming both vectors are nonzero, their norms are positive, so we may divide both sides by $\lVert\mathbf{a}\rVert_2\,\lVert\mathbf{b}\rVert_2$: $\cos\theta = \dfrac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert_2\,\lVert\mathbf{b}\rVert_2}$.

Because $-1 \le \cos\theta \le 1$ for any real angle, the right-hand side is automatically bounded to $[-1, 1]$. That fact is worth remembering when a numerical result drifts to $1.0000001$ from floating-point error and needs clamping.

Dividing by the two norms is exactly the step that removes length and leaves pure direction: replacing $\mathbf{a}$ by $c\mathbf{a}$ for any $c > 0$ multiplies the numerator and denominator by the same $c$, leaving $\cos\theta$ unchanged. That scale invariance is the whole reason cosine, not the raw dot product, is used to compare embeddings.
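The clamping remark can be made concrete with a short sketch. The helper `angle_between` is a hypothetical name introduced here; it assumes both inputs are nonzero and uses `np.clip` so that `np.arccos` never sees a value slightly outside $[-1, 1]$.

```python
import numpy as np


def angle_between(u, v):
    """Angle in radians between two nonzero vectors (nonzero is assumed, not checked)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    cos_theta = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Rounding can push cos_theta a hair outside [-1, 1]; arccos would then give nan.
    cos_theta = np.clip(cos_theta, -1.0, 1.0)
    return float(np.arccos(cos_theta))


u = np.array([0.1, 0.2, 0.3])
print(angle_between(u, 3 * u))                # parallel: approximately 0
print(angle_between([1.0, 0.0], [0.0, 1.0]))  # orthogonal: pi / 2
print(angle_between(u, -u))                   # opposite: approximately pi
```

Scaling `u` by 3 leaves the angle at essentially zero, which is the scale invariance from the derivation in numerical form.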

Where each ruler appears in ML

The choice of norm is a modeling decision. Four canonical places:

L2 → ridge regression and weight decay. Adding the penalty $\lambda\lVert\mathbf{w}\rVert_2^2 = \lambda\sum_i w_i^2$ to a loss shrinks all weights smoothly toward zero — this is ridge regression, and the identical term under the name weight decay regularizes almost every neural network. Because squaring punishes large weights hardest, L2 discourages any single weight from dominating but rarely drives one to exactly zero.
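The shrinking effect can be observed directly. The sketch below assumes a plain squared-error loss with no intercept, for which the ridge solution has the standard closed form $\mathbf{w} = (X^\top X + \lambda I)^{-1} X^\top \mathbf{y}$; the synthetic data and the helper name `ridge_fit` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=50)


def ridge_fit(X, y, lam):
    """Closed-form ridge weights for squared-error loss (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


lams = [0.0, 1.0, 10.0, 100.0]
weight_norms = [float(np.linalg.norm(ridge_fit(X, y, lam))) for lam in lams]

for lam, wn in zip(lams, weight_norms):
    print(f"lambda = {lam:>5}: ||w||_2 = {wn:.4f}")

# Even under the heaviest penalty, no weight is exactly zero.
print(ridge_fit(X, y, 100.0))
```

The weight norm falls steadily as $\lambda$ grows, yet every coordinate stays nonzero, matching the claim that L2 shrinks smoothly rather than selecting features.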

L2 → Euclidean k-nearest-neighbors. kNN classifies a point by the labels of its nearest neighbors under $d(\mathbf{a}, \mathbf{b}) = \lVert\mathbf{a} - \mathbf{b}\rVert_2$. The straight-line ruler defines "nearest," so feature scaling matters enormously: a feature measured in thousands will dominate the sum of squares unless standardized.
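A tiny example shows how much scaling can matter. The three points and the query below are invented so that the first feature lives in the thousands and the second in single digits; standardizing with the data's own mean and standard deviation is one common choice, assumed here.

```python
import numpy as np

# Columns: a feature in the thousands, and a feature in single digits.
X = np.array([[1000.0, 1.0],
              [1100.0, 9.0],
              [3000.0, 1.2]])
q = np.array([1150.0, 1.1])


def nearest(points, query):
    """Index of the row closest to the query in Euclidean distance."""
    return int(np.argmin(np.linalg.norm(points - query, axis=1)))


# Raw features: the large-scale column decides almost everything.
raw_idx = nearest(X, q)

# Standardized features: each column has mean 0 and unit standard deviation.
mu, sigma = X.mean(axis=0), X.std(axis=0)
scaled_idx = nearest((X - mu) / sigma, (q - mu) / sigma)

print("nearest (raw):", raw_idx)
print("nearest (standardized):", scaled_idx)
```

On raw features the query's neighbor is row 1, because a gap of 50 in the first column beats a gap of 150 regardless of the second column. After standardization the neighbor becomes row 0, which nearly matches the query on the second feature.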

L1 → Lasso and sparsity. Swapping the penalty to $\lambda\lVert\mathbf{w}\rVert_1 = \lambda\sum_i |w_i|$ gives Lasso regression. The L1 ball has sharp corners on the axes, so the optimum tends to land exactly on them — driving many weights to exactly zero and performing automatic feature selection. When you want a sparse, interpretable model, you reach for L1.
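The exact zeros can be seen in the simplest possible setting: one weight at a time, minimizing $\tfrac{1}{2}(w - z)^2$ plus a penalty. With an L1 penalty $\lambda|w|$ the minimizer is the soft-thresholding rule, and with an L2 penalty $\lambda w^2$ it is a plain rescaling. Both are standard results rather than something stated in this lesson, and the values of `z` and `lam` are arbitrary.

```python
import numpy as np


def soft_threshold(z, lam):
    """Minimizer of 0.5 * (w - z)**2 + lam * |w|, applied elementwise."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)


def l2_shrink(z, lam):
    """Minimizer of 0.5 * (w - z)**2 + lam * w**2, applied elementwise."""
    return z / (1.0 + 2.0 * lam)


z = np.array([3.0, -0.4, 0.2, -2.0])  # unpenalized weights
lam = 0.5

l1_solution = soft_threshold(z, lam)
l2_solution = l2_shrink(z, lam)

print("L1:", l1_solution, "nonzero:", np.count_nonzero(l1_solution))
print("L2:", l2_solution, "nonzero:", np.count_nonzero(l2_solution))
```

Every weight whose magnitude is below `lam` is set to exactly zero by the L1 rule, while the L2 rule halves all four and zeroes none.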

Cosine → retrieval and embeddings. Search engines, recommender systems, and RAG pipelines rank documents by cosine similarity between a query embedding and each document embedding. Direction encodes meaning while magnitude often encodes irrelevant things (document length, word counts), so the scale-invariant ruler is the right one. In practice systems pre-normalize every vector to unit L2 length, after which cosine similarity is just a dot product.
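The pre-normalization trick can be sketched in a few lines. The four "document embeddings" below are toy vectors invented for the example, with deliberately mismatched lengths; `normalize_rows` is a hypothetical helper name.

```python
import numpy as np

docs = np.array([[1.0, 0.0, 0.0],
                 [10.0, 10.0, 0.0],
                 [0.0, 0.0, 5.0],
                 [0.2, 0.2, 0.02]])
query = np.array([1.0, 1.0, 0.0])


def normalize_rows(M):
    """Scale every row to unit L2 length (rows are assumed nonzero)."""
    return M / np.linalg.norm(M, axis=1, keepdims=True)


# Normalize once up front; afterwards one matrix-vector product gives all cosines.
docs_unit = normalize_rows(docs)
query_unit = query / np.linalg.norm(query)
scores = docs_unit @ query_unit

cosine_ranking = np.argsort(-scores)
raw_ranking = np.argsort(-(docs @ query))

print("cosine scores:", np.round(scores, 4))
print("ranking by cosine:", cosine_ranking.tolist())
print("ranking by raw dot product:", raw_ranking.tolist())
```

Document 3 is tiny but points almost exactly along the query, so cosine ranks it second. The raw dot product pushes it below document 0 purely because of its small magnitude.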

NumPy implementation

Let us implement all three norms, Euclidean distance, and cosine similarity from their definitions, then check each against NumPy's built-in np.linalg.norm. The ord argument selects the norm: ord=1, ord=2, and ord=np.inf. Run it:

norms_distances.py

import numpy as np

x = np.array([3.0, -4.0])

# --- Norms from their definitions -----------------------------------------
l1 = np.sum(np.abs(x))                  # sum of absolute values
l2 = np.sqrt(np.sum(x ** 2))            # sqrt of sum of squares
linf = np.max(np.abs(x))               # largest absolute component

# Verify each against np.linalg.norm with the matching ord.
assert np.isclose(l1, np.linalg.norm(x, ord=1))         # 7.0
assert np.isclose(l2, np.linalg.norm(x, ord=2))         # 5.0
assert np.isclose(linf, np.linalg.norm(x, ord=np.inf))  # 4.0
print("norms:", l1, l2, linf)           # 7.0 5.0 4.0

# --- Euclidean distance = L2 norm of the difference -----------------------
a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
dist = np.linalg.norm(a - b, ord=2)     # NOT norm(a) - norm(b)
assert np.isclose(dist, 1.0)
print("distance:", dist)                # 1.0

# --- Cosine similarity ----------------------------------------------------
def cosine(u, v):
    """Cosine similarity of two nonzero vectors (a zero vector would give nan)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Scale invariance: b and 100*b give the same cosine with a.
assert np.isclose(cosine(a, b), cosine(a, 100 * b))
# Orthogonal vectors -> 0; identical direction -> 1.
assert np.isclose(cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0])), 0.0)
assert np.isclose(cosine(a, 2 * a), 1.0)
print("cosine(a,b):", round(float(cosine(a, b)), 4))  # 0.7071
print("ok")

np.linalg.norm is the tool you will actually call; implementing the three definitions by hand once is how you make sure you know what its ord argument means. Note the distance line: it is the norm of the difference, never the difference of the norms.
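In practice vectors usually arrive stacked as rows of a matrix. A brief sketch of how `np.linalg.norm` handles that case: passing `axis=1` computes one vector norm per row, whereas omitting `axis` on a 2-D array makes NumPy treat the input as a matrix (the default is then the Frobenius norm). The matrix `X` is an arbitrary example.

```python
import numpy as np

X = np.array([[3.0, -4.0],
              [1.0, 1.0],
              [0.0, 2.0]])

# One vector norm per row.
row_l1 = np.linalg.norm(X, ord=1, axis=1)
row_l2 = np.linalg.norm(X, axis=1)  # ord defaults to L2 for vectors
row_linf = np.linalg.norm(X, ord=np.inf, axis=1)

# Without axis, the 2-D input is a matrix: this is the Frobenius norm.
frobenius = np.linalg.norm(X)

print("row L1:  ", row_l1)
print("row L2:  ", row_l2)
print("row Linf:", row_linf)
print("Frobenius:", round(float(frobenius), 4))
```

Forgetting `axis` is an easy mistake to miss, because the call still returns a number, just a single one for the whole matrix instead of one per vector.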

Summary

A norm measures the size of one vector. The three you must know: $\lVert\mathbf{x}\rVert_1 = \sum_i |x_i|$ (L1, Manhattan), $\lVert\mathbf{x}\rVert_2 = \sqrt{\sum_i x_i^2}$ (L2, Euclidean, the default), and $\lVert\mathbf{x}\rVert_\infty = \max_i |x_i|$ (L∞, max).
All three are cases of the Lp norm $\left(\sum_i |x_i|^p\right)^{1/p}$; $p \to \infty$ recovers the max norm, and the norm is non-increasing in $p$.
Euclidean distance is the L2 norm of the difference, $d(\mathbf{a},\mathbf{b}) = \lVert\mathbf{a}-\mathbf{b}\rVert_2$ ; every norm induces a distance this way.
Cosine similarity $\dfrac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert_2\,\lVert\mathbf{b}\rVert_2} \in [-1,1]$ measures direction only; it is $0$ for orthogonal vectors and is scale-invariant, which is why it dominates embedding search.
ML mapping: L2 in ridge / weight decay and Euclidean kNN, L1 in Lasso for sparsity, cosine in retrieval and embeddings, L∞ in adversarial budgets.
In NumPy use np.linalg.norm(x, ord=1|2|np.inf); penalize the squared L2 norm in regularizers, and guard the zero vector before dividing for cosine.

Active recall

Answer from memory before checking the lesson:

Write the L1, L2, and L∞ norms of $\mathbf{x} = (6, -8)$. Which is largest?
Euclidean distance between $\mathbf{a}$ and $\mathbf{b}$ equals the norm of what? Why is it wrong to compute $\lVert\mathbf{a}\rVert - \lVert\mathbf{b}\rVert$?
What value of cosine similarity means two vectors are orthogonal? What does $-1$ mean?
Which penalty — L1 or L2 — tends to drive weights to exactly zero, and what is that technique called?
Why do retrieval systems prefer cosine similarity over the raw dot product?

Exercises

Level A: Recall & basic calculation

Level A | Hand calculation | ch12-A1

Compute an L2 norm

Compute the L2 (Euclidean) norm $\lVert\mathbf{x}\rVert_2$ of $\mathbf{x} = (3, 4)$.

Level A | Hand calculation | ch12-A2

Compute an L1 norm

Compute the L1 (Manhattan) norm $\lVert\mathbf{x}\rVert_1$ of $\mathbf{x} = (3, -4)$.

Level A | Hand calculation | ch12-A3

Compute an L∞ norm

Compute the L∞ (max) norm $\lVert\mathbf{x}\rVert_\infty$ of $\mathbf{x} = (-2, 5, -7, 1)$.

Level A | Hand calculation | ch12-A4

Euclidean distance between two points

Compute the Euclidean distance $d(\mathbf{a}, \mathbf{b})$ between $\mathbf{a} = (1, 2)$ and $\mathbf{b} = (4, 6)$.

Level A | Hand calculation | ch12-A5

Cosine of orthogonal vectors

Compute the cosine similarity of $\mathbf{a} = (1, 0)$ and $\mathbf{b} = (0, 3)$.

Level A | Hand calculation | ch12-A6

Cosine of parallel vectors

Compute the cosine similarity of $\mathbf{a} = (2, 1)$ and $\mathbf{b} = (6, 3)$.

Level B: Conceptual understanding

Level B | Equation interpretation | ch12-B1

Ordering of the three norms

For any vector $\mathbf{x}$, which chain of inequalities among its L1, L2, and L∞ norms always holds?

Level B | ML application | ch12-B2

Which penalty produces sparsity?

You want a regression model that sets many weights to exactly zero for feature selection. Do you add an L1 or an L2 penalty to the loss, and what is the resulting method called?

Level B | Equation interpretation | ch12-B3

Distance versus difference of norms

A colleague computes the "distance" between $\mathbf{a} = (3, 0)$ and $\mathbf{b} = (0, 3)$ as $\lVert\mathbf{a}\rVert_2 - \lVert\mathbf{b}\rVert_2$ and gets $0$, concluding the points coincide. In one or two sentences, explain the error and give the correct Euclidean distance.

Level B | ML application | ch12-B4

Why cosine for retrieval

A document embedding $\mathbf{d}$ and the same document repeated twice, embedded as roughly $2\mathbf{d}$, should be judged equally relevant to a query $\mathbf{q}$. Explain in one or two sentences why cosine similarity gives the same score for $\mathbf{d}$ and $2\mathbf{d}$ but the raw dot product does not.

Level B | ML application | ch12-B5

Norm versus squared norm in regularization

Ridge regression penalizes $\lVert\mathbf{w}\rVert_2^2$, not $\lVert\mathbf{w}\rVert_2$. Give the main reason the squared norm is preferred as the penalty term.

Level C: Derivation & implementation

Level C | NumPy implementation | ch12-C1

Implement the general Lp norm

Implement lp_norm(x, p) for a 1-D NumPy array, returning $\left(\sum_i |x_i|^p\right)^{1/p}$, and handle p = np.inf as the max norm. Verify against np.linalg.norm(x, ord=p) for $p = 1, 2, \infty$ on $\mathbf{x} = (3, -4)$, then print ok.

Level C | NumPy implementation | ch12-C2

Cosine similarity with a zero-vector guard

Implement cosine(a, b) returning $\dfrac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert_2\,\lVert\mathbf{b}\rVert_2}$, but return 0.0 if either vector is the zero vector (so it never emits nan). Clamp the result to $[-1, 1]$. Verify it gives 1.0 for parallel vectors, 0.0 for orthogonal ones, and 0.0 for a zero input, then print ok.

Level C | Derivation | ch12-C3

Derive cosine similarity from the dot product

Starting from the geometric form of the dot product, $\mathbf{a}\cdot\mathbf{b} = \lVert\mathbf{a}\rVert_2\,\lVert\mathbf{b}\rVert_2\cos\theta$, derive the cosine similarity formula and explain why the result must lie in $[-1, 1]$ and why it is scale-invariant.

Level D: Research-thinking challenge

Level D | Paper-reading practice | ch12-D1

Why L1 produces sparsity but L2 does not

Ridge (L2) and Lasso (L1) both shrink weights, yet only Lasso drives many of them to exactly zero. Using the geometry of the L1 and L2 unit balls in 2-D, explain why the L1 constraint favors solutions on the axes (sparse) while the L2 constraint does not. Then name one concrete situation where you would deliberately prefer L2 over L1.