Norms, Distances, and Similarity
Measuring size, closeness, and alignment
Prerequisites
Learning objectives
- Define the L1, L2, and L∞ norms and compute each
- Relate Euclidean distance to the L2 norm of a difference
- Compute the angle and cosine similarity between vectors
- Explain where each metric appears (kNN, retrieval, regularization)
Why we need to measure vectors
The last chapter turned every object in machine learning — a word, an image, a user — into a vector. But a list of numbers on its own is inert. The moment a model does anything useful it is measuring: how big is this weight vector, how far is this point from that cluster centroid, how similar is this query embedding to that document embedding. Regularization penalizes size. Nearest-neighbor search ranks by distance. Retrieval ranks by similarity. All three are the same question asked with different rulers.
This chapter installs those rulers. A norm measures the size of a single vector; a distance measures how far apart two vectors are; cosine similarity measures whether two vectors point the same way. They are not interchangeable — choosing L1 versus L2, or distance versus cosine, changes what your model rewards. By the end you will know the three workhorse norms, the one family that contains them, and exactly which one shows up in ridge regression, Lasso, kNN, and embedding search.
Intuition: three ways to walk to the corner store
Stand at the origin and walk to a point $x = (x_1, x_2)$. How far did you go? The honest answer is "it depends on how you're allowed to move."
- If you can fly straight to it, you travel $\sqrt{x_1^2 + x_2^2}$. That is the L2 (Euclidean) length, the ordinary "as the crow flies" distance.
- If you must follow a city grid, walking only along axes, you cover $|x_1| + |x_2|$ blocks. That is the L1 (Manhattan / taxicab) length.
- If someone asks only for the single largest move along any one axis, the answer is $\max(|x_1|, |x_2|)$. That is the L∞ (Chebyshev / max) length.
Same vector, three legitimate sizes. Each norm weights the components differently: L2 rewards spreading magnitude across components, L1 counts total absolute displacement, and L∞ cares only about the single worst component. Those three biases are exactly why they end up in different corners of ML.
Drag the vector in the lab and watch its length change. The readout shows the Euclidean (L2) length; hold the geometry in mind as we generalize it below.
Formal definitions
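A norm is a function that assigns a non-negative "size" to a vector, is zero only for the zero vector, scales as $\lVert \alpha x \rVert = |\alpha| \, \lVert x \rVert$, and obeys the triangle inequality $\lVert x + y \rVert \le \lVert x \rVert + \lVert y \rVert$. The three norms below are the ones you will meet daily.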
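For reference, the standard definitions of the three norms for a vector $x \in \mathbb{R}^n$ can be written as follows (the symbol $x$ and the index $i$ are notation assumed here):

```latex
\begin{aligned}
\lVert x \rVert_1 &= \sum_{i=1}^{n} |x_i| \\
\lVert x \rVert_2 &= \sqrt{\sum_{i=1}^{n} x_i^2} \\
\lVert x \rVert_\infty &= \max_{1 \le i \le n} |x_i|
\end{aligned}
```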
These three are special cases of one family.
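The family in question is the Lp norm, $\lVert x \rVert_p = \left(\sum_i |x_i|^p\right)^{1/p}$ for $p \ge 1$. A quick numerical sketch, using an arbitrary vector chosen here purely for illustration, suggests how the value falls toward the largest absolute component as $p$ grows:

```python
import numpy as np

# Hypothetical example vector, chosen only for illustration.
x = np.array([3.0, -4.0, 1.0])

# For 1-D input, np.linalg.norm accepts any positive ord, plus np.inf for the max norm.
orders = [1, 2, 4, 16, 64, np.inf]
values = [float(np.linalg.norm(x, ord=p)) for p in orders]

for p, v in zip(orders, values):
    print(f"p = {p!s:>4}: {v:.4f}")
```

The printed values should never increase from one row to the next, and the last row equals the largest absolute component of `x`.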
| Symbol | Meaning | Type | Shape | Role |
|---|---|---|---|---|
| $\lVert x \rVert_p$ | Lp norm of a vector | scalar | 1 | variable |
| $\lVert x \rVert_1$ | L1 norm: sum of absolute values | scalar | 1 | variable |
| $\lVert x \rVert_2$ | L2 norm: Euclidean length | scalar | 1 | variable |
| $\lVert x \rVert_\infty$ | L∞ norm: largest absolute component | scalar | 1 | variable |
| $d(a, b)$ | Euclidean distance between two vectors | scalar | 1 | variable |
| $\cos\theta$ | Cosine similarity of two vectors, in [-1, 1] | scalar | 1 | variable |
| $p$ | Order of the norm (p ≥ 1; ∞ allowed) | scalar | 1 | fixed |
Distance is not a new idea — it is a norm applied to a difference.
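Written out, with $a, b \in \mathbb{R}^n$ as assumed names for the two vectors, the Euclidean distance is the L2 norm of their difference:

```latex
d(a, b) = \lVert a - b \rVert_2 = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}
```

The same construction with the L1 or L∞ norm gives the Manhattan and Chebyshev distances.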
Finally, the measure of direction rather than size.
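In the same assumed notation, cosine similarity divides the dot product by the two L2 lengths. It is defined only when neither vector is the zero vector:

```latex
\cos\theta = \frac{a \cdot b}{\lVert a \rVert_2 \, \lVert b \rVert_2},
\qquad -1 \le \cos\theta \le 1
```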
Two vectors are orthogonal exactly when $a \cdot b = 0$, equivalently $\cos\theta = 0$, equivalently $\theta = 90^\circ$. Orthogonality is the geometric statement "these two directions share nothing."
A numerical example
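As one worked case, take the illustrative vectors $a = (3, 4)$ and $b = (4, 3)$ (values assumed here because they make the arithmetic convenient):

```latex
\begin{aligned}
\lVert a \rVert_1 &= |3| + |4| = 7 \\
\lVert a \rVert_2 &= \sqrt{3^2 + 4^2} = 5 \\
\lVert a \rVert_\infty &= \max(|3|, |4|) = 4 \\
d(a, b) &= \lVert (3 - 4,\; 4 - 3) \rVert_2 = \sqrt{(-1)^2 + 1^2} = \sqrt{2} \approx 1.414 \\
\cos\theta &= \frac{3 \cdot 4 + 4 \cdot 3}{5 \cdot 5} = \frac{24}{25} = 0.96
\end{aligned}
```

Here $\lVert a \rVert_2 = \lVert b \rVert_2 = 5$, so the difference of the norms is zero even though the two points are $\sqrt{2}$ apart.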
Where cosine comes from
Cosine similarity is not an arbitrary formula — it falls straight out of the geometric form of the dot product from the previous chapter.
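A sketch of that derivation, assuming both vectors are nonzero and writing $\theta$ for the angle between them:

```latex
\begin{aligned}
a \cdot b &= \lVert a \rVert_2 \, \lVert b \rVert_2 \cos\theta \\
\Longrightarrow \quad \cos\theta &= \frac{a \cdot b}{\lVert a \rVert_2 \, \lVert b \rVert_2} \\[4pt]
|a \cdot b| &\le \lVert a \rVert_2 \, \lVert b \rVert_2
  \quad \text{(Cauchy--Schwarz)}
  \quad \Longrightarrow \quad -1 \le \cos\theta \le 1 \\[4pt]
\frac{(\alpha a) \cdot (\beta b)}{\lVert \alpha a \rVert_2 \, \lVert \beta b \rVert_2}
  &= \frac{\alpha \beta \, (a \cdot b)}{\alpha \beta \, \lVert a \rVert_2 \, \lVert b \rVert_2}
   = \cos\theta \quad \text{for } \alpha, \beta > 0
\end{aligned}
```

The last line is the scale invariance: rescaling either vector by a positive factor cancels between numerator and denominator.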
Where each ruler appears in ML
The choice of norm is a modeling decision. The canonical pairings:
L2 → ridge regression and weight decay. Adding the penalty $\lambda \lVert w \rVert_2^2$ to a loss shrinks all weights smoothly toward zero. This is ridge regression, and the same term, under the name weight decay, regularizes almost every neural network. Because squaring punishes large weights hardest, L2 discourages any single weight from dominating but rarely drives one to exactly zero.
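One way to see the shrinkage numerically is the closed-form ridge solution $w = (X^\top X + \lambda I)^{-1} X^\top y$. The sketch below uses synthetic data and a hypothetical helper `ridge_fit`, neither of which comes from the text above; it is an illustration under those assumptions rather than a reference implementation.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form minimizer of ||X w - y||_2^2 + lam * ||w||_2^2 (hypothetical helper)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Synthetic data, made up for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
true_w = np.array([2.0, -1.0, 0.5, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=50)

lams = [0.0, 1.0, 10.0, 100.0]
weights = [ridge_fit(X, y, lam) for lam in lams]
norms = [float(np.linalg.norm(w)) for w in weights]

for lam, w, n in zip(lams, weights, norms):
    print(f"lam = {lam:6.1f}  ||w||_2 = {n:.4f}  w = {np.round(w, 3)}")
```

As the penalty strength grows, the L2 norm of the fitted weights should fall, while the individual weights get smaller without typically landing exactly on zero.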
L2 → Euclidean k-nearest-neighbors. kNN classifies a point by the labels of its $k$ nearest neighbors under the Euclidean distance $\lVert a - b \rVert_2$. The straight-line ruler defines "nearest," so feature scaling matters enormously: a feature measured in thousands will dominate the sum of squares unless standardized.
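The effect of scale can be sketched with a tiny, made-up dataset in which one feature is measured in dollars and the other in years. The helper name `nearest_index` and all the numbers are assumptions for illustration.

```python
import numpy as np

def nearest_index(X, q):
    """Index of the row of X closest to q under the L2 norm of the difference."""
    return int(np.argmin(np.linalg.norm(X - q, axis=1)))

# Made-up training points: columns are [income in dollars, age in years].
X = np.array([[50500.0, 60.0],
              [52000.0, 26.0],
              [40000.0, 30.0],
              [60000.0, 55.0]])
q = np.array([50000.0, 25.0])  # hypothetical query point

raw_nn = nearest_index(X, q)

# Standardize each feature with the training mean and standard deviation.
mu, sigma = X.mean(axis=0), X.std(axis=0)
scaled_nn = nearest_index((X - mu) / sigma, (q - mu) / sigma)

print("nearest neighbor on raw features:         ", raw_nn)
print("nearest neighbor on standardized features:", scaled_nn)
```

On the raw features the income column decides the neighbor almost by itself; after standardization the age difference counts too, and the nearest neighbor changes.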
L1 → Lasso and sparsity. Swapping the penalty to $\lambda \lVert w \rVert_1$ gives Lasso regression. The L1 ball has sharp corners on the axes, so the optimum tends to land exactly on them, driving many weights to exactly zero and performing automatic feature selection. When you want a sparse, interpretable model, you reach for L1.
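One standard way to see the difference, not derived in the text above, is to compare the proximal steps of the two penalties: the L1 step is soft-thresholding, which clips small weights to zero, while the squared-L2 step only rescales. The helper names and the weight vector below are hypothetical.

```python
import numpy as np

def prox_l1(w, t):
    """Soft-thresholding: argmin_u 0.5 * ||u - w||_2^2 + t * ||u||_1."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def prox_l2_squared(w, t):
    """argmin_u 0.5 * ||u - w||_2^2 + t * ||u||_2^2, which is a uniform rescaling."""
    return w / (1.0 + 2.0 * t)

# Hypothetical weight vector, chosen for illustration.
w = np.array([3.0, -0.4, 0.2, -1.5, 0.05])
t = 0.5

w_l1 = prox_l1(w, t)
w_l2 = prox_l2_squared(w, t)

print("L1 step:", w_l1, "-> exact zeros:", int(np.sum(w_l1 == 0)))
print("L2 step:", w_l2, "-> exact zeros:", int(np.sum(w_l2 == 0)))
```

Every weight whose magnitude is at most `t` is set exactly to zero by the L1 step, whereas the L2 step shrinks all weights by the same factor and leaves none at zero.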
Cosine → retrieval and embeddings. Search engines, recommender systems, and RAG pipelines rank documents by cosine similarity between a query embedding and each document embedding. Direction encodes meaning while magnitude often encodes irrelevant things (document length, word counts), so the scale-invariant ruler is the right one. In practice systems pre-normalize every vector to unit L2 length, after which cosine similarity is just a dot product.
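A minimal sketch of the pre-normalization trick follows, assuming tiny made-up "embeddings" in which the second document is the first one scaled by 2. The helper `l2_normalize` is a hypothetical name, and it assumes no row is the zero vector.

```python
import numpy as np

def l2_normalize(M):
    """Scale each row to unit L2 length (rows are assumed to be nonzero)."""
    return M / np.linalg.norm(M, axis=1, keepdims=True)

# Made-up 3-dimensional "embeddings"; row 1 is row 0 scaled by 2.
docs = np.array([[1.0, 2.0, 2.0],
                 [2.0, 4.0, 4.0],
                 [2.0, -1.0, 0.0]])
query = np.array([1.0, 1.0, 1.0])

docs_unit = l2_normalize(docs)
query_unit = query / np.linalg.norm(query)

cosine_scores = docs_unit @ query_unit  # cosine similarity as one matrix-vector product
raw_dot_scores = docs @ query           # raw dot product, sensitive to magnitude

ranking = np.argsort(-cosine_scores, kind="stable")
print("cosine :", np.round(cosine_scores, 4))
print("raw dot:", raw_dot_scores)
print("ranking:", ranking)
```

The first two documents receive the same cosine score because they point in the same direction, while the raw dot product doubles for the longer one.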
NumPy implementation
Let us implement all three norms, Euclidean distance, and cosine similarity from
their definitions, then check each against NumPy's built-in np.linalg.norm. The
ord argument selects the norm: ord=1, ord=2, and ord=np.inf. Run it:
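A minimal sketch of such a script follows. The function names and the two test vectors are assumptions made here, and the cosine helper assumes neither input is the zero vector.

```python
import numpy as np

def l1_norm(x):
    return np.sum(np.abs(x))          # sum of absolute values

def l2_norm(x):
    return np.sqrt(np.sum(x ** 2))    # square root of the sum of squares

def linf_norm(x):
    return np.max(np.abs(x))          # largest absolute component

def euclidean_distance(a, b):
    return l2_norm(a - b)             # norm of the difference

def cosine_similarity(a, b):
    # Assumes both vectors are nonzero; otherwise this divides by zero.
    return np.dot(a, b) / (l2_norm(a) * l2_norm(b))

# Example vectors chosen here for illustration.
a = np.array([1.0, -2.0, 2.0])
b = np.array([2.0, 1.0, 2.0])

# Check each hand-written norm against NumPy's built-in.
assert np.isclose(l1_norm(a), np.linalg.norm(a, ord=1))
assert np.isclose(l2_norm(a), np.linalg.norm(a, ord=2))
assert np.isclose(linf_norm(a), np.linalg.norm(a, ord=np.inf))
assert np.isclose(euclidean_distance(a, b), np.linalg.norm(a - b))

print("L1:", l1_norm(a), " L2:", l2_norm(a), " Linf:", linf_norm(a))
print("distance:", euclidean_distance(a, b))
print("cosine:", cosine_similarity(a, b))
```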
np.linalg.norm is the tool you will actually call; implementing the three
definitions by hand once is how you make sure you know what its ord argument
means. Note the distance line: it is the norm of the difference, never the
difference of the norms.
Summary
- A norm measures the size of one vector. The three you must know: $\lVert x \rVert_1$ (L1, Manhattan), $\lVert x \rVert_2$ (L2, Euclidean, the default), and $\lVert x \rVert_\infty$ (L∞, max).
- All three are cases of the Lp norm $\lVert x \rVert_p = \left(\sum_i |x_i|^p\right)^{1/p}$; $p \to \infty$ recovers the max norm, and the norm is non-increasing in $p$.
- Euclidean distance is the L2 norm of the difference, $\lVert a - b \rVert_2$; every norm induces a distance this way.
- Cosine similarity measures direction only; it is $0$ for orthogonal vectors and is scale-invariant, which is why it dominates embedding search.
- ML mapping: L2 in ridge / weight decay and Euclidean kNN, L1 in Lasso for sparsity, cosine in retrieval and embeddings, L∞ in adversarial budgets.
- In NumPy use np.linalg.norm(x, ord=1|2|np.inf); penalize the squared L2 norm in regularizers, and guard the zero vector before dividing for cosine.
Active recall
Answer from memory before checking the lesson:
- Write the L1, L2, and L∞ norms of . Which is largest?
- Euclidean distance between $a$ and $b$ equals the norm of what? Why is it wrong to compute $\lVert a \rVert - \lVert b \rVert$?
- What value of cosine similarity means two vectors are orthogonal? What does mean?
- Which penalty — L1 or L2 — tends to drive weights to exactly zero, and what is that technique called?
- Why do retrieval systems prefer cosine similarity over the raw dot product?
Exercises
Level A: Recall & basic calculation
Compute an L2 norm
Compute the L2 (Euclidean) norm of .
Compute an L1 norm
Compute the L1 (Manhattan) norm of .
Compute an L∞ norm
Compute the L∞ (max) norm of .
Euclidean distance between two points
Compute the Euclidean distance between and .
Cosine of orthogonal vectors
Compute the cosine similarity of and .
Cosine of parallel vectors
Compute the cosine similarity of and .
Level B: Conceptual understanding
Ordering of the three norms
For any vector $x$, which chain of inequalities among its L1, L2, and L∞ norms always holds?
Which penalty produces sparsity?
You want a regression model that sets many weights to exactly zero for feature selection. Do you add an L1 or an L2 penalty to the loss, and what is the resulting method called?
Distance versus difference of norms
A colleague computes the 'distance' between and as and gets , concluding the points coincide. In one or two sentences, explain the error and give the correct Euclidean distance.
Why cosine for retrieval
A document embedding $d$ and the same document repeated twice, embedded as roughly $2d$, should be judged equally relevant to a query $q$. Explain in one or two sentences why cosine similarity gives the same score for $d$ and $2d$ but the raw dot product does not.
Norm versus squared norm in regularization
Ridge regression penalizes $\lVert w \rVert_2^2$, not $\lVert w \rVert_2$. Give the main reason the squared norm is preferred as the penalty term.
Level C: Derivation & implementation
Implement the general Lp norm
Implement lp_norm(x, p) for a 1-D NumPy array, returning $\left(\sum_i |x_i|^p\right)^{1/p}$, and handle p = np.inf as the max norm. Verify against np.linalg.norm(x, ord=p) for several values of $p$ (including np.inf) on a test vector, then print ok.
Cosine similarity with a zero-vector guard
Implement cosine(a, b) returning $\frac{a \cdot b}{\lVert a \rVert_2 \, \lVert b \rVert_2}$, but return 0.0 if either vector is the zero vector (so it never emits nan). Clamp the result to $[-1, 1]$. Verify it gives 1.0 for parallel vectors, 0.0 for orthogonal ones, and 0.0 for a zero input, then print ok.
Derive cosine similarity from the dot product
Starting from the geometric form of the dot product, $a \cdot b = \lVert a \rVert_2 \, \lVert b \rVert_2 \cos\theta$, derive the cosine similarity formula and explain why the result must lie in $[-1, 1]$ and why it is scale-invariant.
Level D: Research-thinking challenge
Why L1 produces sparsity but L2 does not
Ridge (L2) and Lasso (L1) both shrink weights, yet only Lasso drives many of them to exactly zero. Using the geometry of the L1 and L2 unit balls in 2-D, explain why the L1 constraint favors solutions on the axes (sparse) while the L2 constraint does not. Then name one concrete situation where you would deliberately prefer L2 over L1.