Part 3 · Core Linear Algebra · Chapter 12 · 65 min

Norms, Distances, and Similarity

Measuring size, closeness, and alignment

Prerequisites

Learning objectives

  • Define the L1, L2, and L∞ norms and compute each
  • Relate Euclidean distance to the L2 norm of a difference
  • Compute the angle and cosine similarity between vectors
  • Explain where each metric appears (kNN, retrieval, regularization)

Why we need to measure vectors

The last chapter turned every object in machine learning — a word, an image, a user — into a vector. But a list of numbers on its own is inert. The moment a model does anything useful it is measuring: how big is this weight vector, how far is this point from that cluster centroid, how similar is this query embedding to that document embedding. Regularization penalizes size. Nearest-neighbor search ranks by distance. Retrieval ranks by similarity. All three are the same question asked with different rulers.

This chapter installs those rulers. A norm measures the size of a single vector; a distance measures how far apart two vectors are; cosine similarity measures whether two vectors point the same way. They are not interchangeable — choosing L1 versus L2, or distance versus cosine, changes what your model rewards. By the end you will know the three workhorse norms, the one family that contains them, and exactly which one shows up in ridge regression, Lasso, kNN, and embedding search.

Intuition: three ways to walk to the corner store

Stand at the origin and walk to the point $\mathbf{x} = (3, 4)$. How far did you go? The honest answer is "it depends on how you're allowed to move."

  • If you can fly straight to it, you travel $\sqrt{3^2 + 4^2} = 5$. That is the L2 (Euclidean) length, the ordinary "as the crow flies" distance.
  • If you must follow a city grid, walking only along axes, you cover $|3| + |4| = 7$ blocks. That is the L1 (Manhattan / taxicab) length.
  • If someone asks only for the single largest move along any one axis, the answer is $\max(|3|, |4|) = 4$. That is the L∞ (Chebyshev / max) length.

Same vector, three legitimate sizes. Each norm weights the components differently: L2 squares each component, so a few large entries dominate; L1 counts total absolute displacement; and L∞ cares only about the single largest component. Those three biases are exactly why they end up in different corners of ML.
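The three walks above can be checked in a few lines. This is a minimal sketch that computes each length directly from its description; the variable names are chosen for illustration only.

```python
import numpy as np

x = np.array([3.0, 4.0])

l2 = np.sqrt(np.sum(x ** 2))   # straight-line ("as the crow flies") length
l1 = np.sum(np.abs(x))         # city-grid length: total movement along the axes
linf = np.max(np.abs(x))       # largest single-axis move

print(l2, l1, linf)  # 5.0 7.0 4.0
```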

Interactive lab: Vector Playground

Drag $\mathbf{a}$ in the lab and watch its length change. The readout shows the Euclidean (L2) length; hold the geometry in mind as we generalize it below.

Formal definitions

A norm is a function $\lVert\cdot\rVert : \mathbb{R}^n \to \mathbb{R}$ that assigns a non-negative "size" to a vector, is zero only for the zero vector, scales as $\lVert c\mathbf{x}\rVert = |c|\,\lVert\mathbf{x}\rVert$, and obeys the triangle inequality $\lVert\mathbf{a}+\mathbf{b}\rVert \le \lVert\mathbf{a}\rVert + \lVert\mathbf{b}\rVert$. The three norms below are the ones you will meet daily.
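Written out for a vector $\mathbf{x} \in \mathbb{R}^n$, in the same notation as the Summary, the three norms are:

```latex
\begin{aligned}
\lVert\mathbf{x}\rVert_1 &= \sum_i |x_i| && \text{(Manhattan)} \\
\lVert\mathbf{x}\rVert_2 &= \sqrt{\sum_i x_i^2} && \text{(Euclidean)} \\
\lVert\mathbf{x}\rVert_\infty &= \max_i |x_i| && \text{(max)}
\end{aligned}
```

For $\mathbf{x} = (3, 4)$ these give $7$, $5$, and $4$, matching the three walks from the intuition section.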

These three are special cases of one family.
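The family in question is the Lp norm, stated here in the form used in the Summary:

```latex
\lVert\mathbf{x}\rVert_p = \left(\sum_i |x_i|^p\right)^{1/p}, \qquad p \ge 1,
\qquad\text{with}\qquad
\lim_{p \to \infty} \lVert\mathbf{x}\rVert_p = \max_i |x_i| = \lVert\mathbf{x}\rVert_\infty .
```

Setting $p = 1$ gives the L1 norm and $p = 2$ gives the L2 norm. For a fixed vector the value never increases as $p$ grows: for $\mathbf{x} = (3, 4)$ it falls from $7$ at $p = 1$ to $5$ at $p = 2$ and approaches $4$ as $p \to \infty$.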

Distance is not a new idea — it is a norm applied to a difference.
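A short sketch of that statement, using two made-up points: subtract first, then take the norm. The last line computes the difference of the norms instead, to show that it is a different quantity.

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

diff = a - b                                  # (-3, -4)
d_l2 = np.linalg.norm(diff)                   # Euclidean distance
d_l1 = np.linalg.norm(diff, ord=1)            # Manhattan distance
d_linf = np.linalg.norm(diff, ord=np.inf)     # Chebyshev distance

# Difference of the norms: not a distance (it is even negative here).
wrong = np.linalg.norm(a) - np.linalg.norm(b)

print(d_l2, d_l1, d_linf)  # 5.0 7.0 4.0
print(wrong)
```

Each norm induces its own distance in this way; the Euclidean one is simply the most common default.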

Finally, the measure of direction rather than size.
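For two nonzero vectors, cosine similarity is the cosine of the angle $\theta$ between them, written in the same form as in the Summary:

```latex
\cos\theta
  = \frac{\mathbf{a}\cdot\mathbf{b}}
         {\lVert\mathbf{a}\rVert_2\,\lVert\mathbf{b}\rVert_2}
  \in [-1, 1],
\qquad \mathbf{a} \ne \mathbf{0},\ \mathbf{b} \ne \mathbf{0}.
```

A value of $1$ means the vectors point the same way, $-1$ means they point in opposite directions, and rescaling either vector by a positive constant leaves the value unchanged.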

Two vectors are orthogonal exactly when $\mathbf{a}\cdot\mathbf{b} = 0$, equivalently $\cos\theta = 0$, equivalently $\theta = 90^\circ$. Orthogonality is the geometric statement "these two directions share nothing."
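As a quick check of the three equivalent statements, here is a sketch with a pair of illustrative vectors, where the second is the first rotated by a quarter turn:

```python
import numpy as np

a = np.array([2.0, 1.0])
b = np.array([-1.0, 2.0])   # a rotated by 90 degrees

dot = np.dot(a, b)          # 2*(-1) + 1*2 = 0
cos_theta = dot / (np.linalg.norm(a) * np.linalg.norm(b))
# clip guards against tiny floating-point overshoot outside [-1, 1]
theta_deg = np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

print(dot, cos_theta, theta_deg)
```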

A numerical example
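As one illustrative worked case (the vectors are chosen here for convenient arithmetic), take $\mathbf{a} = (3, 4)$ and $\mathbf{b} = (4, 3)$:

```latex
\begin{aligned}
\lVert\mathbf{a}\rVert_1 &= |3| + |4| = 7, \qquad
\lVert\mathbf{a}\rVert_2 = \sqrt{3^2 + 4^2} = 5, \qquad
\lVert\mathbf{a}\rVert_\infty = \max(|3|, |4|) = 4 \\
d(\mathbf{a}, \mathbf{b}) &= \lVert\mathbf{a} - \mathbf{b}\rVert_2
  = \lVert(-1, 1)\rVert_2 = \sqrt{2} \approx 1.414 \\
\mathbf{a}\cdot\mathbf{b} &= 3\cdot 4 + 4\cdot 3 = 24, \qquad
\lVert\mathbf{b}\rVert_2 = \sqrt{4^2 + 3^2} = 5 \\
\cos\theta &= \frac{24}{5 \cdot 5} = 0.96, \qquad
\theta = \arccos(0.96) \approx 16.26^\circ
\end{aligned}
```

The two vectors have identical L2 length, sit about $1.414$ apart, and point in nearly the same direction.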

Where cosine comes from

Cosine similarity is not an arbitrary formula — it falls straight out of the geometric form of the dot product from the previous chapter.
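Assuming the geometric form of the dot product, $\mathbf{a}\cdot\mathbf{b} = \lVert\mathbf{a}\rVert_2 \lVert\mathbf{b}\rVert_2 \cos\theta$, the derivation is one division:

```latex
\begin{aligned}
\mathbf{a}\cdot\mathbf{b}
  &= \lVert\mathbf{a}\rVert_2\,\lVert\mathbf{b}\rVert_2 \cos\theta \\
\Longrightarrow\quad
\cos\theta
  &= \frac{\mathbf{a}\cdot\mathbf{b}}
          {\lVert\mathbf{a}\rVert_2\,\lVert\mathbf{b}\rVert_2}
  \qquad (\mathbf{a} \ne \mathbf{0},\ \mathbf{b} \ne \mathbf{0}) \\
|\mathbf{a}\cdot\mathbf{b}|
  &\le \lVert\mathbf{a}\rVert_2\,\lVert\mathbf{b}\rVert_2
  \quad\Longrightarrow\quad -1 \le \cos\theta \le 1 .
\end{aligned}
```

The last line is the Cauchy-Schwarz inequality, which is what guarantees the ratio always lands in $[-1, 1]$ and can therefore be read as a cosine.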

Where each ruler appears in ML

The choice of norm is a modeling decision. Three canonical places:

L2 → ridge regression and weight decay. Adding the penalty $\lambda\lVert\mathbf{w}\rVert_2^2 = \lambda\sum_i w_i^2$ to a loss shrinks all weights smoothly toward zero. This is ridge regression, and essentially the same term, under the name weight decay, regularizes most neural networks. Because squaring punishes large weights hardest, L2 discourages any single weight from dominating but rarely drives one to exactly zero.
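A minimal sketch of the shrinkage effect, assuming a linear model with squared-error loss so that the ridge solution has a closed form. The helper `ridge_fit` and the synthetic data are hypothetical, introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                 # synthetic features
true_w = np.array([2.0, -1.0, 0.5])          # made-up ground truth
y = X @ true_w + 0.1 * rng.normal(size=50)

def ridge_fit(X, y, lam):
    """Minimize ||Xw - y||_2^2 + lam * ||w||_2^2 via the normal equations."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

lams = [0.0, 10.0, 100.0]
weights = [ridge_fit(X, y, lam) for lam in lams]
norms = [float(np.linalg.norm(w)) for w in weights]

for lam, w, n in zip(lams, weights, norms):
    print(f"lambda={lam:6.1f}  w={np.round(w, 3)}  ||w||_2={n:.3f}")
```

As the penalty strength grows, the L2 norm of the fitted weights shrinks, yet every weight stays nonzero.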

L2 → Euclidean k-nearest-neighbors. kNN classifies a point by the labels of its nearest neighbors under $d(\mathbf{a}, \mathbf{b}) = \lVert\mathbf{a} - \mathbf{b}\rVert_2$. The straight-line ruler defines "nearest," so feature scaling matters enormously: a feature measured in thousands will dominate the sum of squares unless standardized.
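The scaling effect can be seen in a toy sketch. The function `knn_predict` and the four training points are invented for illustration: the label depends only on the small-scale second feature, while the first feature is measured in thousands.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=1):
    """Majority vote among the k training points nearest to `query` in L2."""
    dists = np.linalg.norm(X_train - query, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

X_train = np.array([[1000.0, 1.0], [5000.0, 1.2], [1100.0, 9.0], [5100.0, 8.8]])
y_train = np.array([0, 0, 1, 1])     # class follows the second feature
query = np.array([1090.0, 1.1])      # second feature clearly looks like class 0

# Raw features: the large-scale first feature decides who is "nearest".
raw_pred = knn_predict(X_train, y_train, query)

# Standardize each feature using the training mean and standard deviation.
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
scaled_pred = knn_predict((X_train - mean) / std, y_train, (query - mean) / std)

print(raw_pred, scaled_pred)  # 1 0
```

With raw features the nearest neighbor is chosen almost entirely by the first column; after standardization both features contribute on comparable scales.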

L1 → Lasso and sparsity. Swapping the penalty to $\lambda\lVert\mathbf{w}\rVert_1 = \lambda\sum_i |w_i|$ gives Lasso regression. The L1 ball has sharp corners on the axes, so the optimum tends to land exactly on them, driving many weights to exactly zero and performing automatic feature selection. When you want a sparse, interpretable model, you reach for L1.
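One standard way to see the exact zeros is to compare the shrinkage step each penalty induces. The sketch below uses soft-thresholding (the proximal operator of the L1 penalty) next to the corresponding step for the squared L2 penalty. This is a swapped-in illustration rather than a full Lasso solver, and the function names and weight vector are hypothetical.

```python
import numpy as np

def l1_shrink(w, t):
    """Soft-thresholding: argmin_u 0.5*||u - w||^2 + t*||u||_1."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def l2_shrink(w, t):
    """argmin_u 0.5*||u - w||^2 + t*||u||_2^2, a uniform rescaling."""
    return w / (1.0 + 2.0 * t)

w = np.array([3.0, -0.4, 0.2, -2.0, 0.05])
t = 0.5

w_l1 = l1_shrink(w, t)   # entries with |w_i| <= t become exactly zero
w_l2 = l2_shrink(w, t)   # every entry is halved, none reaches zero

print(np.count_nonzero(w_l1), np.count_nonzero(w_l2))  # 2 5
```

The L1 step subtracts a fixed amount from every magnitude and clips at zero, so small weights vanish; the L2 step multiplies by a constant, so small weights only get smaller.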

Cosine → retrieval and embeddings. Search engines, recommender systems, and RAG pipelines commonly rank documents by cosine similarity between a query embedding and each document embedding. Direction encodes meaning while magnitude often encodes irrelevant things (document length, word counts), so the scale-invariant ruler is the right one. In practice, many systems pre-normalize every vector to unit L2 length, after which cosine similarity is just a dot product.
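A small sketch of the pre-normalization trick, with made-up three-dimensional "embeddings" and a hypothetical helper `normalize_rows`. After scaling every row to unit length, one matrix product yields all the cosine similarities.

```python
import numpy as np

def normalize_rows(M, eps=1e-12):
    """Scale each row to unit L2 length; eps guards against all-zero rows."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    return M / np.maximum(norms, eps)

docs = np.array([
    [1.0, 0.0, 1.0],     # doc 0
    [10.0, 0.0, 10.0],   # doc 1: same direction as doc 0, ten times longer
    [0.0, 1.0, 0.0],     # doc 2: orthogonal to the query
    [1.0, 1.0, 0.0],     # doc 3: partially aligned
])
query = np.array([[2.0, 0.0, 2.0]])

raw_scores = (docs @ query.T).ravel()                 # raw dot products
cos_scores = (normalize_rows(docs) @ normalize_rows(query).T).ravel()
ranking = np.argsort(-cos_scores)                     # best match first

print(raw_scores)   # doc 1 scores ten times doc 0 purely because of its length
print(np.round(cos_scores, 3))
```

Under the raw dot product doc 1 outranks doc 0 only because it is longer; under cosine similarity the two tie, since they point the same way.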

NumPy implementation

Let us implement all three norms, Euclidean distance, and cosine similarity from their definitions, then check each against NumPy's built-in np.linalg.norm. The ord argument selects the norm: ord=1, ord=2, and ord=np.inf. Run it:

norms_distances.py
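A sketch of what such a script might look like, assuming two illustrative vectors. The function names are chosen here for readability; only `np.linalg.norm` and its `ord` argument come from NumPy itself.

```python
import numpy as np

def l1_norm(x):
    return np.sum(np.abs(x))

def l2_norm(x):
    return np.sqrt(np.sum(x ** 2))

def linf_norm(x):
    return np.max(np.abs(x))

def euclidean_distance(a, b):
    # Norm of the difference, never the difference of the norms.
    return l2_norm(a - b)

def cosine_similarity(a, b):
    denom = l2_norm(a) * l2_norm(b)
    if denom == 0.0:
        # Direction is undefined for the zero vector, so refuse to divide.
        raise ValueError("cosine similarity is undefined for the zero vector")
    return np.dot(a, b) / denom

a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# Check each hand-written norm against NumPy's built-in.
assert np.isclose(l1_norm(a), np.linalg.norm(a, ord=1))
assert np.isclose(l2_norm(a), np.linalg.norm(a, ord=2))
assert np.isclose(linf_norm(a), np.linalg.norm(a, ord=np.inf))
assert np.isclose(euclidean_distance(a, b), np.linalg.norm(a - b))

print("L1:", l1_norm(a), "L2:", l2_norm(a), "Linf:", linf_norm(a))
print("distance:", euclidean_distance(a, b))
print("cosine:", cosine_similarity(a, b))
```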

np.linalg.norm is the tool you will actually call; implementing the three definitions by hand once is how you make sure you know what its ord argument means. Note the distance line: it is the norm of the difference, never the difference of the norms.

Summary

  • A norm measures the size of one vector. The three you must know: $\lVert\mathbf{x}\rVert_1 = \sum_i |x_i|$ (L1, Manhattan), $\lVert\mathbf{x}\rVert_2 = \sqrt{\sum_i x_i^2}$ (L2, Euclidean, the default), and $\lVert\mathbf{x}\rVert_\infty = \max_i |x_i|$ (L∞, max).
  • All three are cases of the Lp norm $\left(\sum_i |x_i|^p\right)^{1/p}$; $p \to \infty$ recovers the max norm, and the norm is non-increasing in $p$.
  • Euclidean distance is the L2 norm of the difference, $d(\mathbf{a},\mathbf{b}) = \lVert\mathbf{a}-\mathbf{b}\rVert_2$; every norm induces a distance this way.
  • Cosine similarity $\dfrac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert_2\lVert\mathbf{b}\rVert_2} \in [-1,1]$ measures direction only; it is $0$ for orthogonal vectors and is scale-invariant, which is why it dominates embedding search.
  • ML mapping: L2 in ridge / weight decay and Euclidean kNN, L1 in Lasso for sparsity, cosine in retrieval and embeddings, L∞ in adversarial budgets.
  • In NumPy use np.linalg.norm(x, ord=1|2|np.inf); penalize the squared L2 norm in regularizers, and guard the zero vector before dividing for cosine.

Active recall

Answer from memory before checking the lesson:

  1. Write the L1, L2, and L∞ norms of $\mathbf{x} = (6, -8)$. Which is largest?
  2. Euclidean distance between $\mathbf{a}$ and $\mathbf{b}$ equals the norm of what? Why is it wrong to compute $\lVert\mathbf{a}\rVert - \lVert\mathbf{b}\rVert$?
  3. What value of cosine similarity means two vectors are orthogonal? What does $-1$ mean?
  4. Which penalty — L1 or L2 — tends to drive weights to exactly zero, and what is that technique called?
  5. Why do retrieval systems prefer cosine similarity over the raw dot product?

Exercises

Level A: Recall & basic calculation

Level B: Conceptual understanding

Level C: Derivation & implementation

Level D: Research-thinking challenge