Paper-Equation Assessment

Decode four equations of the kind ML papers assume you can read at a glance — MSE, softmax, cosine similarity, and a linear/attention layer — by naming every symbol, its type, and its dimensions, then rewriting, exemplifying, and implementing each.

Below are four equations of the kind that appear, unexplained, in real ML papers. For each equation, produce a short write-up with these six parts:

  1. Symbols & types — name every symbol and state whether it is a scalar, vector, matrix, or index, with its dimensions.
  2. In English — rewrite the equation as one or two plain-English sentences.
  3. Tiny example — plug in small numbers and compute the result by hand.
  4. Pseudocode — a few lines of language-agnostic pseudocode.
  5. NumPy — a vectorized NumPy implementation (a fenced code block).
  6. ML purpose — one or two sentences on where and why this appears in machine learning.

Equation 1 — Mean squared error

$$L(\theta) = \frac{1}{n}\sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2$$

Here $\hat{y}_i = f(x_i; \theta)$ is the model's prediction for example $i$, and $y_i$ is the target.
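As a reference point for the tiny-example and NumPy parts, here is one possible sketch of Equation 1. The function name `mse` and the sample numbers are illustrative assumptions, not part of the assignment.

```python
import numpy as np

def mse(y_hat, y):
    """Mean squared error between predictions and targets, both of shape (n,)."""
    y_hat = np.asarray(y_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    # Elementwise difference, square, then average over the n examples.
    return np.mean((y_hat - y) ** 2)

# Two-example case: the errors are 1 and -2, so the squared errors are 1 and 4.
loss = mse([3.0, 1.0], [2.0, 3.0])  # (1 + 4) / 2 = 2.5
```

The vectorized form replaces the explicit sum over $i$ with `np.mean`, which also supplies the $1/n$ factor.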

Equation 2 — Softmax

$$\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

where $z \in \mathbb{R}^{K}$ is a vector of logits.
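A minimal sketch of Equation 2 for a single logit vector, using the max-subtraction trick. Subtracting a constant from every logit leaves the result unchanged because the factor cancels between numerator and denominator. The function name and test vectors are assumptions chosen for illustration.

```python
import numpy as np

def softmax(z):
    """Softmax of a 1-D logit vector z of shape (K,)."""
    z = np.asarray(z, dtype=float)
    # Shift so the largest logit is 0; this avoids overflow in exp
    # without changing the output.
    shifted = z - np.max(z)
    exp = np.exp(shifted)
    return exp / np.sum(exp)

equal = softmax([0.0, 0.0])        # [0.5, 0.5]
large = softmax([1000.0, 1000.0])  # still [0.5, 0.5]; a naive exp(1000) would overflow
three = softmax([1.0, 2.0, 3.0])   # approximately [0.0900, 0.2447, 0.6652]
```

For a batch of logit vectors, the same idea would need `axis` and `keepdims=True` arguments on `np.max` and `np.sum`; that extension is left out to keep the sketch short.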

Equation 3 — Cosine similarity

$$\cos(\theta) = \frac{u \cdot v}{\lVert u \rVert_2 \, \lVert v \rVert_2}$$

for two nonzero vectors $u, v \in \mathbb{R}^{d}$.
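One way Equation 3 might look in NumPy, assuming both inputs are nonzero 1-D vectors as the equation requires. The function name and example vectors are illustrative.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two nonzero vectors of shape (d,)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    # np.linalg.norm defaults to the Euclidean (L2) norm for 1-D input.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

sim = cosine_similarity([1.0, 0.0], [1.0, 1.0])   # 1 / (1 * sqrt(2)), about 0.7071
orth = cosine_similarity([1.0, 0.0], [0.0, 2.0])  # 0.0: the vectors are orthogonal
```

The sketch does not guard against zero vectors, where the denominator is 0. Code that cannot rule them out often adds a small epsilon to the denominator, which is a design choice outside the equation itself.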

Equation 4 — A linear layer and attention scores

$$h = W x + b, \qquad S = \frac{Q K^\top}{\sqrt{d_k}}$$

where $W \in \mathbb{R}^{m \times d}$, $x \in \mathbb{R}^{d}$, $b \in \mathbb{R}^{m}$, and for attention $Q \in \mathbb{R}^{n \times d_k}$, $K \in \mathbb{R}^{n \times d_k}$.

For Equation 4, pay special attention to the shapes: give the shape of $h$, the shape of $S$, and explain the role of the $\sqrt{d_k}$ scaling.
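A quick way to check shapes for Equation 4 is to build random arrays with the stated dimensions and inspect the results. The sizes below (`m`, `d`, `n`, `d_k`) and the random inputs are arbitrary assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 3, 4      # output and input sizes of the linear layer
n, d_k = 5, 8    # number of tokens and key/query dimension

# Linear layer: (m, d) @ (d,) gives (m,), and adding b of shape (m,) keeps (m,).
W = rng.normal(size=(m, d))
x = rng.normal(size=d)
b = rng.normal(size=m)
h = W @ x + b

# Attention scores: (n, d_k) @ (d_k, n) gives (n, n).
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
S = (Q @ K.T) / np.sqrt(d_k)

print(h.shape)  # (3,)
print(S.shape)  # (5, 5)
```

Entry `S[i, j]` is the scaled dot product of query row `i` with key row `j`, which is why the score matrix is square in the number of tokens.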

Marking rubric

  • Symbols & types — every symbol named with the correct type (scalar/vector/matrix/index) and dimensions; sum indices and their ranges identified.

  • Plain English — each equation restated accurately in words, capturing what is computed and over what.

  • Tiny example — concrete small numbers plugged in and computed correctly (e.g. MSE of a 2-example case, softmax of a length-2 or length-3 logit vector).

  • Pseudocode — correct, readable, language-agnostic steps that match the math.

  • NumPy — vectorized and correct; softmax uses the max-subtraction stability trick; cosine handles the norms correctly.

  • Shapes (Eq. 4): $h$ has shape $(m,)$ and $S$ has shape $(n, n)$; the $\sqrt{d_k}$ scaling is explained as keeping the dot-product variance stable so softmax gradients do not vanish.

  • ML purpose — each equation correctly tied to its use (regression loss, converting logits to probabilities, similarity/retrieval, linear projection and attention).
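The variance claim in the Shapes criterion can be checked empirically. The rough sketch below assumes query and key entries are independent standard normals, an illustrative assumption that the equations themselves do not require. Under it, an unscaled dot product of two $d_k$-dimensional vectors has variance close to $d_k$, and dividing by $\sqrt{d_k}$ brings it back to about 1.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
num_pairs = 20_000  # number of random (query, key) pairs to sample

q = rng.normal(size=(num_pairs, d_k))
k = rng.normal(size=(num_pairs, d_k))

# Row-wise dot products: one raw score per (query, key) pair.
raw = np.sum(q * k, axis=1)
scaled = raw / np.sqrt(d_k)

raw_var = raw.var()        # close to d_k
scaled_var = scaled.var()  # close to 1
```

Without the scaling, scores grow with $d_k$, the softmax output concentrates on one entry, and the gradients through it become very small.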