Common ML Equations · ML Math Foundations

Models & layers

Name	Equation	One-line meaning
Linear model	$\hat{y} = w^\top x + b$	Weighted sum of features plus a bias.
Dense / affine layer	$h = Wx + b$	Weight matrix $W$ maps input $x$ to a new feature vector, then adds bias $b$; `(d_out, d_in) @ (d_in,) = (d_out,)`.
Sigmoid	$\sigma(z) = \dfrac{1}{1 + e^{-z}}$	Squashes a real number into $(0, 1)$, so the output can be read as a probability.
Softmax	$\mathrm{softmax}(z)_i = \dfrac{e^{z_i}}{\sum_j e^{z_j}}$	Turns a score vector into a probability distribution.
Attention scores (shape)	$QK^\top$	`(n, d) @ (d, m) = (n, m)`: score of each query against each key.

Name	Equation	One-line meaning
MSE	$\dfrac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$	Average squared error for regression.
Binary cross-entropy	$-\big[\, y \ln \hat{y} + (1 - y) \ln(1 - \hat{y}) \,\big]$	Per-example loss for a label $y \in \{0, 1\}$ and predicted probability $\hat{y}$; heavily penalizes confident wrong predictions.
L2 / ridge penalty	$\lambda \lVert w \rVert_2^2 = \lambda \sum_j w_j^2$	Shrinks weights toward zero to fight overfitting.
Cosine similarity	$\dfrac{a \cdot b}{\lVert a \rVert_2 \, \lVert b \rVert_2}$	Angle-based similarity in $[-1, 1]$, scale-invariant.
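
A minimal sketch of the four formulas above, again in plain Python with hypothetical function names. The `eps` clamp in `binary_cross_entropy` is an assumption added here so that `log(0)` never occurs; it is not part of the equation in the table. `cosine_similarity` assumes neither vector is all zeros.

```python
import math


def mse(y_hat, y):
    """Mean of squared differences between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(y_hat, y)) / len(y)


def binary_cross_entropy(y_hat, y, eps=1e-12):
    """-[y ln(y_hat) + (1 - y) ln(1 - y_hat)] for a single example."""
    p = min(max(y_hat, eps), 1.0 - eps)  # assumed clamp to keep log finite
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def l2_penalty(w, lam):
    """lambda * sum_j w_j^2."""
    return lam * sum(wj * wj for wj in w)


def cosine_similarity(a, b):
    """(a . b) / (||a||_2 ||b||_2); assumes both norms are nonzero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Rescaling either argument of `cosine_similarity` by a positive constant leaves the result unchanged up to floating-point rounding, which is the scale invariance noted in the table.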

Name	Equation	One-line meaning
Gradient-descent update	$\theta \leftarrow \theta - \eta \, \nabla_\theta L$	Step downhill along the loss gradient with learning rate $\eta$.
Gradient	$\nabla_\theta L = \left( \dfrac{\partial L}{\partial \theta_1}, \ldots, \dfrac{\partial L}{\partial \theta_k} \right)$	Vector of partials; points toward steepest increase of $L$.
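
To make the update rule concrete, here is a small sketch that applies it to a toy loss. Everything specific is an assumption chosen for illustration: the loss $L(\theta) = \sum_i (\theta_i - c_i)^2$, its hand-derived gradient $2(\theta - c)$, the names `grad_quadratic` and `gradient_descent`, and the default learning rate and step count.

```python
def grad_quadratic(theta, target):
    """Gradient of L(theta) = sum_i (theta_i - target_i)^2, i.e. 2 (theta - target)."""
    return [2.0 * (t - c) for t, c in zip(theta, target)]


def gradient_descent(theta, grad_fn, lr=0.1, steps=100):
    """Repeat theta <- theta - lr * grad_fn(theta) for a fixed number of steps."""
    for _ in range(steps):
        g = grad_fn(theta)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta


# Usage: minimize the toy loss with minimum at (3, -1), starting from the origin.
target = [3.0, -1.0]
theta = gradient_descent([0.0, 0.0], lambda th: grad_quadratic(th, target), lr=0.1, steps=200)
```

With this loss and `lr=0.1`, each step shrinks the distance to the minimum by a factor of 0.8, so `theta` ends up very close to `target`. A learning rate above 1.0 would make the same iteration diverge on this loss, which is why $\eta$ usually needs tuning.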