Systems of Linear Equations

Why solving linear systems is a core ML primitive

The moment you fit a model, you are almost always solving a system of linear equations under the hood. Ordinary least-squares regression? Its optimum is the solution of a linear system. The closed-form step of many classical methods (ridge regression, Gaussian processes, the Newton step of an optimizer) reduces to "solve $A\mathbf{x} = \mathbf{b}$." Even when a modern network is trained by gradient descent rather than a direct solve, each update is built on a local linear approximation of the loss.

So the object of this chapter is the humble system of linear equations, and the one algorithm that solves it in full generality: Gaussian elimination. By the end you should be able to write a system in matrix form, run elimination by hand, read off which of the three possible outcomes you are in, and connect all of it to the normal equations that define linear regression.

Intuition: each equation is a constraint, the solution is where they meet

A single linear equation in two unknowns, say $2x + y = 5$, is not one point: it is a whole line of $(x, y)$ pairs that satisfy it. A second equation, $x - y = 1$, is another line. Solving the system

\begin{aligned} 2x + y &= 5 \\ x - y &= 1 \end{aligned}

means finding the points that satisfy both at once, that is, the intersection of the two lines. Two lines in a plane usually cross at exactly one point, and indeed here $x = 2,\ y = 1$ is the unique meeting point.
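As a quick numerical check of that intersection, the two equations can be stacked into a coefficient matrix and a right-hand side and handed to NumPy. This is only a sketch; the variable names `A`, `b`, and `x` are chosen here to match the notation used later in the chapter.

```python
import numpy as np

# Coefficients and constants for
#   2x + y = 5
#    x - y = 1
A = np.array([[2.0, 1.0],
              [1.0, -1.0]])
b = np.array([5.0, 1.0])

x = np.linalg.solve(A, b)   # solves A x = b for a square, non-singular A
print(x)

# Both equations hold at the intersection point (2, 1).
assert np.allclose(A @ x, b)
assert np.allclose(x, [2.0, 1.0])
```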

That geometric picture is the whole story, and it already predicts the three things that can happen. Two lines can cross once (one solution), be the same line (infinitely many solutions), or be parallel but distinct (no solution). In three unknowns each equation is a plane, and we intersect planes instead of lines — but the trichotomy is identical.

Interactive Lab: Linear-System Solver

Drag the equations above and watch the intersection move. Try to make the two lines parallel: the solution point shoots off to infinity and then vanishes — that is the "no solution" case appearing geometrically.

Formal definitions

Symbol	Meaning	Type	Shape	Role
$A$	Coefficient matrix	matrix	m×n	fixed
$\mathbf{x}$	Vector of unknowns (what we solve for)	vector	n×1	variable
$\mathbf{b}$	Right-hand side (the constants)	vector	m×1	fixed
$[A \mid \mathbf{b}]$	Augmented matrix (A with b appended)	matrix	m×(n+1)	derived
$\operatorname{rank}(A)$	Number of independent rows/pivots	integer	1	derived
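The shapes in the table can be checked directly in NumPy. The sketch below uses a hypothetical system with $m = 3$ equations and $n = 2$ unknowns (it is not taken from the lesson) and builds the augmented matrix with `np.column_stack`.

```python
import numpy as np

# Hypothetical system with m = 3 equations and n = 2 unknowns:
#    x + y = 2
#    x - y = 0
#   2x + y = 3
A = np.array([[1.0, 1.0],
              [1.0, -1.0],
              [2.0, 1.0]])           # coefficient matrix, shape (m, n)
b = np.array([2.0, 0.0, 3.0])        # right-hand side, length m

augmented = np.column_stack([A, b])  # [A | b], shape (m, n + 1)

print(A.shape, b.shape, augmented.shape)
print(np.linalg.matrix_rank(A))      # number of independent rows/pivots
```

Note that NumPy stores `b` as a one-dimensional array of length $m$ rather than an explicit $m \times 1$ column; `np.column_stack` treats it as a column when appending.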

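The three row operations are the only moves allowed, and each one produces an equivalent system, with the same solution set as before:

1. Swap two rows.
2. Scale a row by a nonzero constant.
3. Add a multiple of one row to another row.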

The goal of Gaussian elimination is to use operation 3, adding a multiple of one row to another (with occasional swaps), to drive entries below each pivot to zero. This is forward elimination, and it continues until the matrix is upper-triangular (row echelon form). You then solve from the bottom row upward by back-substitution. If you keep going and also clear entries above each pivot and scale every pivot to $1$, you reach the unique reduced row echelon form (RREF), which reads off the solution directly.
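The procedure just described can be sketched in a few lines of NumPy. This is an illustrative implementation for square, non-singular systems only; the function name `gaussian_solve` is hypothetical, and choosing the largest available pivot in each column (partial pivoting) is an assumption added here for numerical safety rather than something the text prescribes.

```python
import numpy as np

def gaussian_solve(A, b):
    """Solve A x = b for square, non-singular A (illustrative sketch)."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    n = A.shape[0]
    M = np.hstack([A, b.reshape(n, 1)])      # augmented matrix [A | b]

    # Forward elimination: zero out the entries below each pivot.
    for k in range(n):
        # Row swap: bring the largest |entry| of column k into the pivot row.
        p = k + int(np.argmax(np.abs(M[k:, k])))
        if np.isclose(M[p, k], 0.0):
            raise ValueError("no usable pivot: matrix is singular")
        if p != k:
            M[[k, p]] = M[[p, k]]
        for i in range(k + 1, n):
            # Add a multiple of the pivot row to row i.
            M[i] -= (M[i, k] / M[k, k]) * M[k]

    # Back-substitution: solve from the last row upward.
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (M[i, n] - M[i, i + 1:n] @ x[i + 1:]) / M[i, i]
    return x

x2 = gaussian_solve([[2, 1], [1, -1]], [5, 1])
print(x2)   # approximately (2, 1)
```

In practice a library routine such as `np.linalg.solve` is preferable; the sketch is only meant to make the forward-elimination and back-substitution loops concrete.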

Worked example: solve a 2×2 system by elimination

forward elimination + back-substitution

Solve

\begin{aligned} 2x + y &= 5 \\ x - y &= 1. \end{aligned}

Write the augmented matrix: $\left[\begin{array}{cc|c} 2 & 1 & 5 \\ 1 & -1 & 1 \end{array}\right].$

Forward elimination. Eliminate $x$ from row 2 using row 1 as the pivot row. The multiplier is $\tfrac{1}{2}$ (row 2's leading entry over the pivot), so $R_2 \to R_2 - \tfrac{1}{2}R_1$: $\left[\begin{array}{cc|c} 2 & 1 & 5 \\ 0 & -\tfrac{3}{2} & -\tfrac{3}{2} \end{array}\right].$ The matrix is now upper-triangular; this is row echelon form.

Back-substitution. The last row says $-\tfrac{3}{2}\,y = -\tfrac{3}{2}$, so $y = 1$. Substitute into row 1: $2x + (1) = 5 \Rightarrow 2x = 4 \Rightarrow x = 2$.

Solution: $\mathbf{x} = (2, 1)$. Check: $2(2) + 1 = 5$ ✓ and $2 - 1 = 1$ ✓. Because we produced exactly one pivot per column of $A$, this system has a unique solution: the single intersection point we saw in the lab.

Continuing to full RREF would scale both pivots to $1$ and clear the entry above the second pivot, ending at $\left[\begin{array}{cc|c} 1 & 0 & 2 \\ 0 & 1 & 1 \end{array}\right],$ whose right column is the solution $(2, 1)$.
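The hand computation can be replayed exactly with Python's `fractions` module, which avoids floating-point rounding. The sketch below stores each row of the augmented matrix as a plain list; the names `R1` and `R2` mirror the row labels used above.

```python
from fractions import Fraction as F

# Augmented matrix [A | b] with exact rational entries.
R1 = [F(2), F(1), F(5)]
R2 = [F(1), F(-1), F(1)]

# Forward elimination: R2 -> R2 - (1/2) R1
m = R2[0] / R1[0]
R2 = [r2 - m * r1 for r1, r2 in zip(R1, R2)]
print(R2)   # row echelon form: second row is [0, -3/2, -3/2]

# Scale each pivot to 1.
R2 = [v / R2[1] for v in R2]
R1 = [v / R1[0] for v in R1]

# Clear the entry above the second pivot: R1 -> R1 - R1[1] * R2
R1 = [r1 - R1[1] * r2 for r1, r2 in zip(R1, R2)]

# RREF: the last column holds the solution (2, 1).
assert R1 == [1, 0, 2]
assert R2 == [0, 1, 1]
```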

The three outcomes, and how rank decides

Every linear system lands in exactly one of three cases. Which case you get is governed by rank, the number of pivots (independent rows): compare the rank of $A$ with the rank of the augmented matrix $[A \mid \mathbf{b}]$ and with the number of unknowns $n$.

Solvability trichotomy

For $A\mathbf{x} = \mathbf{b}$ with $n$ unknowns:

No solution (inconsistent) when $\operatorname{rank}(A) < \operatorname{rank}([A \mid \mathbf{b}])$. Elimination produces a row $[\,0\ \cdots\ 0 \mid c\,]$ with $c \neq 0$, i.e. "$0 = c$", a contradiction.
Exactly one solution when $\operatorname{rank}(A) = \operatorname{rank}([A \mid \mathbf{b}]) = n$. Every unknown has its own pivot; nothing is free.
Infinitely many solutions when $\operatorname{rank}(A) = \operatorname{rank}([A \mid \mathbf{b}]) < n$. There are $n - \operatorname{rank}(A)$ free variables, each a direction you can slide along and stay a solution.
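The three rank conditions translate almost line for line into code. The helper below is a sketch: the name `classify` and its string labels are hypothetical, and `np.linalg.matrix_rank` is a numerical rank (computed with a tolerance), so the result is reliable for small, well-scaled examples like these rather than a proof.

```python
import numpy as np

def classify(A, b):
    """Return 'none', 'unique', or 'infinite' for the system A x = b."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float).reshape(-1, 1)
    n = A.shape[1]                                    # number of unknowns
    rank_A = np.linalg.matrix_rank(A)
    rank_Ab = np.linalg.matrix_rank(np.hstack([A, b]))
    if rank_A < rank_Ab:
        return "none"                                 # inconsistent
    return "unique" if rank_A == n else "infinite"

print(classify([[2, 1], [1, -1]], [5, 1]))   # unique
print(classify([[1, 2], [2, 4]], [3, 6]))    # infinite
print(classify([[1, 2], [2, 4]], [3, 7]))    # none
```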

Geometrically, in two unknowns:

| Outcome | Rank picture | Two lines look like |
| --- | --- | --- |
| Unique | $\operatorname{rank}(A) = n = 2$ | crossing at one point |
| Infinite | $\operatorname{rank}(A) = 1 < 2$, consistent | the same line (coincident) |
| None | inconsistent | parallel but distinct |

A square matrix $A$ ($m = n$) gives a unique solution for every $\mathbf{b}$ exactly when $\operatorname{rank}(A) = n$, equivalently when $A$ is invertible, equivalently when $\det A \neq 0$. When $\det A = 0$ the matrix is singular: depending on $\mathbf{b}$ you get either no solution or infinitely many, never a unique one.
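To see both singular outcomes concretely, the sketch below takes one singular $2 \times 2$ matrix and pairs it with two right-hand sides. The matrix `S` and the vectors `b` and `c` are illustrative choices, and the one-parameter family $(3 - 2t,\ t)$ was worked out by hand for this particular `S`.

```python
import numpy as np

S = np.array([[1.0, 2.0],
              [2.0, 4.0]])           # second row is twice the first
det_S = np.linalg.det(S)
print(det_S)                         # zero (up to rounding)

# Consistent right-hand side: every point (3 - 2t, t) solves S x = b.
b = np.array([3.0, 6.0])
for t in (-1.0, 0.0, 2.5):
    x = np.array([3.0 - 2.0 * t, t])
    assert np.allclose(S @ x, b)

# Inconsistent right-hand side: the ranks of S and [S | c] differ.
c = np.array([3.0, 7.0])
rank_S = np.linalg.matrix_rank(S)
rank_Sc = np.linalg.matrix_rank(np.column_stack([S, c]))
print(rank_S, rank_Sc)               # 1 and 2, so no solution exists
```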

ML use case: the normal equations of least squares

Here is where this pays off. In linear regression we have data $X \in \mathbb{R}^{m \times n}$ ($m$ examples, $n$ features) and targets $\mathbf{y} \in \mathbb{R}^m$, and we want weights $\boldsymbol{\theta}$ minimizing the squared error $\lVert X\boldsymbol{\theta} - \mathbf{y}\rVert^2$. When $m > n$ the system $X\boldsymbol{\theta} = \mathbf{y}$ is overdetermined: it has more equations than unknowns and typically no exact solution. Instead of an exact hit we ask for the best fit, and the minimizer satisfies the normal equations:

X^\top X\,\boldsymbol{\theta} = X^\top \mathbf{y} \tag{9.1}

This is once again a square linear system $A\boldsymbol{\theta} = \mathbf{b}$ with $A = X^\top X$ (an $n \times n$ matrix) and $\mathbf{b} = X^\top \mathbf{y}$, and it is solved by exactly the Gaussian elimination of this chapter. The rank story carries over too: $X^\top X$ is invertible precisely when $X$ has full column rank (independent features). If two features are perfectly collinear, $X^\top X$ is singular, elimination stalls, and the weights are not uniquely determined. That is the linear-algebra fingerprint of multicollinearity. We derive equation 9.1 properly in the regression chapter; for now, notice that "fit a line to data" is "solve a linear system."
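A small numerical sketch makes the connection concrete. Everything about the data here is an assumption for illustration: the synthetic `X`, the generating weights `theta_true`, and the noise level are made up, and `np.linalg.lstsq` is used only as an independent cross-check of the normal-equations solve.

```python
import numpy as np

rng = np.random.default_rng(0)            # synthetic data, illustration only
m, n = 50, 3
X = rng.normal(size=(m, n))
theta_true = np.array([1.5, -2.0, 0.5])   # hypothetical generating weights
y = X @ theta_true + 0.01 * rng.normal(size=m)

# Normal equations: (X^T X) theta = X^T y, a square n x n system.
A = X.T @ X
b = X.T @ y
theta = np.linalg.solve(A, b)
print(theta)

# Cross-check against NumPy's least-squares routine.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(theta, theta_lstsq)

# Perfect collinearity: append a column that is twice the first feature.
X_bad = np.column_stack([X, 2.0 * X[:, 0]])
rank_bad = np.linalg.matrix_rank(X_bad)
print(rank_bad, "of", X_bad.shape[1])     # column rank is deficient
```

With the duplicated feature, `X_bad` no longer has full column rank, so `X_bad.T @ X_bad` is singular and the corresponding weights are not uniquely determined.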

NumPy: solving systems, and what goes wrong

The workhorse is np.linalg.solve(A, b), which runs a pivoted elimination (LU factorization) internally, so there is no need to form the inverse by hand. It solves square, non-singular systems and raises LinAlgError when the matrix is exactly singular. Run this:

solve_systems.py

import numpy as np

# --- A well-posed 3x3 system A x = b -------------------------------------
A = np.array([[2.0, 1.0, -1.0],
              [-3.0, -1.0, 2.0],
              [-2.0, 1.0, 2.0]])
b = np.array([8.0, -11.0, -3.0])

x = np.linalg.solve(A, b)              # pivoted elimination under the hood
print("solution:", np.round(x, 3))     # [ 2.  3. -1.]

# Verify by substitution: A x should reproduce b.
assert np.allclose(A @ x, b), "A x must equal b"

# rank == number of unknowns  =>  unique solution
print("rank(A):", np.linalg.matrix_rank(A), "of", A.shape[1], "unknowns")

# --- A singular system: rows are linearly dependent ----------------------
S = np.array([[1.0, 2.0],
              [2.0, 4.0]])             # row 2 = 2 * row 1  => det = 0
c = np.array([3.0, 7.0])               # inconsistent right-hand side
try:
    np.linalg.solve(S, c)
    print("unexpectedly solved")
except np.linalg.LinAlgError:
    # Singular matrix: no unique solution exists, so solve() refuses.
    print("caught singular: det =", round(float(np.linalg.det(S)), 3))

print("ok")

The first block shows the unique-solution case of the trichotomy in code: matrix_rank(A) equals the number of unknowns, so the solution is unique, and A @ x reproduces b to floating-point tolerance. The second block shows the failure mode you must handle in practice: an exactly singular $A$ makes solve raise, because there is no unique answer to return.
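A subtler trap is a matrix that is nearly, but not exactly, singular: solve returns an answer without complaint, yet that answer is extremely sensitive to small changes in the data. The sketch below uses an assumed perturbation of `1e-10` on one entry and reports `np.linalg.cond(A)` as the warning sign; the specific numbers are illustrative.

```python
import numpy as np

# Nearly singular: the second row is almost, but not exactly, twice the first.
A = np.array([[1.0, 2.0],
              [2.0, 4.0 + 1e-10]])
b = np.array([3.0, 6.0])

x = np.linalg.solve(A, b)               # no exception: A is not exactly singular
print("cond(A):", np.linalg.cond(A))    # very large => ill-conditioned

# A tiny change in b produces a very large change in the solution.
x_perturbed = np.linalg.solve(A, b + np.array([0.0, 1e-6]))
print("x:          ", x)
print("x perturbed:", x_perturbed)
```

A large condition number does not make solve fail; it means the digits of the result should not be trusted without further checks.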

Summary

A system of linear equations is written $A\mathbf{x} = \mathbf{b}$; each row is a constraint, and a solution is a point satisfying all of them at once.
Gaussian elimination uses three row operations on $[A \mid \mathbf{b}]$ (swap, scale, add-a-multiple) to reach row echelon form (forward elimination), then solves bottom-up (back-substitution). Pushing to RREF reads the solution off directly.
Exactly three outcomes are possible (unique, infinitely many, or no solution), corresponding geometrically to lines/planes that cross once, coincide, or are parallel.
Rank decides it: unique iff $\operatorname{rank}(A) = \operatorname{rank}([A \mid \mathbf{b}]) = n$; inconsistent iff the two ranks differ; infinite iff they agree but are $< n$ (giving free variables).
Least-squares regression solves the normal equations $X^\top X \boldsymbol{\theta} = X^\top \mathbf{y}$, a linear system, so "fitting a model" is "solving a system."
In NumPy use np.linalg.solve(A, b); it raises LinAlgError on an exactly singular $A$. Check np.linalg.cond(A) to catch the near-singular, ill-conditioned case, which does not raise.

Active recall

Answer from memory before checking the lesson:

Write the system $3x + 2y = 12,\ x - y = 1$ in matrix form $A\mathbf{x} = \mathbf{b}$. What are $A$ and $\mathbf{b}$?
During elimination a row becomes $[\,0\ 0\ 0 \mid 4\,]$. What does this tell you about the number of solutions, and why?
A system in $n = 3$ unknowns has $\operatorname{rank}(A) = 2$ and is consistent. How many free variables are there, and how many solutions?
Which linear system does linear regression solve, and when is its coefficient matrix $X^\top X$ singular?

Exercises

Level A: Recall & basic calculation

Level A · Hand calculation · ch09-A1

Solve a 2×2 system by elimination

Solve the system

\begin{aligned} 2x + y &= 5 \\ x - y &= 1. \end{aligned}

Enter the solution as x, y.

Level A · Hand calculation · ch09-A2

Another 2×2 solve

Solve

\begin{aligned} x + y &= 4 \\ 2x - y &= 5. \end{aligned}

Enter as x, y.

Level A · Equation interpretation · ch09-A3

Write a system in matrix form

The system $3x + 2y = 12,\ x - y = 1$ is written as $A\mathbf{x} = \mathbf{b}$. What is the right-hand side vector $\mathbf{b}$? Enter as b1, b2.

Level A · Shape reasoning · ch09-A4

Count the free variables

A consistent system in $n = 3$ unknowns has $\operatorname{rank}(A) = 2$. How many free variables does its solution set have?

Level A · Equation interpretation · ch09-A5

Reading an inconsistent row

After elimination, a row of the augmented matrix becomes $[\,0\ 0\ 0 \mid 5\,]$. How many solutions does the system have?

Level A · Equation interpretation · ch09-A6

Determinant and uniqueness

For a square system $A\mathbf{x} = \mathbf{b}$, a unique solution for every $\mathbf{b}$ is guaranteed exactly when $\det A$ is which value?

Level B: Conceptual understanding

Level B · Graph interpretation · ch09-B1

Classify: parallel lines

The two equations of a 2×2 system plot as parallel but distinct lines. How many solutions does the system have?

Level B · Equation interpretation · ch09-B2

Rank decides the outcome

A system has $n = 3$ unknowns with $\operatorname{rank}(A) = \operatorname{rank}([A \mid \mathbf{b}]) = 3$. Which outcome holds?

Level B · Shape reasoning · ch09-B3

Shapes in the normal equations

In least squares the design matrix is $X \in \mathbb{R}^{m \times n}$ with $m$ examples and $n$ features. In the normal equations $X^\top X\,\boldsymbol{\theta} = X^\top \mathbf{y}$, what is the shape of the coefficient matrix $X^\top X$?

Level B · ML application · ch09-B4

When is XᵀX singular?

Explain, in one or two sentences, why $X^\top X$ becomes singular when two feature columns of $X$ are exactly proportional (perfectly collinear), and what this means for the regression weights.

Level C: Derivation & implementation

Level C · NumPy implementation · ch09-C1

Solve a 3×3 system in NumPy

Use np.linalg.solve to solve

\begin{aligned} 2x + y - z &= 8 \\ -3x - y + 2z &= -11 \\ -2x + y + 2z &= -3. \end{aligned}

Verify the solution with np.allclose(A @ x, b), then print ok.

Level C · Derivation · ch09-C2

Elimination to RREF by hand

Solve

\begin{aligned} x + 2y &= 4 \\ 3x + 4y &= 10 \end{aligned}

by Gaussian elimination on the augmented matrix, showing the forward elimination step and back-substitution. Give the final solution.

Level C · NumPy implementation · ch09-C3

Detect a singular system in NumPy

Build the singular system $\begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix}\mathbf{x} = \begin{bmatrix} 3 \\ 7 \end{bmatrix}$. Attempt np.linalg.solve inside a try/except np.linalg.LinAlgError, report the rank of $A$ versus the number of unknowns, and print ok.

Level D: Research-thinking challenge

Level D · Paper-reading practice · ch09-D1

Why not just invert XᵀX?

Textbooks write the least-squares weights as $\boldsymbol{\theta} = (X^\top X)^{-1} X^\top \mathbf{y}$, yet mature libraries never form that inverse; they call a solver like np.linalg.lstsq. Explain (a) why explicitly inverting $X^\top X$ is numerically risky, referencing the condition number, and (b) what regularization (ridge, $X^\top X + \lambda I$) does to solvability and conditioning.
