Neural networks: foundations and mathematics · 02 / 09

Essential linear algebra

Vectors, dot products and matrices: exactly what you need to speak neural-network language.

You saw in the previous chapter that the neuron formula reads, in vector notation, $y = f(\mathbf{w} \cdot \mathbf{x} + b)$. This chapter sets up the mathematical building blocks hiding behind that notation. The goal is for you to read $\mathbf{w} \cdot \mathbf{x}$ without flinching and to understand what it says geometrically.

The vector, an ordered list of numbers

Definition

A vector of dimension $n$ is an ordered list of $n$ real numbers. We typically write it with a bold letter, in parentheses or square brackets:

\mathbf{x} = (x_1, x_2, \dots, x_n) \in \mathbb{R}^n

The symbol $\mathbb{R}^n$ reads “R to the n” and denotes the set of all ordered lists of $n$ real numbers. Each $x_i$ is called the $i$ -th coordinate or component of the vector.

Why vectors in machine learning

Anything you can describe as a list of numbers is a vector. A few examples:

A 28×28 greyscale image of a handwritten digit: a vector of dimension $784$ (each pixel becomes a value between 0 and 1).
A medical patient described by age, blood pressure, glycemia, cholesterol: a vector of dimension 4.
The inputs of a neuron $x_1, x_2, x_3$ seen in chapter 1: a vector of dimension 3.
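
To make the examples above concrete, here is a minimal sketch in plain Python. The image content and the patient values are made up for illustration; only the dimensions (784 and 4) come from the text.

```python
# Hypothetical 28x28 greyscale image: a nested list of pixel intensities in [0, 1].
# Every pixel is 0.0 except one, purely for illustration.
image = [[0.0] * 28 for _ in range(28)]
image[3][5] = 1.0

# Flatten row by row: the image becomes one ordered list of 28 * 28 = 784 numbers.
x = [pixel for row in image for pixel in row]

# A patient described by (age, blood pressure, glycemia, cholesterol).
# The values are invented for the example.
patient = [54.0, 130.0, 5.4, 1.9]

print(len(x), len(patient))
```

Once flattened, the pixel at row 3, column 5 sits at index `3 * 28 + 5` of the vector: the order of the coordinates matters, which is why a vector is an ordered list.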

A neuron’s weights also form a vector, of the same dimension as its inputs. That correspondence is what makes the dot product possible.

The dot product

Definition

The dot product of two vectors $\mathbf{x}, \mathbf{w} \in \mathbb{R}^n$ is the real number:

\mathbf{x} \cdot \mathbf{w} = \sum_{i=1}^{n} x_i w_i = x_1 w_1 + x_2 w_2 + \dots + x_n w_n

It is an operation that takes two vectors and returns a single number. Read it “x dot w” or “dot product of x and w”.

Worked example

Let us reuse the referee example from chapter 1. It is exactly the same weighted sum, written this time in vector notation:

\mathbf{x} = (1, 0, 1), \quad \mathbf{w} = (0.8, 0.5, 0.9)

The dot product reads:

\mathbf{x} \cdot \mathbf{w} = 1 \times 0.8 + 0 \times 0.5 + 1 \times 0.9 = 1.7

That is the neuron’s weighted sum, without the bias. Adding $b = -0.5$ gives $z = 1.7 - 0.5 = 1.2$ , exactly as in the previous chapter.
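
The same computation can be sketched in a few lines of plain Python. The helper name `dot` is an assumption of this sketch, not a library function; the numbers are those of the worked example.

```python
def dot(x, w):
    """Dot product: sum of the coordinate-wise products."""
    if len(x) != len(w):
        raise ValueError("vectors must have the same dimension")
    return sum(xi * wi for xi, wi in zip(x, w))

x = [1, 0, 1]
w = [0.8, 0.5, 0.9]
b = -0.5

weighted_sum = dot(x, w)  # 1.7, up to floating-point rounding
z = weighted_sum + b      # 1.2, up to floating-point rounding

print(round(weighted_sum, 6), round(z, 6))
```

The dimension check mirrors the remark made earlier: the dot product only makes sense when the weights and the inputs have the same number of coordinates.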

See two vectors interacting

The interactive component below draws two vectors $\mathbf{x}$ and $\mathbf{w}$ in the plane. Move the sliders and watch three things at once: the dot product changes, and so do the norms and the angle between the vectors. When the angle approaches $90°$, the dot product drops to zero: the vectors become orthogonal.

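[Interactive widget: two vectors $\mathbf{x}$ and $\mathbf{w}$ in the plane, with sliders for their coordinates. Initial state: $\mathbf{x} = (1.20, 0.60)$, $\mathbf{w} = (0.60, 1.20)$, dot product $\mathbf{x} \cdot \mathbf{w} = 1.44$, norms $\|\mathbf{x}\| = \|\mathbf{w}\| \approx 1.34$, angle $\theta \approx 36.9°$.]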

Play with the coordinates. When the arrows point in the same direction the dot product is maximal. When they are perpendicular it is zero.

Three experiments to try:

Align the two vectors along the same direction (say $\mathbf{x} = \mathbf{w}$ ). For fixed lengths, the dot product hits its maximum, equal to $\|\mathbf{x}\| \cdot \|\mathbf{w}\|$ .
Place them perpendicularly (e.g. $\mathbf{x} = (1, 0)$ and $\mathbf{w} = (0, 1)$ ). The dot product is exactly zero.
Flip the direction of $\mathbf{w}$ (negative $w_1$ and $w_2$ ). The dot product becomes negative, because the arrows point in opposite directions.
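
If you cannot run the interactive component, the three experiments can be reproduced with a short sketch. The helpers `dot` and `norm` are assumptions of this example (the norm is the length of a vector, defined in the next section), and the coordinates are one arbitrary choice among many.

```python
import math

def dot(x, w):
    return sum(xi * wi for xi, wi in zip(x, w))

def norm(x):
    # Length of a vector: square root of its dot product with itself.
    return math.sqrt(dot(x, x))

x = [1.2, 0.6]

aligned = dot(x, x)                  # experiment 1: w = x
perpendicular = dot([1, 0], [0, 1])  # experiment 2: perpendicular vectors
flipped = dot(x, [-0.6, -1.2])       # experiment 3: w = (0.6, 1.2) with both signs flipped

print(aligned, norm(x) * norm(x))  # the two values agree up to rounding
print(perpendicular)               # zero
print(flipped)                     # negative
```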

To go further and prove that the geometric formulation of the dot product is not a definition dropped from the sky but a genuine consequence of the algebraic one, we first need two tools: the norm of a vector and a computational identity about norms. We set them up now, then use them in the Cauchy-Schwarz section that follows.

Norm and distance

The norm (or length) of a vector $\mathbf{x} \in \mathbb{R}^n$ is defined by the dot product with itself:

\|\mathbf{x}\| = \sqrt{\mathbf{x} \cdot \mathbf{x}} = \sqrt{x_1^2 + x_2^2 + \dots + x_n^2}

It generalises the Pythagorean theorem to $n$ dimensions. In 2D, $\|(x_1, x_2)\| = \sqrt{x_1^2 + x_2^2}$ , the length of the hypotenuse of a right triangle.

The distance between two vectors $\mathbf{u}$ and $\mathbf{v}$ is the norm of their difference: $\|\mathbf{u} - \mathbf{v}\|$ . That is how you measure “how similar two images are” when each image is represented as a vector.
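
A minimal sketch of both definitions, in plain Python. The function names `norm` and `distance` and the two tiny three-pixel "images" are assumptions made for the example.

```python
import math

def norm(x):
    """Euclidean norm: square root of the sum of squared coordinates."""
    return math.sqrt(sum(xi * xi for xi in x))

def distance(u, v):
    """Distance between two vectors: norm of their difference."""
    return norm([ui - vi for ui, vi in zip(u, v)])

# Two made-up "images" of three pixels each, differing only in the last pixel.
u = [0.0, 0.5, 1.0]
v = [0.0, 0.5, 0.0]

print(norm([1.2, 0.6]))  # about 1.34
print(distance(u, v))    # the images differ by 1.0 on a single pixel
print(distance(u, u))    # an image is at distance 0 from itself
```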

A useful proof: the squared-norm expansion

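One property comes back constantly in machine learning. We establish it once and for all, straight from the definitions:

\|\mathbf{x} + \mathbf{w}\|^2 = \|\mathbf{x}\|^2 + 2\,\mathbf{x} \cdot \mathbf{w} + \|\mathbf{w}\|^2

Proof. The dot product distributes over addition and is symmetric ( $\mathbf{x} \cdot \mathbf{w} = \mathbf{w} \cdot \mathbf{x}$ ); both properties follow directly from the coordinate formula. Starting from the definition of the norm:

\|\mathbf{x} + \mathbf{w}\|^2 = (\mathbf{x} + \mathbf{w}) \cdot (\mathbf{x} + \mathbf{w}) = \mathbf{x} \cdot \mathbf{x} + \mathbf{x} \cdot \mathbf{w} + \mathbf{w} \cdot \mathbf{x} + \mathbf{w} \cdot \mathbf{w} = \|\mathbf{x}\|^2 + 2\,\mathbf{x} \cdot \mathbf{w} + \|\mathbf{w}\|^2

This identity generalises the Pythagorean theorem. Immediate consequence: if $\mathbf{x}$ and $\mathbf{w}$ are orthogonal ( $\mathbf{x} \cdot \mathbf{w} = 0$ ), the cross term vanishes and $\|\mathbf{x} + \mathbf{w}\|^2 = \|\mathbf{x}\|^2 + \|\mathbf{w}\|^2$ . Pythagoras at its purest, without a triangle or an explicit right angle.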
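
A numerical check is not a proof, but it is a quick way to convince yourself that the identity holds. The sketch below uses arbitrary vectors chosen for the example and a hypothetical helper `dot`.

```python
def dot(x, w):
    return sum(xi * wi for xi, wi in zip(x, w))

def add(x, w):
    return [xi + wi for xi, wi in zip(x, w)]

# General case: arbitrary vectors of dimension 3.
x = [1.0, 2.0, -2.0]
w = [3.0, 0.0, 4.0]
left = dot(add(x, w), add(x, w))              # ||x + w||^2
right = dot(x, x) + 2 * dot(x, w) + dot(w, w)  # ||x||^2 + 2 x.w + ||w||^2

# Orthogonal case: the cross term vanishes and Pythagoras appears.
u = [1.0, 2.0]
v = [-2.0, 1.0]
cross_term = dot(u, v)
pythagoras_left = dot(add(u, v), add(u, v))
pythagoras_right = dot(u, u) + dot(v, v)

print(left, right)
print(cross_term, pythagoras_left, pythagoras_right)
```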

Cauchy-Schwarz, or why the geometric formula is legitimate

Many courses present the formula $\mathbf{x} \cdot \mathbf{w} = \|\mathbf{x}\| \, \|\mathbf{w}\| \cos(\theta)$ as a second definition of the dot product, dropped from the sky. That is not honest. The right reading is the other way round: we define the dot product algebraically (sum of coordinate-wise products), we prove a fundamental inequality, and that inequality is what makes the geometric formula legitimate.

The Cauchy-Schwarz inequality states that for all vectors $\mathbf{x}, \mathbf{w} \in \mathbb{R}^n$ :

|\mathbf{x} \cdot \mathbf{w}| \leq \|\mathbf{x}\| \cdot \|\mathbf{w}\|

Equality holds if and only if the two vectors are collinear (one is a scalar multiple of the other).
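
Before reading the proof, you can test the inequality empirically. The sketch below draws random vectors (the seed, dimensions and ranges are arbitrary choices) and counts violations; a small tolerance is included as an assumption to absorb floating-point rounding.

```python
import math
import random

def dot(x, w):
    return sum(xi * wi for xi, wi in zip(x, w))

def norm(x):
    return math.sqrt(dot(x, x))

random.seed(0)  # fixed seed so the run is reproducible

# Count violations of |x . w| <= ||x|| ||w|| over random vectors.
violations = 0
for _ in range(1000):
    n = random.randint(1, 10)
    x = [random.uniform(-1, 1) for _ in range(n)]
    w = [random.uniform(-1, 1) for _ in range(n)]
    if abs(dot(x, w)) > norm(x) * norm(w) + 1e-12:
        violations += 1

# Equality case: v is a scalar multiple of u.
u = [1.0, -2.0, 0.5]
v = [-3.0 * ui for ui in u]
gap = norm(u) * norm(v) - abs(dot(u, v))

print(violations, gap)
```

No violation should show up, and the gap between the two sides should be zero (up to rounding) when one vector is a multiple of the other.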

Proof: Cauchy-Schwarz by the discriminant

If $\mathbf{w} = \mathbf{0}$ , the inequality is trivial (both sides vanish). We therefore assume $\mathbf{w} \neq \mathbf{0}$ .

Step 1. Introduce the second-degree polynomial in $t \in \mathbb{R}$ :

P(t) = \|\mathbf{x} + t\mathbf{w}\|^2

A squared norm is always non-negative, so $P(t) \geq 0$ for every $t$ .

Step 2. Expand $P(t)$ using the squared-norm expansion proved just above:

P(t) = \|\mathbf{x}\|^2 + 2 t \, (\mathbf{x} \cdot \mathbf{w}) + t^2 \|\mathbf{w}\|^2

This is a second-degree polynomial in $t$ , with leading coefficient $\|\mathbf{w}\|^2 > 0$ .

Step 3. A second-degree polynomial with strictly positive leading coefficient is always non-negative if and only if its discriminant is non-positive (otherwise it would have two distinct real roots and take strictly negative values between them).

The discriminant equals:

\Delta = (2 \, \mathbf{x} \cdot \mathbf{w})^2 - 4 \, \|\mathbf{x}\|^2 \, \|\mathbf{w}\|^2 = 4 \left[ (\mathbf{x} \cdot \mathbf{w})^2 - \|\mathbf{x}\|^2 \, \|\mathbf{w}\|^2 \right]

Step 4. The condition $\Delta \leq 0$ becomes:

(\mathbf{x} \cdot \mathbf{w})^2 \leq \|\mathbf{x}\|^2 \, \|\mathbf{w}\|^2

Taking the square root (both sides are non-negative):

|\mathbf{x} \cdot \mathbf{w}| \leq \|\mathbf{x}\| \, \|\mathbf{w}\|

Result. Cauchy-Schwarz is proved. Equality holds iff $\Delta = 0$ , that is iff $P$ admits a double root $t_0$ . Since $P(t_0) = \|\mathbf{x} + t_0 \mathbf{w}\|^2$ , this happens iff $\mathbf{x} + t_0 \mathbf{w} = \mathbf{0}$ , that is $\mathbf{x} = -t_0 \mathbf{w}$ : in other words, iff $\mathbf{x}$ and $\mathbf{w}$ are collinear. ∎
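
The objects used in the proof can be computed explicitly on an example. The sketch below builds the three coefficients of $P(t)$ for two arbitrary vectors, then checks that the discriminant is non-positive and that $P$ stays non-negative at its minimum. All names are assumptions of this sketch.

```python
def dot(x, w):
    return sum(xi * wi for xi, wi in zip(x, w))

x = [1.0, 2.0]
w = [3.0, -1.0]

# Coefficients of P(t) = a t^2 + b t + c, read off the squared-norm expansion.
a = dot(w, w)      # ||w||^2
b = 2 * dot(x, w)  # 2 (x . w)
c = dot(x, x)      # ||x||^2
discriminant = b * b - 4 * a * c

def P(t):
    """P(t) = ||x + t w||^2, computed directly from the definition."""
    shifted = [xi + t * wi for xi, wi in zip(x, w)]
    return dot(shifted, shifted)

t_min = -b / (2 * a)  # abscissa of the vertex of the parabola

print(discriminant)  # non-positive
print(P(t_min))      # smallest value of P, still non-negative
```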

Now that we have the bound $|\mathbf{x} \cdot \mathbf{w}| \leq \|\mathbf{x}\| \, \|\mathbf{w}\|$ , we can divide safely. For non-zero $\mathbf{x}, \mathbf{w}$ , we define:

\cos\theta \;:=\; \dfrac{\mathbf{x} \cdot \mathbf{w}}{\|\mathbf{x}\| \, \|\mathbf{w}\|}

This quantity lies in $[-1, 1]$ thanks to Cauchy-Schwarz. We can therefore identify it as the cosine of a unique angle $\theta \in [0, \pi]$ . Rearranging gives back exactly the geometric formula:

\mathbf{x} \cdot \mathbf{w} = \|\mathbf{x}\| \, \|\mathbf{w}\| \cos(\theta)

But this formula is no longer a mysterious postulate: it is a direct consequence of the algebraic definition and of Cauchy-Schwarz.

Why this definition matches the Euclidean angle in 2D

We defined $\cos\theta$ by a purely algebraic formula. It remains to check that in dimension 2 this $\theta$ does coincide with the geometric angle between the two vectors.

Step 1. Up to a rotation of the frame (which preserves lengths and angles), we may choose axes such that $\mathbf{x}$ points along the first axis:

\mathbf{x} = (\|\mathbf{x}\|,\ 0)

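Step 2. Let $\alpha \in [0, \pi]$ be the geometric angle between $\mathbf{x}$ and $\mathbf{w}$ . Reflecting across the first axis if necessary (which also preserves lengths and angles), we may assume that $\mathbf{w}$ lies in the upper half-plane. In polar coordinates, $\mathbf{w}$ is then written: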

\mathbf{w} = \big( \, \|\mathbf{w}\| \cos\alpha, \ \|\mathbf{w}\| \sin\alpha \, \big)

Step 3. Compute the dot product directly:

\mathbf{x} \cdot \mathbf{w} = \|\mathbf{x}\| \cdot \|\mathbf{w}\| \cos\alpha + 0 \cdot \|\mathbf{w}\| \sin\alpha = \|\mathbf{x}\| \, \|\mathbf{w}\| \cos\alpha

Step 4. Substitute into the algebraic definition of $\cos\theta$ :

\cos\theta = \dfrac{\|\mathbf{x}\| \, \|\mathbf{w}\| \cos\alpha}{\|\mathbf{x}\| \, \|\mathbf{w}\|} = \cos\alpha

Result. Since $\theta$ and $\alpha$ both lie in $[0, \pi]$ , and $\cos$ is injective on that interval, we conclude $\theta = \alpha$ . The algebraic definition of $\cos\theta$ reproduces exactly the Euclidean geometric angle. ∎
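
The argument can be replayed numerically. In the sketch below, the angle of 50° and the lengths 2 and 3 are arbitrary choices: we build $\mathbf{w}$ from a known geometric angle, then recover that angle from the algebraic formula alone.

```python
import math

def dot(x, w):
    return sum(xi * wi for xi, wi in zip(x, w))

def norm(x):
    return math.sqrt(dot(x, x))

alpha = math.radians(50)  # chosen geometric angle, in [0, pi]
x = [2.0, 0.0]            # x along the first axis, length 2
w = [3.0 * math.cos(alpha), 3.0 * math.sin(alpha)]  # length 3, at angle alpha

# Algebraic definition of the angle, using only dot products and norms.
cos_theta = dot(x, w) / (norm(x) * norm(w))
theta = math.acos(cos_theta)

print(math.degrees(alpha), math.degrees(theta))  # both close to 50
```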

The usual consequences of the geometric formula can be read straight off the cosine function:

If $\theta = 0$ (vectors aligned, same direction), $\cos(\theta) = 1$ : maximum dot product.
If $\theta = 90°$ (vectors perpendicular), $\cos(\theta) = 0$ : zero dot product.
If $\theta = 180°$ (vectors opposite), $\cos(\theta) = -1$ : minimum dot product.

Geometrically, the dot product measures how much two vectors point in the same direction, weighted by their lengths. That is precisely what a neuron wants to know: “do my inputs look like my weights?”

In machine learning, Cauchy-Schwarz also guarantees that you can always normalise a dot product by the lengths to get a measure in $[-1, 1]$ , called cosine similarity. It is the basic tool to compare two vector representations, for example two word or sentence embeddings.
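
A minimal sketch of cosine similarity in plain Python follows. The function name `cosine_similarity` is an assumption of this example, not a reference to a specific library, and the clamping step is a defensive choice against floating-point rounding.

```python
import math

def cosine_similarity(x, w):
    """Dot product normalised by the two norms; lies in [-1, 1]."""
    dot_xw = sum(xi * wi for xi, wi in zip(x, w))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    if norm_x == 0 or norm_w == 0:
        raise ValueError("cosine similarity is undefined for the zero vector")
    # Cauchy-Schwarz guarantees the exact ratio is in [-1, 1];
    # clamping only absorbs rounding errors.
    return max(-1.0, min(1.0, dot_xw / (norm_x * norm_w)))

x = [1.2, 0.6]
w = [0.6, 1.2]

similarity = cosine_similarity(x, w)                 # 1.44 / 1.8 = 0.8
angle_degrees = math.degrees(math.acos(similarity))  # about 36.9

print(round(similarity, 6), round(angle_degrees, 1))
```

Because the result no longer depends on the lengths of the vectors, two embeddings that point in the same direction get a similarity of 1 even if one is much longer than the other.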

Transpose and matrix product

Before stacking neurons into a full layer, we still need two matrix operations that come up everywhere in deep networks: the transpose of a matrix and the product of two matrices. Both build directly on the dot product seen above. For now, read a matrix as a rectangular table of numbers; the section on stacking neurons spells out the notation.

The transpose

The transpose of a matrix $A \in \mathbb{R}^{m \times n}$ , written $A^T$ , is the matrix obtained by swapping its rows and columns. Formally:

A^T \in \mathbb{R}^{n \times m} \quad \text{with} \quad (A^T)_{ij} = A_{ji}

Concrete example with a $2 \times 3$ matrix that becomes $3 \times 2$ :

A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix} \quad \Longrightarrow \quad A^T = \begin{pmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{pmatrix}

The first row of $A$ becomes the first column of $A^T$ , and so on.
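
In code, with a matrix stored as a list of rows (a representation chosen for this sketch), the transpose is a one-liner. The function name `transpose` is an assumption of the example.

```python
def transpose(A):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(column) for column in zip(*A)]

A = [[1, 2, 3],
     [4, 5, 6]]

A_T = transpose(A)
print(A_T)  # [[1, 4], [2, 5], [3, 6]]
```

Transposing twice gives back the original matrix, which is a quick sanity check for any implementation.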

The matrix-matrix product

The matrix product generalises the matrix-vector product (the case $p = 1$ , detailed in the section on stacking neurons). For $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$ , the product $AB$ lives in $\mathbb{R}^{m \times p}$ and has entries:

(AB)_{ij} = \sum_{k=1}^{n} A_{ik} \, B_{kj}

Compatibility condition: the number of columns of $A$ must equal the number of rows of $B$ (both equal $n$ here). Otherwise the product is undefined.

A $2 \times 2$ worked example:

\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix} = \begin{pmatrix} 1 \cdot 0 + 2 \cdot 1 & 1 \cdot 1 + 2 \cdot 1 \\ 3 \cdot 0 + 4 \cdot 1 & 3 \cdot 1 + 4 \cdot 1 \end{pmatrix} = \begin{pmatrix} 2 & 3 \\ 4 & 7 \end{pmatrix}
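
The entry formula translates almost word for word into code. This is a minimal, unoptimised sketch with matrices stored as lists of rows; the name `matmul` is an assumption, and a numerical library would normally be used for real workloads.

```python
def matmul(A, B):
    """Matrix product: entry (i, j) is the sum over k of A[i][k] * B[k][j]."""
    n = len(B)
    if any(len(row) != n for row in A):
        raise ValueError("number of columns of A must equal number of rows of B")
    p = len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)]
            for i in range(len(A))]

A = [[1, 2],
     [3, 4]]
B = [[0, 1],
     [1, 1]]

print(matmul(A, B))  # [[2, 3], [4, 7]]
```

The explicit check at the top enforces the compatibility condition: a mismatch in the inner dimension raises an error instead of silently producing a wrong result.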

The property $(AB)^T = B^T A^T$

A rule that comes back constantly in machine learning: the transpose of a product equals the product of the transposes in reverse order.

(AB)^T = B^T A^T

The proof is short: compare the two matrices entry by entry.

Step 1. Compare entry $(i, j)$ of both matrices. By definition of the transpose:

\big( (AB)^T \big)_{ij} = (AB)_{ji}

Step 2. By definition of the matrix product:

(AB)_{ji} = \sum_{k=1}^{n} A_{jk} \, B_{ki}

Step 3. Inside the sum, recognise the transposed entries: $A_{jk} = (A^T)_{kj}$ and $B_{ki} = (B^T)_{ik}$ . Rewrite:

\sum_{k=1}^{n} A_{jk} \, B_{ki} = \sum_{k=1}^{n} (B^T)_{ik} \, (A^T)_{kj}

Note the swapped order: $B^T$ now comes before $A^T$ and their indices chain correctly (they share the summation index $k$ ).

Step 4. That sum is exactly entry $(i, j)$ of the product $B^T A^T$ :

\sum_{k=1}^{n} (B^T)_{ik} \, (A^T)_{kj} = (B^T A^T)_{ij}

Result. For all $i, j$ , $\big( (AB)^T \big)_{ij} = (B^T A^T)_{ij}$ , so the two matrices are equal: $(AB)^T = B^T A^T$ . ∎
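
To see the rule at work on non-square matrices, where the order visibly matters, here is a small check. The matrices are arbitrary examples and the helpers `transpose` and `matmul` are assumptions of this sketch.

```python
def transpose(A):
    return [list(column) for column in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, column)) for column in zip(*B)]
            for row in A]

A = [[1, 2, 3],
     [4, 5, 6]]   # 2 x 3
B = [[1, 0],
     [0, 1],
     [2, -1]]     # 3 x 2

left = transpose(matmul(A, B))              # (AB)^T, a 2 x 2 matrix
right = matmul(transpose(B), transpose(A))  # B^T A^T, also 2 x 2
wrong_order = matmul(transpose(A), transpose(B))  # A^T B^T is 3 x 3 here

print(left)
print(right)
print(len(wrong_order), len(wrong_order[0]))
```

With these shapes, $A^T B^T$ does not even have the same size as $(AB)^T$ , which makes the need for the reversed order hard to miss.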

With the transpose and the matrix-matrix product in hand, we can finally stack the neurons of a layer into a single matrix and write the whole layer’s output in one go.

Stacking neurons

What is a matrix

A matrix of size $m \times n$ is a rectangular table of numbers arranged in $m$ rows and $n$ columns:

W = \begin{pmatrix} w_{11} & w_{12} & \cdots & w_{1n} \\ w_{21} & w_{22} & \cdots & w_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{m1} & w_{m2} & \cdots & w_{mn} \end{pmatrix} \in \mathbb{R}^{m \times n}

The entry $w_{ji}$ reads “row $j$ , column $i$ ”. Each row, read as a list of $n$ numbers, is itself a vector in $\mathbb{R}^n$ .

From a vector to a full layer

A single layer of $m$ neurons, each receiving $n$ inputs, can be written using a matrix of weights $W \in \mathbb{R}^{m \times n}$ . Each row $\mathbf{w}_j$ of $W$ holds the weights of the $j$ -th neuron.

To compute all the layer’s outputs at once, we use matrix-vector multiplication:

W \mathbf{x} = \begin{pmatrix} \mathbf{w}_1 \cdot \mathbf{x} \\ \mathbf{w}_2 \cdot \mathbf{x} \\ \vdots \\ \mathbf{w}_m \cdot \mathbf{x} \end{pmatrix}

It is an operation that takes a vector of $\mathbb{R}^n$ and returns a vector of $\mathbb{R}^m$ , where each coordinate of the result is a dot product. We will dig into this in chapter 5 when we build a multi-layer network.
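
A minimal sketch of this matrix-vector product, with one dot product per row. The first row of `W` reuses the referee weights from the worked example; the second row is made up, and the function name `layer_weighted_sums` is hypothetical.

```python
def dot(x, w):
    return sum(xi * wi for xi, wi in zip(x, w))

def layer_weighted_sums(W, x):
    """Matrix-vector product W x: one dot product per neuron (row of W)."""
    return [dot(row, x) for row in W]

W = [[0.8, 0.5, 0.9],    # neuron 1: weights of the referee example
     [0.1, -0.4, 0.3]]   # neuron 2: made-up weights
x = [1, 0, 1]

outputs = layer_weighted_sums(W, x)  # two neurons, so two weighted sums
print([round(value, 6) for value in outputs])
```

Biases and activation functions are left out on purpose: the sketch only shows the stacking of dot products described above.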

In one sentence

A vector is an ordered list of $n$ numbers. The dot product of two vectors is the sum of their coordinate-wise products. And a neuron computes exactly that dot product between its inputs and its weights, plus a bias.

On to chapter 3

The neuron does not stop at the dot product: it then applies an activation function $f$ to turn the raw result into something interpretable. Chapter 1 touched on it, and this chapter laid down the mathematical pieces around it. Chapter 3 delivers the missing part.

You will see, with a matrix-based proof, that the absence of non-linearity makes a stack of matrices collapse into a single one. Concretely, if a first layer computes $\mathbf{h} = W_1 \mathbf{x}$ and a second one $\mathbf{y} = W_2 \mathbf{h}$ , then:

\mathbf{y} = W_2 (W_1 \mathbf{x}) = (W_2 W_1) \, \mathbf{x}

Thanks to the matrix-matrix product you just learned, $W_2 W_1$ is a single matrix. Two activation-free layers collapse to one. That is the central theorem of chapter 3, and on its own it justifies the existence of non-linear activation functions (sigmoid, ReLU, tanh) that you will learn to compare and choose between.
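
The collapse can already be observed numerically. In the sketch below, the two weight matrices and the input are arbitrary integers chosen for the example, and the helper names are assumptions.

```python
def matvec(W, x):
    """Matrix-vector product: one dot product per row of W."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in W]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, column)) for column in zip(*B)]
            for row in A]

W1 = [[1, 2],
      [0, 1],
      [-1, 3]]     # first layer: 2 inputs -> 3 hidden values
W2 = [[2, 0, 1],
      [1, -1, 0]]  # second layer: 3 hidden values -> 2 outputs
x = [1, 2]

# Two activation-free layers applied one after the other.
y_two_layers = matvec(W2, matvec(W1, x))

# A single layer whose weights are the product W2 W1.
W = matmul(W2, W1)
y_one_layer = matvec(W, x)

print(y_two_layers, y_one_layer)  # the two results are identical
```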

Exercises

Exercise 1: compute a dot product

Let $\mathbf{x} = (2, -1, 3)$ and $\mathbf{w} = (1, 4, -2)$ . Compute $\mathbf{x} \cdot \mathbf{w}$ .

Exercise 2: norm of a vector

Compute the norm of the vector $\mathbf{u} = (3, 4)$ using the definition.

Exercise 3: dot product and orthogonality

Find a value of $a$ such that the vectors $\mathbf{x} = (a, 1)$ and $\mathbf{w} = (2, 1)$ are orthogonal (zero dot product).

Exercise 4: prove $(AB)^T = B^T A^T$ on a concrete case

Let $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$ and $B = \begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix}$ .

(a) Compute $AB$ .

(b) Compute $(AB)^T$ .

(c) Compute $A^T$ and $B^T$ .

(d) Compute $B^T A^T$ .

(e) Verify that $(AB)^T = B^T A^T$ .

Solution to exercise 1: compute a dot product

We have $\mathbf{x} = (2, -1, 3)$ and $\mathbf{w} = (1, 4, -2)$ .

Step 1. Write the general formula for a 3-dimensional dot product:

\mathbf{x} \cdot \mathbf{w} = x_1 w_1 + x_2 w_2 + x_3 w_3

Step 2. Substitute the coordinates:

\mathbf{x} \cdot \mathbf{w} = 2 \times 1 + (-1) \times 4 + 3 \times (-2)

Step 3. Compute each product separately:

2 \times 1 = 2

(-1) \times 4 = -4

3 \times (-2) = -6

Step 4. Sum the three results:

\mathbf{x} \cdot \mathbf{w} = 2 + (-4) + (-6) = 2 - 4 - 6 = -8

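Result. $\mathbf{x} \cdot \mathbf{w} = -8$ .

Solution to exercise 2: norm of a vector

We have $\mathbf{u} = (3, 4)$ .

Step 1. Write the definition of the norm in dimension 2:

\|\mathbf{u}\| = \sqrt{u_1^2 + u_2^2}

Step 2. Substitute the coordinates:

\|\mathbf{u}\| = \sqrt{3^2 + 4^2} = \sqrt{9 + 16} = \sqrt{25}

Step 3. Take the square root:

\|\mathbf{u}\| = 5

Result. $\|\mathbf{u}\| = 5$ .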

Solution to exercise 3: dot product and orthogonality

We have $\mathbf{x} = (a, 1)$ and $\mathbf{w} = (2, 1)$ . We want $a$ such that $\mathbf{x} \cdot \mathbf{w} = 0$ .

Step 1. Express the dot product as a function of $a$ :

\mathbf{x} \cdot \mathbf{w} = a \times 2 + 1 \times 1 = 2a + 1

Step 2. Set the orthogonality equation:

2a + 1 = 0

Step 3. Solve for $a$ :

2a = -1

a = -\dfrac{1}{2}

Step 4. Verify with $\mathbf{x} = (-\tfrac{1}{2}, 1)$ and $\mathbf{w} = (2, 1)$ :

\mathbf{x} \cdot \mathbf{w} = \left(-\dfrac{1}{2}\right) \times 2 + 1 \times 1 = -1 + 1 = 0

Result. $a = -\dfrac{1}{2}$ . The two vectors are then orthogonal. ✓

Solution to exercise 4: prove $(AB)^T = B^T A^T$ on a concrete case

Given: $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$ , $B = \begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix}$ .

Step 1. Compute $AB$ . Each entry is a dot product of a row of $A$ with a column of $B$ .

(AB)_{11} = 1 \times 0 + 2 \times 1 = 2

(AB)_{12} = 1 \times 1 + 2 \times 1 = 3

(AB)_{21} = 3 \times 0 + 4 \times 1 = 4

(AB)_{22} = 3 \times 1 + 4 \times 1 = 7

So:

AB = \begin{pmatrix} 2 & 3 \\ 4 & 7 \end{pmatrix}

Step 2. Compute $(AB)^T$ by swapping rows and columns:

(AB)^T = \begin{pmatrix} 2 & 4 \\ 3 & 7 \end{pmatrix}

Step 3. Compute $A^T$ and $B^T$ :

A^T = \begin{pmatrix} 1 & 3 \\ 2 & 4 \end{pmatrix}, \qquad B^T = \begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix}

Note that $B$ is symmetric: $B^T = B$ .

Step 4. Compute $B^T A^T$ entry by entry:

(B^T A^T)_{11} = 0 \times 1 + 1 \times 2 = 2

(B^T A^T)_{12} = 0 \times 3 + 1 \times 4 = 4

(B^T A^T)_{21} = 1 \times 1 + 1 \times 2 = 3

(B^T A^T)_{22} = 1 \times 3 + 1 \times 4 = 7

So:

B^T A^T = \begin{pmatrix} 2 & 4 \\ 3 & 7 \end{pmatrix}

Step 5. Comparison.

(AB)^T = \begin{pmatrix} 2 & 4 \\ 3 & 7 \end{pmatrix} = B^T A^T

Result. The identity $(AB)^T = B^T A^T$ is verified on this concrete case. ✓ Note that computing $A^T B^T$ (wrong order) would yield a different matrix: the swapped order is what matters.

