Neural networks: foundations and mathematics · 02 / 09

Essential linear algebra

Vectors, dot products and matrices: exactly what you need to speak neural-network language.

You saw in the previous chapter that the neuron formula reads in vector notation as $y = f(\mathbf{w} \cdot \mathbf{x} + b)$. This chapter sets up the mathematical bricks hiding behind that notation. The goal: that you read $\mathbf{w} \cdot \mathbf{x}$ without flinching and that you understand what it says geometrically.

The vector, an ordered list of numbers

Definition

A vector of dimension $n$ is an ordered list of $n$ real numbers. We typically write it with a bold letter, in parentheses or square brackets:

$$\mathbf{x} = (x_1, x_2, \dots, x_n) \in \mathbb{R}^n$$

The symbol $\mathbb{R}^n$ reads “R to the n” and denotes the set of all ordered lists of $n$ real numbers. Each $x_i$ is called the $i$-th coordinate or component of the vector.

Why vectors in machine learning

Anything you can describe as a list of numbers is a vector. A few examples:

  • A 28×28 greyscale image of a handwritten digit: a vector of dimension $28 \times 28 = 784$ (each pixel becomes a value between 0 and 1).
  • A medical patient described by age, blood pressure, glycemia, cholesterol: a vector of dimension 4.
  • The inputs of a neuron $x_1, x_2, x_3$ seen in chapter 1: a vector of dimension 3.

A neuron’s weights also form a vector, of the same dimension as its inputs. That correspondence is what makes the dot product possible.

The dot product

Definition

The dot product of two vectors $\mathbf{x}, \mathbf{w} \in \mathbb{R}^n$ is the real number:

$$\mathbf{x} \cdot \mathbf{w} = \sum_{i=1}^{n} x_i w_i = x_1 w_1 + x_2 w_2 + \dots + x_n w_n$$

It is an operation that takes two vectors and returns a single number. Read it “x dot w” or “dot product of x and w”.

Worked example

Let us reuse the referee example. It is exactly the same weighted sum as in chapter 1, written this time in vector notation:

$$\mathbf{x} = (1, 0, 1), \quad \mathbf{w} = (0.8, 0.5, 0.9)$$

The dot product reads:

$$\mathbf{x} \cdot \mathbf{w} = 1 \times 0.8 + 0 \times 0.5 + 1 \times 0.9 = 1.7$$

That is the neuron’s weighted sum, without the bias. Adding $b = -0.5$ gives $z = 1.7 - 0.5 = 1.2$, exactly as in the previous chapter.
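The worked example can be checked with a few lines of plain Python. This is a minimal sketch: the helper name `dot` is an illustrative choice, not something defined by the course.

```python
def dot(x, w):
    """Dot product: sum of coordinate-wise products of two equal-length vectors."""
    if len(x) != len(w):
        raise ValueError("vectors must have the same dimension")
    return sum(xi * wi for xi, wi in zip(x, w))


# Values taken from the referee example.
x = [1, 0, 1]
w = [0.8, 0.5, 0.9]
b = -0.5

weighted_sum = dot(x, w)  # 1*0.8 + 0*0.5 + 1*0.9
z = weighted_sum + b      # weighted sum plus bias

# Round for display: floating-point sums may differ from 1.7 in the last digits.
print(round(weighted_sum, 6), round(z, 6))
```

The length check mirrors the requirement that inputs and weights share the same dimension; without it, `zip` would silently truncate the longer vector.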

See two vectors interacting

The component below draws two vectors $\mathbf{x}$ and $\mathbf{w}$ in the plane. Move the sliders and watch three things at once: the dot product changes, but so do the norms and the angle between them. When the angle gets close to $90°$, the dot product drops to zero: the vectors become orthogonal (two vectors are orthogonal when their dot product is zero).

[Interactive figure: the vectors x and w drawn in the x-y plane. Example readout: dot product x·w = 1.44, norms ‖x‖ = 1.34 and ‖w‖ = 1.34, angle θ = 36.9°.]

Play with the coordinates. When the arrows point in the same direction the dot product is maximal. When they are perpendicular it is zero.

Three experiments to try:

  • Align the two vectors on the same direction (say $\mathbf{x} = \mathbf{w}$). The dot product hits its maximum, equal to $\|\mathbf{x}\| \cdot \|\mathbf{w}\|$.
  • Place them perpendicularly (e.g. $\mathbf{x} = (1, 0)$ and $\mathbf{w} = (0, 1)$). The dot product is exactly zero.
  • Flip the direction of $\mathbf{w}$ (negative $w_1$ and $w_2$). The dot product becomes negative, because the arrows point in opposite directions.

To go further and prove that the geometric formulation of the dot product is not a definition dropped from the sky but a genuine consequence of the algebraic one, we first need two tools: the norm of a vector and a computational identity about norms. We set them up now, then use them in the Cauchy-Schwarz section that follows.

Norm and distance

The norm (or length) of a vector $\mathbf{x} \in \mathbb{R}^n$ is defined by the dot product with itself:

$$\|\mathbf{x}\| = \sqrt{\mathbf{x} \cdot \mathbf{x}} = \sqrt{x_1^2 + x_2^2 + \dots + x_n^2}$$

It generalises the Pythagorean theorem to $n$ dimensions. In 2D, $\|(x_1, x_2)\| = \sqrt{x_1^2 + x_2^2}$, the length of the hypotenuse of a right triangle.

The distance between two vectors $\mathbf{u}$ and $\mathbf{v}$ is the norm of their difference: $\|\mathbf{u} - \mathbf{v}\|$. That is how you measure “how similar two images are” when each image is represented as a vector.
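Both definitions translate directly into code. The sketch below uses only the standard library; the function names `norm` and `distance` and the sample vectors are illustrative assumptions.

```python
import math


def norm(x):
    """Euclidean norm: square root of the dot product of x with itself."""
    return math.sqrt(sum(xi * xi for xi in x))


def distance(u, v):
    """Distance between u and v: the norm of their difference."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return norm([ui - vi for ui, vi in zip(u, v)])


length = norm([5.0, 12.0])               # sqrt(25 + 144)
gap = distance([1.0, 1.0], [7.0, 9.0])   # norm of (-6, -8)
print(length, gap)
```

A small distance means the two vectors are close coordinate by coordinate, which is the sense in which two images represented as vectors can be called similar.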

A useful proof: the squared-norm expansion

One property comes back constantly in machine learning. We establish it once and for all, straight from the definitions:

$$\|\mathbf{x} + \mathbf{w}\|^2 = \|\mathbf{x}\|^2 + 2\,\mathbf{x} \cdot \mathbf{w} + \|\mathbf{w}\|^2$$

This identity is exactly the generalised Pythagorean theorem. Immediate consequence: if $\mathbf{x}$ and $\mathbf{w}$ are orthogonal ($\mathbf{x} \cdot \mathbf{w} = 0$), then $\|\mathbf{x} + \mathbf{w}\|^2 = \|\mathbf{x}\|^2 + \|\mathbf{w}\|^2$. Pythagoras at its purest, without a triangle or an explicit right angle.
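For reference, the expansion follows from the definitions of the norm and the dot product in a few lines. The derivation below is one way to write it out, coordinate by coordinate:

```latex
\begin{aligned}
\|\mathbf{x} + \mathbf{w}\|^2
  &= (\mathbf{x} + \mathbf{w}) \cdot (\mathbf{x} + \mathbf{w})
     && \text{definition of the norm} \\
  &= \sum_{i=1}^{n} (x_i + w_i)^2
     && \text{definition of the dot product} \\
  &= \sum_{i=1}^{n} x_i^2 + 2 \sum_{i=1}^{n} x_i w_i + \sum_{i=1}^{n} w_i^2
     && \text{expand each square} \\
  &= \|\mathbf{x}\|^2 + 2\,\mathbf{x} \cdot \mathbf{w} + \|\mathbf{w}\|^2
\end{aligned}
```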

Cauchy-Schwarz, or why the geometric formula is legitimate

Many courses present the formula $\mathbf{x} \cdot \mathbf{w} = \|\mathbf{x}\| \, \|\mathbf{w}\| \cos(\theta)$ as a second definition of the dot product, dropped from the sky. That is not honest. The right reading is the other way round: we define the dot product algebraically (sum of coordinate-wise products), we prove a fundamental inequality, and that inequality is what makes the geometric formula legitimate.

The Cauchy-Schwarz inequality (Cauchy 1821, Schwarz 1888) states that for all vectors $\mathbf{x}, \mathbf{w} \in \mathbb{R}^n$:

$$|\mathbf{x} \cdot \mathbf{w}| \leq \|\mathbf{x}\| \cdot \|\mathbf{w}\|$$

Equality holds if and only if the two vectors are colinear (one is a scalar multiple of the other).
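One standard argument for the inequality, sketched here as an illustration and not necessarily the route the course takes, applies the squared-norm expansion to $\mathbf{x} + t\,\mathbf{w}$ for a real number $t$. If $\mathbf{w} = \mathbf{0}$ both sides are zero, so assume $\mathbf{w} \neq \mathbf{0}$:

```latex
\begin{aligned}
0 \;\leq\; \|\mathbf{x} + t\,\mathbf{w}\|^2
  &= \|\mathbf{x}\|^2 + 2t\,(\mathbf{x} \cdot \mathbf{w}) + t^2\,\|\mathbf{w}\|^2
  && \text{for every } t \in \mathbb{R} \\[4pt]
\text{A quadratic in } t \text{ that is never negative}
  &\text{ has discriminant } \leq 0: \\[4pt]
4\,(\mathbf{x} \cdot \mathbf{w})^2 - 4\,\|\mathbf{x}\|^2\,\|\mathbf{w}\|^2 &\leq 0 \\[4pt]
|\mathbf{x} \cdot \mathbf{w}| &\leq \|\mathbf{x}\|\,\|\mathbf{w}\|
\end{aligned}
```

The discriminant is zero exactly when the quadratic has a real root $t$, that is when $\mathbf{x} + t\,\mathbf{w} = \mathbf{0}$, which is the colinear case.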

Now that we have the bound $|\mathbf{x} \cdot \mathbf{w}| \leq \|\mathbf{x}\| \, \|\mathbf{w}\|$, we can divide safely. For non-zero $\mathbf{x}, \mathbf{w}$, we define:

$$\cos\theta \;:=\; \dfrac{\mathbf{x} \cdot \mathbf{w}}{\|\mathbf{x}\| \, \|\mathbf{w}\|}$$

This quantity lies in $[-1, 1]$ thanks to Cauchy-Schwarz. We can therefore identify it as the cosine of a unique angle $\theta \in [0, \pi]$. Rearranging gives back exactly the geometric formula:

$$\mathbf{x} \cdot \mathbf{w} = \|\mathbf{x}\| \, \|\mathbf{w}\| \cos(\theta)$$

But this formula is no longer a mysterious postulate: it is a direct consequence of the algebraic definition and of Cauchy-Schwarz.

The usual consequences of the geometric formula can be read straight off the cosine function:

  • If $\theta = 0$ (vectors aligned, same direction), $\cos(\theta) = 1$: maximum dot product.
  • If $\theta = 90°$ (vectors perpendicular), $\cos(\theta) = 0$: zero dot product.
  • If $\theta = 180°$ (vectors opposite), $\cos(\theta) = -1$: minimum dot product.

Geometrically, the dot product measures how much two vectors point in the same direction, weighted by their lengths. That is precisely what a neuron wants to know: “do my inputs look like my weights?”

In machine learning, Cauchy-Schwarz also guarantees that you can always normalise a dot product by the lengths to get a measure in $[-1, 1]$, called cosine similarity. It is the basic tool to compare two vector representations, for example two word or sentence embeddings.
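Cosine similarity and the angle it defines can be sketched in plain Python as follows. The function names and test vectors are illustrative assumptions; the clamp before `math.acos` is a practical guard, since floating-point rounding can push the ratio marginally outside $[-1, 1]$ even though Cauchy-Schwarz rules that out in exact arithmetic.

```python
import math


def dot(x, w):
    return sum(xi * wi for xi, wi in zip(x, w))


def norm(x):
    return math.sqrt(dot(x, x))


def cosine_similarity(x, w):
    """Dot product normalised by both lengths; undefined for a zero vector."""
    nx, nw = norm(x), norm(w)
    if nx == 0 or nw == 0:
        raise ValueError("cosine similarity is undefined for the zero vector")
    return dot(x, w) / (nx * nw)


def angle_degrees(x, w):
    """Angle theta in [0, 180] degrees between two non-zero vectors."""
    c = max(-1.0, min(1.0, cosine_similarity(x, w)))  # guard against rounding
    return math.degrees(math.acos(c))


same = cosine_similarity([1, 2], [2, 4])    # same direction
perp = cosine_similarity([1, 0], [0, 1])    # perpendicular
opp = cosine_similarity([1, 2], [-1, -2])   # opposite direction
print(same, perp, opp, angle_degrees([1, 0], [0, 1]))
```

Because the lengths are divided out, two vectors pointing the same way score close to 1 regardless of how long they are.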

Transpose and matrix product

Before stacking neurons into a full layer, we still need two matrix operations that come up everywhere in deep networks: the transpose of a matrix and the product of two matrices. Both are the direct sequel to the dot product seen above.

The transpose

The transpose of a matrix $A \in \mathbb{R}^{m \times n}$, written $A^T$, is the matrix obtained by swapping its rows and columns. Formally:

$$A^T \in \mathbb{R}^{n \times m} \quad \text{with} \quad (A^T)_{ij} = A_{ji}$$

Concrete example with a $2 \times 3$ matrix that becomes $3 \times 2$:

$$A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix} \quad \Longrightarrow \quad A^T = \begin{pmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{pmatrix}$$

The first row of $A$ becomes the first column of $A^T$, and so on.
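As a quick illustration, a matrix stored as a list of rows can be transposed in one line of Python. Representing matrices as nested lists and the name `transpose` are assumptions made for this sketch.

```python
def transpose(A):
    """Swap rows and columns of a matrix stored as a list of rows."""
    # zip(*A) groups the first entries of every row, then the second, and so on,
    # so each tuple it yields is one column of A.
    return [list(column) for column in zip(*A)]


A = [[1, 2, 3],
     [4, 5, 6]]          # 2 x 3
At = transpose(A)        # 3 x 2
print(At)
```

Transposing twice gives back the original matrix, which is an easy sanity check on any implementation.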

The matrix-matrix product

The matrix product generalises the matrix-vector product. For $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$, the product $AB$ lives in $\mathbb{R}^{m \times p}$ and has entries:

$$(AB)_{ij} = \sum_{k=1}^{n} A_{ik} \, B_{kj}$$

In other words, entry $(i, j)$ of $AB$ is the dot product of row $i$ of $A$ with column $j$ of $B$.

Compatibility condition: the number of columns of $A$ must equal the number of rows of $B$ (both equal $n$ here). Otherwise the product is undefined.

A $2 \times 2$ worked example:

$$\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix} = \begin{pmatrix} 1 \cdot 0 + 2 \cdot 1 & 1 \cdot 1 + 2 \cdot 1 \\ 3 \cdot 0 + 4 \cdot 1 & 3 \cdot 1 + 4 \cdot 1 \end{pmatrix} = \begin{pmatrix} 2 & 3 \\ 4 & 7 \end{pmatrix}$$
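The entry formula maps onto a short triple loop. The sketch below is a plain-Python illustration, not an efficient implementation; the name `matmul` and the nested-list representation are assumptions.

```python
def matmul(A, B):
    """Matrix product of A (m x n) and B (n x p), both stored as lists of rows."""
    n = len(B)
    if any(len(row) != n for row in A):
        raise ValueError("columns of A must equal rows of B")
    p = len(B[0])
    # Entry (i, j) sums A[i][k] * B[k][j] over the shared index k.
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)]
            for i in range(len(A))]


A = [[1, 2],
     [3, 4]]
B = [[0, 1],
     [1, 1]]
AB = matmul(A, B)
print(AB)
```

Swapping the operands generally gives a different result: here `matmul(B, A)` is not equal to `AB`, a reminder that the matrix product is not commutative.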

The property $(AB)^T = B^T A^T$

A rule that comes back constantly in machine learning: the transpose of a product equals the product of the transposes in reverse order.

$$(AB)^T = B^T A^T$$

This is a short proof by direct comparison of entries.

Step 1. Compare entry $(i, j)$ of both matrices. By definition of the transpose:

$$\big( (AB)^T \big)_{ij} = (AB)_{ji}$$

Step 2. By definition of the matrix product:

$$(AB)_{ji} = \sum_{k=1}^{n} A_{jk} \, B_{ki}$$

Step 3. Inside the sum, recognise the transposed entries: $A_{jk} = (A^T)_{kj}$ and $B_{ki} = (B^T)_{ik}$. Rewrite:

$$\sum_{k=1}^{n} A_{jk} \, B_{ki} = \sum_{k=1}^{n} (B^T)_{ik} \, (A^T)_{kj}$$

Note the swapped order: $B^T$ now comes before $A^T$ and their indices chain correctly (they share the summation index $k$).

Step 4. That sum is exactly entry $(i, j)$ of the product $B^T A^T$:

$$\sum_{k=1}^{n} (B^T)_{ik} \, (A^T)_{kj} = (B^T A^T)_{ij}$$

Result. For all $i, j$, $\big( (AB)^T \big)_{ij} = (B^T A^T)_{ij}$, so the two matrices are equal: $(AB)^T = B^T A^T$. ∎
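A numerical check does not replace the proof, but it makes the reversed order tangible. The sketch below uses non-square matrices chosen for illustration, so that the shapes alone show why the order must flip; the helper names are assumptions.

```python
def transpose(A):
    return [list(column) for column in zip(*A)]


def matmul(A, B):
    n = len(B)
    if any(len(row) != n for row in A):
        raise ValueError("columns of A must equal rows of B")
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(len(B[0]))]
            for i in range(len(A))]


A = [[1, 2, 3],
     [4, 5, 6]]      # 2 x 3
B = [[1, 0],
     [2, 1],
     [0, 3]]         # 3 x 2

lhs = transpose(matmul(A, B))              # (AB)^T, a 2 x 2 matrix
rhs = matmul(transpose(B), transpose(A))   # B^T A^T, also 2 x 2
wrong_order = matmul(transpose(A), transpose(B))  # A^T B^T is 3 x 3

print(lhs == rhs, len(wrong_order), len(wrong_order[0]))
```

With these shapes, $A^T B^T$ is a $3 \times 3$ matrix, so it cannot even have the same size as $(AB)^T$.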

With the transpose and the matrix-matrix product in hand, we can finally stack the neurons of a layer into a single matrix and write the whole layer’s output in one go.

Stacking neurons

What is a matrix

A matrix of size $m \times n$ is a rectangular table of numbers arranged in $m$ rows and $n$ columns:

$$W = \begin{pmatrix} w_{11} & w_{12} & \cdots & w_{1n} \\ w_{21} & w_{22} & \cdots & w_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{m1} & w_{m2} & \cdots & w_{mn} \end{pmatrix} \in \mathbb{R}^{m \times n}$$

The entry $w_{ji}$ reads “row $j$, column $i$”. Each row is itself a vector in $\mathbb{R}^n$.

From a vector to a full layer

A single layer of $m$ neurons, each receiving $n$ inputs, can be written using a matrix of weights $W \in \mathbb{R}^{m \times n}$. Each row $\mathbf{w}_j$ of $W$ holds the weights of the $j$-th neuron.

To compute all the layer’s outputs at once, we use matrix-vector multiplication:

$$W \mathbf{x} = \begin{pmatrix} \mathbf{w}_1 \cdot \mathbf{x} \\ \mathbf{w}_2 \cdot \mathbf{x} \\ \vdots \\ \mathbf{w}_m \cdot \mathbf{x} \end{pmatrix}$$

It is an operation that takes a vector and returns another vector, where each coordinate of the result is a dot product. We will dig into this in chapter 5 when we build a multi-layer network.
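In code, the matrix-vector product is one dot product per row. The sketch below uses a made-up layer of two neurons with three inputs each; the weights, the input, and the name `matvec` are all illustrative assumptions, and biases and activations are left out.

```python
def dot(x, w):
    return sum(xi * wi for xi, wi in zip(x, w))


def matvec(W, x):
    """Matrix-vector product: one dot product per row of W."""
    if any(len(row) != len(x) for row in W):
        raise ValueError("each row of W must have the dimension of x")
    return [dot(row, x) for row in W]


# Hypothetical layer: m = 2 neurons, n = 3 inputs each.
W = [[2, 0, 1],    # weights of neuron 1
     [0, 1, -1]]   # weights of neuron 2
x = [1, 2, 3]

outputs = matvec(W, x)   # a vector of dimension m
print(outputs)
```

The result has one coordinate per neuron, which is why an $m \times n$ weight matrix maps an input of dimension $n$ to an output of dimension $m$.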

In one sentence

A vector is an ordered list of $n$ numbers. The dot product of two vectors is the sum of their coordinate-wise products. And a neuron computes exactly that dot product between its inputs and its weights, plus a bias.

On to chapter 3

The neuron does not stop at the dot product: it then applies an activation function $f$ to turn the raw result into something interpretable. Chapter 1 touched on it, chapter 2 laid down the mathematical pieces around it. Chapter 3 finally delivers the missing part.

You will see, with a matrix-based proof, that the absence of non-linearity makes a stack of matrices collapse into a single one. Concretely, if a first layer computes $\mathbf{h} = W_1 \mathbf{x}$ and a second one $\mathbf{y} = W_2 \mathbf{h}$, then:

$$\mathbf{y} = W_2 (W_1 \mathbf{x}) = (W_2 W_1) \, \mathbf{x}$$

Thanks to the matrix-matrix product you just learned, $W_2 W_1$ is a single matrix. Two activation-free layers collapse to one. That is the central theorem of chapter 3, and on its own it justifies the existence of non-linear activation functions (sigmoid, ReLU, tanh) that you will learn to compare and choose between.
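The collapse can be observed on a small example. The sketch below uses two made-up integer weight matrices (assumptions chosen so the arithmetic is exact) and compares applying the layers one after the other with applying the single matrix $W_2 W_1$.

```python
def matvec(W, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in W]


def matmul(A, B):
    n = len(B)
    if any(len(row) != n for row in A):
        raise ValueError("columns of A must equal rows of B")
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(len(B[0]))]
            for i in range(len(A))]


# Hypothetical activation-free layers: 2 inputs -> 3 neurons -> 2 neurons.
W1 = [[1, 2],
      [0, 1],
      [-1, 3]]         # 3 x 2
W2 = [[1, 0, 2],
      [3, -1, 0]]      # 2 x 3
x = [2, 1]

h = matvec(W1, x)                  # first layer
y_two_layers = matvec(W2, h)       # second layer applied to h

W_single = matmul(W2, W1)          # one 2 x 2 matrix
y_one_layer = matvec(W_single, x)

print(y_two_layers, y_one_layer, W_single)
```

Both routes give the same output for this input, and the equality holds for every input because matrix multiplication is associative.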

Exercises

Exercise 1: compute a dot product

Let $\mathbf{x} = (2, -1, 3)$ and $\mathbf{w} = (1, 4, -2)$. Compute $\mathbf{x} \cdot \mathbf{w}$.

Exercise 2: norm of a vector

Compute the norm of the vector $\mathbf{u} = (3, 4)$ using the definition.

Exercise 3: dot product and orthogonality

Find a value of $a$ such that the vectors $\mathbf{x} = (a, 1)$ and $\mathbf{w} = (2, 1)$ are orthogonal (zero dot product).

Exercise 4: prove $(AB)^T = B^T A^T$ on a concrete case

Let $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$ and $B = \begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix}$.

(a) Compute $AB$.

(b) Compute $(AB)^T$.

(c) Compute $A^T$ and $B^T$.

(d) Compute $B^T A^T$.

(e) Verify that $(AB)^T = B^T A^T$.

Sources

  • Anton, H. & Rorres, C. (2010). Elementary Linear Algebra, 10th edition. Wiley. Chapters 1 to 3 for the basics.
  • Strang, G. (2016). Introduction to Linear Algebra, 5th edition. Wellesley-Cambridge Press. MIT OCW companion course is free and excellent.

Further reading

  • Strang, G. (online course MIT 18.06 Linear Algebra). The reference course on linear algebra in major universities, freely available. ocw.mit.edu
  • 3Blue1Brown, series Essence of Linear Algebra. Remarkable visualisations of the key concepts. youtube.com/playlist
  • Goodfellow, I., Bengio, Y. & Courville, A. (2016). Deep Learning. MIT Press. Chapter 2 on ML-specific linear algebra. deeplearningbook.org
  • Lay, D. C., Lay, S. R. & McDonald, J. J. (2021). Linear Algebra and Its Applications, 6th edition. Pearson. A classical reference for applied linear algebra, balanced between theory and exercises.
  • Axler, S. (2024). Linear Algebra Done Right, 4th edition. Springer. A conceptual approach that delays determinants as long as possible and emphasises the structure of vector spaces. Free open-access PDF
Quiz

  1. What does the dot product of two vectors return?
  2. If two vectors are orthogonal, their dot product equals:
  3. The norm of a vector x = (x₁, x₂, ..., xₙ) is:
  4. Why do a neuron's weights share the dimension of its inputs?
  5. What does the matrix-vector operation W·x compute, where W is m×n and x has dimension n?
  6. If A is a 3×4 matrix and B a 4×2 matrix, what is the size of the product AB?