Neural networks: foundations and mathematics · 02 / 09
Essential linear algebra
Vectors, dot products and matrices: exactly what you need to speak neural-network language.
You saw in the previous chapter that the neuron formula reads in vector notation as y=f(w⋅x+b). This chapter sets up the mathematical bricks hiding behind that notation. The goal: to read w⋅x without flinching and to understand what it says geometrically.
The symbol $\mathbb{R}^n$ reads “R to the n” and denotes the set of all ordered lists of n real numbers. For a vector $x = (x_1, \dots, x_n)$, each $x_i$ is called the i-th coordinate or component of the vector.
Why vectors in machine learning
Anything you can describe as a list of numbers is a vector. A few examples:
A 28×28 greyscale image of a handwritten digit: a vector of dimension 784 (each pixel becomes a value between 0 and 1).
A medical patient described by age, blood pressure, glycemia, cholesterol: a vector of dimension 4.
The inputs of a neuron $x_1, x_2, x_3$ seen in chapter 1: a vector of dimension 3.
A neuron’s weights also form a vector, of the same dimension as its inputs. That correspondence is what makes the dot product possible.
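The dot product this chapter relies on is the sum of coordinate-wise products, $x \cdot w = x_1 w_1 + \cdots + x_n w_n$. A minimal pure-Python sketch follows; the function name `dot` and the sample values are illustrative choices, not taken from the course.

```python
def dot(x, w):
    """Dot product: sum of coordinate-wise products of two same-length vectors."""
    if len(x) != len(w):
        raise ValueError("vectors must have the same dimension")
    return sum(xi * wi for xi, wi in zip(x, w))


# Illustrative inputs and weights for a 3-input neuron (made-up values).
x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 2.0]

print(dot(x, w))  # 1*0.5 + 2*(-1) + 3*2 = 4.5
```

The dimension check mirrors the remark above: the operation only makes sense when the weights and the inputs have the same number of coordinates.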
Play with the coordinates. When the arrows point in the same direction the dot product is maximal. When they are perpendicular it is zero.
Figure: two vectors and their dot product (interactive)
2D plot drawing two vectors x=(1.2, 0.6) and w=(0.6, 1.2) from the origin. Sliders move the tips of both vectors and display three quantities simultaneously: the dot product x·w, the norms ‖x‖ and ‖w‖, and the angle between them. When the angle approaches 90°, the dot product drops to zero: the vectors become orthogonal.
Three experiments to try:
Align the two vectors on the same direction (say x=w). The dot product hits its maximum, equal to ∥x∥⋅∥w∥.
Place them perpendicularly (e.g. x=(1,0) and w=(0,1)). The dot product is exactly zero.
Flip the direction of w (negate $w_1$ and $w_2$). The dot product becomes negative, because the arrows point in opposite directions.
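The three experiments can also be reproduced without the interactive figure. The sketch below is only an illustration: the helper names `dot` and `norm` are assumptions, and the starting vector x=(1.2, 0.6) is borrowed from the figure description.

```python
import math


def dot(x, w):
    return sum(xi * wi for xi, wi in zip(x, w))


def norm(x):
    return math.sqrt(dot(x, x))


x = [1.2, 0.6]

aligned = dot(x, x)                           # experiment 1: w = x
perpendicular = dot([1.0, 0.0], [0.0, 1.0])   # experiment 2: right angle
flipped = dot(x, [-1.2, -0.6])                # experiment 3: w = -x

print(aligned, norm(x) * norm(x))  # the two values agree up to rounding
print(perpendicular)               # 0.0
print(flipped)                     # a negative number
```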
To go further and prove that the geometric formulation of the dot product is not a definition dropped from the sky but a genuine consequence of the algebraic one, we first need two tools: the norm of a vector and a computational identity about norms. We set them up now, then use them in the Cauchy-Schwarz section that follows.
Norm and distance
The norm (or length) of a vector x∈Rn is defined by the dot product with itself:
$$\|x\| = \sqrt{x \cdot x} = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$$
It generalises the Pythagorean theorem to n dimensions. In 2D, $\|(x_1, x_2)\| = \sqrt{x_1^2 + x_2^2}$, the length of the hypotenuse of a right triangle.
The distance between two vectors u and v is the norm of their difference: ∥u−v∥. That is how you measure “how similar two images are” when each image is represented as a vector.
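As a minimal sketch of these two definitions, assuming vectors are stored as plain Python lists (the names `norm` and `distance` and the sample values are illustrative):

```python
import math


def norm(x):
    """Euclidean norm: square root of the dot product of x with itself."""
    return math.sqrt(sum(xi * xi for xi in x))


def distance(u, v):
    """Distance between u and v: norm of their difference."""
    return norm([ui - vi for ui, vi in zip(u, v)])


print(norm([5.0, 12.0]))                  # sqrt(25 + 144) = 13.0
print(distance([1.0, 1.0], [7.0, 9.0]))   # norm of (-6, -8) = 10.0
```

For two images flattened into vectors of dimension 784, the same `distance` function would apply unchanged; a small value means the pixel values are close.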
A useful proof: the squared-norm expansion
One property comes back constantly in machine learning. We establish it once and for all, straight from the definitions:
$$\|x+w\|^2 = \|x\|^2 + 2\,x \cdot w + \|w\|^2$$
This identity is exactly the generalised Pythagorean theorem. Immediate consequence: if x and w are orthogonal ($x \cdot w = 0$), then $\|x+w\|^2 = \|x\|^2 + \|w\|^2$. Pythagoras at its purest, without a triangle or an explicit right angle.
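For reference, the expansion can be derived in three lines. This is one standard derivation, offered as a sketch; it assumes only that the dot product distributes over addition and is symmetric, both of which follow from its definition as a sum of coordinate-wise products.

```latex
\begin{aligned}
\|x+w\|^2 &= (x+w)\cdot(x+w) && \text{definition of the norm}\\
          &= x\cdot x + x\cdot w + w\cdot x + w\cdot w && \text{distributivity}\\
          &= \|x\|^2 + 2\,x\cdot w + \|w\|^2 && \text{symmetry: } w\cdot x = x\cdot w
\end{aligned}
```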
Cauchy-Schwarz, or why the geometric formula is legitimate
Many courses present the formula x⋅w=∥x∥∥w∥cos(θ) as a second definition of the dot product, dropped from the sky. That is not honest. The right reading is the other way round: we define the dot product algebraically (sum of coordinate-wise products), we prove a fundamental inequality, and that inequality is what makes the geometric formula legitimate.
Cauchy-Schwarz inequality: for all $x, w \in \mathbb{R}^n$, $|x \cdot w| \le \|x\|\,\|w\|$. Equality holds if and only if the two vectors are colinear (one is a scalar multiple of the other).
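One standard way to prove the bound $|x \cdot w| \le \|x\|\,\|w\|$ uses the squared-norm expansion with $-t\,w$ in place of $w$. The sketch below is a common textbook argument and not necessarily the one the course has in mind. It assumes $w \neq 0$ (if $w = 0$, both sides are zero) and introduces the scalar $t$ as a helper.

```latex
\begin{aligned}
\text{Let } t &= \frac{x\cdot w}{\|w\|^2}. \text{ A squared norm is never negative, so}\\
0 \le \|x - t\,w\|^2 &= \|x\|^2 - 2t\,(x\cdot w) + t^2\,\|w\|^2\\
                     &= \|x\|^2 - \frac{(x\cdot w)^2}{\|w\|^2}.\\
\text{Hence } (x\cdot w)^2 &\le \|x\|^2\,\|w\|^2,
\quad\text{that is}\quad |x\cdot w| \le \|x\|\,\|w\|.
\end{aligned}
```

Equality forces $\|x - t\,w\| = 0$, that is $x = t\,w$, which is the colinear case.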
Now that we have the bound ∣x⋅w∣≤∥x∥∥w∥, we can divide safely. For non-zero x,w, we define:
$$\cos\theta := \frac{x \cdot w}{\|x\|\,\|w\|}$$
This quantity lies in [−1,1] thanks to Cauchy-Schwarz. We can therefore identify it as the cosine of a unique angle θ∈[0,π]. Rearranging gives back exactly the geometric formula:
x⋅w=∥x∥∥w∥cos(θ)
But this formula is no longer a mysterious postulate: it is a direct consequence of the algebraic definition and of Cauchy-Schwarz.
The usual consequences of the geometric formula can be read straight off the cosine function:
If θ=0 (vectors aligned, same direction), cos(θ)=1: maximum dot product.
If θ=90° (vectors perpendicular), cos(θ)=0: zero dot product.
If θ=180° (vectors opposite), cos(θ)=−1: minimum dot product.
Geometrically, the dot product measures how much two vectors point in the same direction, weighted by their lengths. That is precisely what a neuron wants to know: “do my inputs look like my weights?”
In machine learning, Cauchy-Schwarz also guarantees that you can always normalise a dot product by the lengths to get a measure in [−1,1], called cosine similarity. It is the basic tool to compare two vector representations, for example two word or sentence embeddings.
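A minimal sketch of cosine similarity, assuming plain Python lists as vectors. The function names are illustrative, and the sample vectors are the ones from the figure description, x=(1.2, 0.6) and w=(0.6, 1.2).

```python
import math


def dot(x, w):
    return sum(xi * wi for xi, wi in zip(x, w))


def norm(x):
    return math.sqrt(dot(x, x))


def cosine_similarity(x, w):
    """cos(theta) = (x . w) / (||x|| ||w||), defined for non-zero vectors only."""
    nx, nw = norm(x), norm(w)
    if nx == 0 or nw == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot(x, w) / (nx * nw)


x = [1.2, 0.6]
w = [0.6, 1.2]

cos_theta = cosine_similarity(x, w)
# Clamp to [-1, 1] before acos to guard against floating-point rounding.
theta = math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))

print(round(cos_theta, 3))  # 1.44 / 1.8 = 0.8
print(round(theta, 1))      # about 36.9 degrees
```

Because the result is divided by both lengths, scaling either vector by a positive factor leaves the similarity unchanged; only the direction matters.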
Transpose and matrix product
Before stacking neurons into a full layer, we still need two matrix operations that come up everywhere in deep networks: the transpose of a matrix and the product of two matrices. Both are the direct sequel to the dot product seen above.
The transpose
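The transpose of a matrix $A \in \mathbb{R}^{m \times n}$, written $A^T$, is the matrix obtained by swapping its rows and columns. Formally:
$$A^T \in \mathbb{R}^{n \times m} \quad\text{with}\quad (A^T)_{ij} = A_{ji}$$
Concrete example with a 2×3 matrix that becomes 3×2:
$$A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix} \;\Longrightarrow\; A^T = \begin{pmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{pmatrix}$$
The first row of A becomes the first column of $A^T$, and so on.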
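In code, a matrix can be stored as a list of rows, and the transpose then amounts to regrouping entries by column. A short sketch under that assumption (the name `transpose` is illustrative), applied to a 2×3 matrix with entries 1 to 6:

```python
def transpose(A):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(column) for column in zip(*A)]


A = [[1, 2, 3],
     [4, 5, 6]]          # 2 x 3

print(transpose(A))      # [[1, 4], [2, 5], [3, 6]], a 3 x 2 matrix
```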
The matrix-matrix product
The matrix product generalises the matrix-vector product. For $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$, the product AB lives in $\mathbb{R}^{m \times p}$ and has entries:
$$(AB)_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}$$
Compatibility condition: the number of columns of A must equal the number of rows of B (both equal n here). Otherwise the product is undefined.
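The entry formula and the compatibility condition translate directly into a few lines of Python. This is a didactic sketch with matrices stored as lists of rows; the name `matmul` and the sample matrices are assumptions, and a numerical library would be the usual choice in practice.

```python
def matmul(A, B):
    """Matrix product of A (m x n) and B (n x p), both stored as lists of rows."""
    n = len(B)
    if any(len(row) != n for row in A):
        raise ValueError("columns of A must equal rows of B")
    p = len(B[0])
    # Entry (i, j) is the sum over k of A[i][k] * B[k][j].
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)]
            for i in range(len(A))]


A = [[1, 2, 3],
     [4, 5, 6]]      # 2 x 3
B = [[1, 0],
     [0, 1],
     [2, 2]]         # 3 x 2

print(matmul(A, B))  # [[7, 8], [16, 17]], a 2 x 2 matrix
```

Calling `matmul(A, A)` raises the `ValueError`, since A has 3 columns but only 2 rows.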
A rule that comes back constantly in machine learning: the transpose of a product equals the product of the transposes in reverse order.
$$(AB)^T = B^T A^T$$
This is a short proof by direct comparison of entries.
Step 1. Compare entry (i,j) of both matrices. By definition of the transpose:
$$\big((AB)^T\big)_{ij} = (AB)_{ji}$$
Step 2. By definition of the matrix product:
$$(AB)_{ji} = \sum_{k=1}^{n} A_{jk} B_{ki}$$
Step 3. Inside the sum, recognise the transposed entries: $A_{jk} = (A^T)_{kj}$ and $B_{ki} = (B^T)_{ik}$. Rewrite:
$$\sum_{k=1}^{n} A_{jk} B_{ki} = \sum_{k=1}^{n} (B^T)_{ik} (A^T)_{kj}$$
Note the swapped order: $B^T$ now comes before $A^T$ and their indices chain correctly (they share the summation index k).
Step 4. That sum is exactly entry (i,j) of the product $B^T A^T$:
$$\sum_{k=1}^{n} (B^T)_{ik} (A^T)_{kj} = (B^T A^T)_{ij}$$
Result. For all i,j, $\big((AB)^T\big)_{ij} = (B^T A^T)_{ij}$, so the two matrices are equal: $(AB)^T = B^T A^T$. ∎
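A numerical check does not replace the proof, but it helps to see the rule at work on non-square matrices, where the order visibly matters. The sketch below uses made-up matrices and illustrative helper names.

```python
def transpose(A):
    return [list(column) for column in zip(*A)]


def matmul(A, B):
    # Each entry is the dot product of a row of A with a column of B.
    return [[sum(a * b for a, b in zip(row, column)) for column in zip(*B)]
            for row in A]


A = [[1, 2, 0],
     [3, -1, 4]]      # 2 x 3
B = [[2, 1],
     [0, 5],
     [-3, 2]]         # 3 x 2

left = transpose(matmul(A, B))               # (AB)^T
right = matmul(transpose(B), transpose(A))   # B^T A^T

print(left)           # [[2, -6], [11, 6]]
print(left == right)  # True
```

With these shapes, the product of the transposes in the original order would be a 3×3 matrix, so it cannot equal the 2×2 matrix on the left.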
With the transpose and the matrix-matrix product in hand, we can finally stack the neurons of a layer into a single matrix and write the whole layer’s output in one go.
The entry $w_{ji}$ reads “row j, column i”. Each row of the matrix has n entries, so it is itself a vector in $\mathbb{R}^n$.
From a vector to a full layer
A single layer of m neurons, each receiving n inputs, can be written using a matrix of weights $W \in \mathbb{R}^{m \times n}$. Each row $w_j$ of W holds the weights of the j-th neuron.
To compute all the layer’s outputs at once, we use matrix-vector multiplication:
$$Wx = \begin{pmatrix} w_1 \cdot x \\ w_2 \cdot x \\ \vdots \\ w_m \cdot x \end{pmatrix}$$
It is an operation that takes a vector and returns another vector, where each coordinate of the result is a dot product. We will dig into this in chapter 5 when we build a multi-layer network.
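A minimal sketch of that operation, assuming W is stored as a list of rows with one row per neuron. The name `layer_output` and the weights are hypothetical, and the bias and the activation are deliberately left out to match the formula above.

```python
def dot(x, w):
    return sum(xi * wi for xi, wi in zip(x, w))


def layer_output(W, x):
    """Compute Wx: one dot product per row (per neuron) of W."""
    return [dot(row, x) for row in W]


# Hypothetical layer: m = 2 neurons, n = 3 inputs.
W = [[1.0, 0.0, -1.0],   # weights of neuron 1
     [0.5, 2.0, 1.0]]    # weights of neuron 2
x = [2.0, 1.0, 4.0]

print(layer_output(W, x))  # [-2.0, 7.0]
```

The input has dimension 3 and the output has dimension 2: the layer maps a vector of dimension n to a vector of dimension m.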
In one sentence
A vector is an ordered list of n numbers. The dot product of two vectors is the sum of their coordinate-wise products. And a neuron computes exactly that dot product between its inputs and its weights, plus a bias.
On to chapter 3
The neuron does not stop at the dot product: it then applies an activation function f to turn the raw result into something interpretable. Chapter 1 touched on it, chapter 2 laid down the mathematical pieces around it. Chapter 3 finally delivers the missing part.
You will see, with a matrix-based proof, that the absence of non-linearity makes a stack of matrices collapse into a single one. Concretely, if a first layer computes $h = W_1 x$ and a second one $y = W_2 h$, then:
$$y = W_2 (W_1 x) = (W_2 W_1)\, x$$
Thanks to the matrix-matrix product you just learned, $W_2 W_1$ is a single matrix. Two activation-free layers collapse to one. That is the central theorem of chapter 3, and on its own it justifies the existence of non-linear activation functions (sigmoid, ReLU, tanh) that you will learn to compare and choose between.
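The collapse can be observed numerically before it is proved in chapter 3. The sketch below uses small integer matrices chosen purely for illustration; `matvec` and `matmul` are assumed helper names.

```python
def matvec(W, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in W]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, column)) for column in zip(*B)]
            for row in A]


W1 = [[1, 2],
      [0, 1],
      [3, -1]]        # first layer: 2 inputs -> 3 outputs
W2 = [[1, 0, 2],
      [-1, 1, 0]]     # second layer: 3 inputs -> 2 outputs
x = [4, 5]

two_steps = matvec(W2, matvec(W1, x))   # W2 (W1 x), layer by layer
one_step = matvec(matmul(W2, W1), x)    # (W2 W1) x, a single matrix

print(two_steps)              # [28, -9]
print(two_steps == one_step)  # True
```

Integer entries keep the comparison exact; with floating-point weights the two results would agree only up to rounding.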
Exercises
Exercise 1: compute a dot product
Let x=(2,−1,3) and w=(1,4,−2). Compute x⋅w.
Exercise 2: norm of a vector
Compute the norm of the vector u=(3,4) using the definition.
Exercise 3: dot product and orthogonality
Find a value of a such that the vectors x=(a,1) and w=(2,1) are orthogonal (zero dot product).
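Exercise 4: verify $(AB)^T = B^T A^T$ on a concrete case
Let $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$ and $B = \begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix}$.
(a) Compute AB.
(b) Compute $(AB)^T$.
(c) Compute $A^T$ and $B^T$.
(d) Compute $B^T A^T$.
(e) Verify that $(AB)^T = B^T A^T$.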
Sources
Anton, H. & Rorres, C. (2010). Elementary Linear Algebra, 10th edition. Wiley. Chapters 1 to 3 for the basics.
Strang, G. (2016). Introduction to Linear Algebra, 5th edition. Wellesley-Cambridge Press. MIT OCW companion course is free and excellent.
Further reading
Strang, G. (online course MIT 18.06 Linear Algebra). The reference course on linear algebra in major universities, freely available. ocw.mit.edu
3Blue1Brown, series Essence of Linear Algebra. Remarkable visualisations of the key concepts. youtube.com/playlist
Goodfellow, I., Bengio, Y. & Courville, A. (2016). Deep Learning. MIT Press. Chapter 2 on ML-specific linear algebra. deeplearningbook.org
Lay, D. C., Lay, S. R. & McDonald, J. J. (2021). Linear Algebra and Its Applications, 6th edition. Pearson. A classical reference for applied linear algebra, balanced between theory and exercises.
Axler, S. (2024). Linear Algebra Done Right, 4th edition. Springer. A conceptual approach that delays determinants as long as possible and emphasises the structure of vector spaces. Free open-access PDF
Quiz
1. What does the dot product of two vectors return?
2. If two vectors are orthogonal, their dot product equals:
3. The norm of a vector x = (x₁, x₂, ..., xₙ) is:
4. Why do a neuron's weights share the dimension of its inputs?
5. What does the matrix-vector operation W·x compute, where W is m×n and x has dimension n?
6. If A is a 3×4 matrix and B a 4×2 matrix, what is the size of the product AB?
Quiz: chapter 2 recap (interactive)
Multiple-choice quiz checking the chapter's takeaways: the vector dot product, the geometric formulation x·w = ‖x‖ ‖w‖ cos(θ), orthogonality, the matrix-vector product and the dimension of the matrix-matrix product. To be taken online for automatic scoring.