Searching by meaning: vector databases and retrieval · 01 / 06

Embeddings and the geometry of similarity

Meaning as a position in space, and three ways to measure how close two meanings are.

The foreword made a promise: to search by meaning, not by characters. But a promise is not a method. If meaning must become a position in space, then two very concrete questions arise. What exactly is a position here? And how do we measure that two positions are close, given that a perfectly aligned but distant word might well lose to an approximate but very nearby one?

This chapter answers both. It lays the building block on which the entire course rests: representing a text as a vector, and choosing the right rule to measure how close two such vectors are.

Meaning as a position

Here is the founding idea, and it is older than modern computing. The linguist John Firth summed it up in 1957 with a phrase that has remained famous: “you shall know a word by the company it keeps” (Firth, 1957). “Cat” and “feline” appear in the same contexts, surrounded by the same neighbors (“purrs”, “paws”, “whiskers”), so they have a similar meaning. This idea has a name, the distributional hypothesis: a word is characterized by the contexts in which it appears, so words that share contexts have close meanings. It is what justifies learning embeddings in which geometric proximity encodes proximity of meaning, and it is what makes semantic search possible.

How do we turn it into numbers? We hand that work to a neural network, the building block covered in the course “Neural Networks: Foundations and Mathematics”. By processing large amounts of text, the network learns to assign to each word or sentence an embedding: a vector, that is, an ordered list of real numbers, chosen so that similar meanings receive similar vectors (Mikolov et al., 2013). Typical dimensions range from a few hundred to a few thousand, for example 768 or 1536. We do not redo that learning here; it belongs to the foundations course. We start from the result: text goes in, a vector comes out.

Three ways to measure proximity

We have two vectors. We want a number that says how close they are. There are three classic measures, and the choice between them is not a detail: it changes who wins.

The dot product

The first building block is the dot product, an operation that takes two vectors of equal dimension and returns a single number. We multiply the two vectors coordinate by coordinate, then add everything up. It is exactly the computation a neuron performs between its inputs and its weights.

$$\mathbf{x} \cdot \mathbf{y} = \sum_{i=1}^{n} x_i \, y_i$$

This equation reads: the dot product of x and y is the sum, for each dimension i, of the product of the i-th coordinate of x by the i-th coordinate of y. It is a single number. It is large when the two vectors are both long and pointing in the same direction, and it equals zero when they are perpendicular.
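As a minimal sketch of the formula above, the dot product can be written in a few lines of plain Python. The function name `dot` and the example vectors are illustrative choices, not part of the course material.

```python
def dot(x, y):
    """Sum of coordinate-wise products of two vectors of equal dimension."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    return sum(xi * yi for xi, yi in zip(x, y))


x = [1.0, 2.0]
same_direction = [2.0, 4.0]   # x stretched by a factor of 2
perpendicular = [-2.0, 1.0]   # x rotated by 90 degrees

print(dot(x, same_direction))  # 1*2 + 2*4 = 10.0
print(dot(x, perpendicular))   # 1*(-2) + 2*1 = 0.0
```

The first result is large because the vectors are aligned, and it grows further if either vector is stretched; the second is zero because the vectors are perpendicular.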

The norm and the Euclidean distance

The length of a vector, its norm, is the square root of the dot product of the vector with itself.

$$\lVert \mathbf{x} \rVert = \sqrt{\mathbf{x} \cdot \mathbf{x}} = \sqrt{\sum_{i=1}^{n} x_i^2}$$

This is the Pythagorean theorem in n dimensions: the length is the square root of the sum of the squared coordinates. From this comes the most intuitive measure of proximity, the Euclidean distance: the length of the segment separating the two points, that is, the norm of their difference. It generalizes the distance between two points of the plane to n dimensions.

$$d(\mathbf{x}, \mathbf{y}) = \lVert \mathbf{x} - \mathbf{y} \rVert = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

This reads: the distance between x and y is the square root of the sum of squared coordinate-by-coordinate differences. The smaller it is, the closer the points are. Watch the direction: here, small means close, the opposite of the dot product.
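The two formulas above translate directly into code. This is a small sketch using only the standard library; the helper names `norm` and `euclidean` and the sample points are assumptions made for illustration.

```python
import math


def norm(x):
    """Length of a vector: square root of the sum of squared coordinates."""
    return math.sqrt(sum(xi * xi for xi in x))


def euclidean(x, y):
    """Distance between two points: the norm of their difference."""
    return norm([xi - yi for xi, yi in zip(x, y)])


x = [3.0, 4.0]
origin = [0.0, 0.0]

print(norm(x))               # sqrt(9 + 16) = 5.0
print(euclidean(x, origin))  # 5.0, the same segment measured as a distance
print(euclidean(x, x))       # 0.0, a point is at distance zero from itself
```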

Cosine similarity

Euclidean distance has a flaw for comparing meanings: it mixes two pieces of information, direction and length. Yet for a text, it is often the direction that carries the meaning, not the length of the vector. We therefore want a pure angle measure. That is cosine similarity: the cosine of the angle between the two vectors, a standard tool for comparing embeddings of words, sentences, or images.

$$\cos\theta = \frac{\mathbf{x} \cdot \mathbf{y}}{\lVert \mathbf{x} \rVert \, \lVert \mathbf{y} \rVert}$$

This reads: the cosine of the angle is the dot product divided by the product of the two lengths. You recognize the dot product, but stripped of the length effect by the division. The result lives in the interval from minus one to one: it equals one when the vectors point in the same direction, zero when they are perpendicular, and minus one when they are opposite.
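The following sketch implements the formula above and probes its three landmark values. The function name `cosine` and the test vectors are illustrative; the zero-vector check is an added safeguard, since the formula divides by the norms.

```python
import math


def cosine(x, y):
    """Dot product divided by the product of the two norms."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_x * norm_y)


x = [1.0, 2.0]

print(cosine(x, [3.0, 6.0]))    # same direction, three times longer: about 1
print(cosine(x, [-2.0, 1.0]))   # perpendicular: 0.0
print(cosine(x, [-1.0, -2.0]))  # opposite direction: about -1
```

Stretching the second vector from `[1.0, 2.0]` to `[3.0, 6.0]` triples the dot product but leaves the cosine unchanged up to floating-point rounding, which is the length effect the division removes.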


Play with the two arrows below. Watch the dot product, the norms, and the angle change together. Notice that stretching one arrow without rotating it makes the dot product climb but leaves the angle unchanged: that is exactly the difference cosine erases.

[Interactive figure: two arrows, x and w, drawn in the plane. Initial readouts: dot product x · w = 1.44; norms ‖x‖ = 1.34 and ‖w‖ = 1.34; angle θ = 36.9°.]

Play with the coordinates. When the arrows point the same way, the dot product is maximal. When they are perpendicular, it is zero.

When the measure changes the winner

Here is the moment that matters. As long as you compare two vectors, the choice of measure seems academic. But as soon as you rank several candidates against a query, the measure decides the ranking, and two measures can point to two different winners.

In the component below, a query (in color) faces several candidates. You can move the query, choose the measure, and toggle normalization on or off. The ranking recalculates live.

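[Interactive figure: the query “car” and four candidates, “automobile”, “jalopy”, “engine”, and “banana”, plotted in the plane, with a selector for the proximity measure.]

Candidate ranking (initial state):

  1. automobile, score 1.00
  2. jalopy, score 1.00
  3. engine, score 0.79
  4. banana, score -0.26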

Switch between cosine and Euclidean distance without normalization: "automobile" (aligned but long) and "jalopy" (close but short) swap positions. Then enable normalization and watch the two measures converge.

Three questions to ask yourself while exploring:

  1. Without normalization, who wins with cosine, “automobile” or “jalopy”? And with Euclidean distance? Why the disagreement?
  2. What happens to that disagreement when you enable normalization?
  3. Does “banana” ever rise in the ranking, regardless of the measure? What does its position tell you about the meaning assigned to it?
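The disagreement can also be reproduced outside the interactive component. The sketch below uses hypothetical 2-D coordinates, invented for illustration and not taken from the component: “automobile” is placed exactly in the direction of the query but far away, and “jalopy” slightly off-axis but very near.

```python
import math


def norm(v):
    return math.sqrt(sum(c * c for c in v))


def cosine(a, b):
    return sum(p * q for p, q in zip(a, b)) / (norm(a) * norm(b))


def euclidean(a, b):
    return norm([p - q for p, q in zip(a, b)])


def normalize(v):
    n = norm(v)
    return [c / n for c in v]


# Hypothetical coordinates, chosen for illustration only.
query = [2.0, 1.0]  # "car"
candidates = {
    "automobile": [5.0, 2.5],   # same direction as the query, but far out
    "jalopy": [1.6, 0.9],       # slightly off-axis, but very near
    "engine": [1.0, 3.0],
    "banana": [-2.0, 1.0],
}


def rank_by_cosine(q, items):
    # Higher cosine means closer, so sort in descending order.
    return sorted(items, key=lambda name: cosine(q, items[name]), reverse=True)


def rank_by_distance(q, items):
    # Smaller distance means closer, so sort in ascending order.
    return sorted(items, key=lambda name: euclidean(q, items[name]))


by_cosine = rank_by_cosine(query, candidates)
by_distance = rank_by_distance(query, candidates)

# Same comparison after bringing every vector to length one.
unit_candidates = {name: normalize(v) for name, v in candidates.items()}
by_distance_normalized = rank_by_distance(normalize(query), unit_candidates)

print(by_cosine)               # ['automobile', 'jalopy', 'engine', 'banana']
print(by_distance)             # ['jalopy', 'engine', 'automobile', 'banana']
print(by_distance_normalized)  # ['automobile', 'jalopy', 'engine', 'banana']
```

With these made-up coordinates, cosine rewards alignment and puts “automobile” first, while raw Euclidean distance rewards nearness and puts “jalopy” first. After normalization, the distance ranking matches the cosine ranking.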

Under the hood: why normalizing reconciles everything

The component reveals a striking fact: once vectors are normalized, the cosine ranking and the Euclidean distance ranking become identical. This is not a coincidence, and the reason fits in two lines of calculation.

Normalizing a vector means bringing it to a length of one by dividing it by its norm, without changing its direction. On normalized vectors, cosine similarity reduces to the dot product. Let us therefore place ourselves on the unit sphere, where all vectors have length one. The squared distance between two such vectors expands as follows:

$$d(\mathbf{x}, \mathbf{y})^2 = \lVert \mathbf{x} \rVert^2 + \lVert \mathbf{y} \rVert^2 - 2\,(\mathbf{x} \cdot \mathbf{y}) = 2 - 2\cos\theta$$

Since both norms equal one, only two minus two times the cosine remains. In other words:

$$\cos\theta = 1 - \frac{d^2}{2}$$

This relation says it all: on the unit sphere, the smaller the angle (cosine close to one), the smaller the distance. The two measures vary in opposite directions but in a strictly monotone way, so they produce exactly the same ranking. This is the deep reason why many vector databases normalize embeddings on ingestion: they can then use Euclidean distance, often simpler to index, while reasoning in effect on angles.
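The relation can be checked numerically. This sketch normalizes two arbitrary vectors (the values are illustrative) and compares the cosine, computed as a plain dot product of unit vectors, with one minus half the squared distance.

```python
import math


def normalize(v):
    """Divide a vector by its norm so that its length becomes one."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]


x = normalize([2.0, 1.0, -3.0])
y = normalize([0.5, 4.0, 1.0])

# On unit vectors the cosine is simply the dot product.
cos_theta = sum(a * b for a, b in zip(x, y))
d_squared = sum((a - b) ** 2 for a, b in zip(x, y))

print(cos_theta)
print(1 - d_squared / 2)  # agrees with cos_theta up to rounding error
```

Because the map from squared distance to cosine is strictly decreasing, sorting candidates by increasing distance or by decreasing cosine yields the same order.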

A surprise waiting in high dimension

All our intuition comes from the plane, where we draw arrows. It will betray us in the next chapter. In high dimension, a counter-intuitive phenomenon takes hold: if you draw two vectors at random in a thousand-dimensional space, they are almost always nearly perpendicular, so their cosine is close to zero. Distances, meanwhile, all look alike. The space becomes strangely empty and uniform. This phenomenon has a name, the curse of dimensionality, and it largely determines how hard vector search is. We will face it head-on in chapter 2.
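A small simulation makes the near-perpendicularity visible. It is only a sketch, under one stated assumption: “at random” is taken to mean that each coordinate is drawn independently from a standard normal distribution, which gives a direction with no preferred orientation. The sample size of 200 pairs and the list of dimensions are arbitrary choices.

```python
import math
import random


def random_cosine(dim, rng):
    """Cosine between two vectors with independent Gaussian coordinates."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    y = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


rng = random.Random(0)  # fixed seed so that the run is repeatable
mean_abs_cosine = {}
for dim in (2, 10, 100, 1000):
    samples = [abs(random_cosine(dim, rng)) for _ in range(200)]
    mean_abs_cosine[dim] = sum(samples) / len(samples)
    print(dim, round(mean_abs_cosine[dim], 3))
```

The average absolute cosine shrinks as the dimension grows: in the plane, random pairs are spread over all angles, while in a thousand dimensions almost every pair sits close to a right angle.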

Exercises

In one sentence

An embedding places meaning in a geometric space, and the measure we choose to compare those positions, dot product, distance, or cosine, decides the ranking: for the meaning of a text, it is the angle that matters, which normalization makes explicit.

Quiz
  1. Why do we often prefer cosine similarity over Euclidean distance for comparing text embeddings?

  2. What is the cosine similarity of two perpendicular vectors?

  3. After normalizing all vectors, what can be said about the cosine ranking and the Euclidean distance ranking?

Toward chapter 2

We now know how to represent meaning as a vector and measure proximity between two vectors. But a real engine does not compare two vectors: it has millions of them, and for each query it would theoretically need to scan them all. This exhaustive search is exact, but its cost grows with the number of vectors. Worse, the geometric intuition of the plane breaks down in high dimension, where real embeddings live. Chapter 2 faces both walls head-on: the cost of exact search, and the curse of dimensionality that makes the problem so peculiar.

Sources

  • Firth, J. R. (1957). “A synopsis of linguistic theory 1930-1955.” In Studies in Linguistic Analysis, 1-32. Blackwell.
  • Mikolov, T., Chen, K., Corrado, G. & Dean, J. (2013). “Efficient Estimation of Word Representations in Vector Space.” arXiv:1301.3781
  • Reimers, N. & Gurevych, I. (2019). “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.” EMNLP. arXiv:1908.10084

Further reading

  • Jurafsky, D. & Martin, J. H. (2024). Speech and Language Processing (3rd edition draft), ch. 6 “Vector Semantics and Embeddings”. A reference textbook, accessible and free online. web.stanford.edu/~jurafsky/slp3
  • Manning, C. D., Raghavan, P. & Schütze, H. (2008). Introduction to Information Retrieval, ch. 6 “Scoring, term weighting and the vector space model”. nlp.stanford.edu/IR-book