02 / 06 Exact search and the curse of dimensionality
  1. ← Searching by meaning: vector databases and retrieval
  2. 00 Foreword
  3. 01 Embeddings and the geometry of similarity
  4. 02 Exact search and the curse of dimensionality
  5. 03 HNSW: navigating a proximity graph
  6. 04 The landscape of ANN indexes
  7. 05 Testing the approximate: the differential oracle
Searching by meaning: vector databases and retrieval · 02 / 06

Exact search and the curse of dimensionality

Comparing every vector gives the perfect answer. Here is its price, and the trap that high dimension sets for our intuition.

In the previous chapter, we learned to represent meaning as a vector and to measure proximity between two vectors. But a real engine never compares just two: it has millions, and for each query it would need, in principle, to scan them all. Two questions arise. Is comparing everyone really a problem? And does our lovely planar intuition, made of arrows we draw, still hold when vectors live in a thousand dimensions?

This chapter looks both walls in the face. The first is a cost wall: the perfect search exists, but it comes at a price. The second is more insidious, because it defies intuition: in high dimension, space behaves in a way that will surprise you.

Exact search: comparing every vector

Let us start with the simplest approach, which is also the safest. We have a query, in the form of a vector, and a database of millions of vectors. We want the closest ones. The obvious method: compute the distance from the query to each vector in the database, then keep the best ones.

Finding the kk closest elements to a query has a name: it is the problem of nearest neighbors Nearest neighbors The problem of finding, among a set of points, the k points closest to a query under a distance or similarity measure. In vector search, k nearest neighbors (k-NN) means the k documents whose embedding is closest to the query's. . And the method we just described, comparing the query to everyone, is called exhaustive search Exhaustive search A strategy that compares the query against every vector in the database, one by one, to extract the closest ones. Also called linear scan or Flat index. It is exact by construction (it cannot miss anything) but its cost grows linearly with the number of vectors and their dimension, in O(n x d). It serves as the reference oracle for judging approximate methods. , or linear scan, or flat index in the jargon of vector databases.

It has one quality that nothing else will ever possess as fully: it is exact. Since it examines every vector, it cannot, by construction, miss any neighbor. The ranking it returns is the truth. Hold on to that word, truth, because it will become precious: from chapter 3 onwards, we will look for faster methods that accept being slightly wrong. To know whether they are wrong, we will need a perfect reference to compare them against. That reference is exactly exhaustive search. We say it serves as an oracle.

The cost wall

If exhaustive search is perfect, why does the rest of this course exist? Because it is slow, and we can say exactly how slow.

To compare the query to one vector in dimension dd, we must traverse all dd coordinates: on the order of dd operations. To compare it to nn vectors, we repeat nn times. The total cost of one query is therefore proportional to n×dn \times d.

cost of one queryn×d\text{cost of one query} \propto n \times d

This expression reads: the cost grows as the product of the number of vectors by their dimension. We denote this behavior O(n×d)O(n \times d). When the database is small, nobody complains. But put in realistic numbers. A database of ten million documents, embeddings of dimension 1536: a single query requires on the order of fifteen billion multiplications. If your service receives one thousand queries per second, do the math, the wall arrives fast.

The curse of dimensionality

One might think the only problem is cost, and that being smarter would be enough to avoid comparing everything. That is the idea behind the following chapters. But before pursuing it, we need to understand why being smart is so difficult. The reason has nothing to do with the speed of machines: it lies in the geometry itself.

In low dimension, in the plane or in space, our intuition works. There are close points and distant points, dense regions and sparse regions. Finding the nearest neighbor has a clear meaning. In high dimension, this intuition breaks down. The set of phenomena that arise then has a name: the curse of dimensionality Curse of dimensionality A set of counterintuitive phenomena that appear when the number of dimensions grows large. In vector search, two effects dominate: distances between randomly drawn points concentrate (nearest and farthest become almost indistinguishable), and two random vectors are almost always nearly perpendicular. This is what makes nearest-neighbor search hard in high dimension. . Two of them alone decide the difficulty of all vector search.

The first: distance concentration Distance concentration The phenomenon by which, in high dimension, distances between randomly drawn points tighten around a common value. The relative contrast (maximum minus minimum distance, divided by the minimum) tends to zero like the inverse of the square root of the dimension. As a result, the notion of a nearest neighbor loses meaning when all distances look alike. . If you draw random points and measure their distances to a reference point, those distances, in high dimension, all look alike. The nearest is barely closer than the farthest. The very notion of nearest neighbor becomes blurry.

The second: quasi- orthogonality Orthogonality Two vectors are orthogonal when their dot product is zero. Geometrically, this matches a 90-degree angle between them. In machine learning, orthogonal inputs contribute independently to a neuron's computation. . Two vectors drawn at random in high dimension are almost always perpendicular, hence with a cosine similarity close to zero. Space is so vast that two directions taken at random almost never have anything in common.

Both claims are surprising. Do not take them on faith: go and see.

Seeing the curse

The component below draws a cloud of points whose every coordinate is random, in the dimension you choose. The left tab shows the distribution of distances to a query; the right tab shows the distribution of cosines between pairs of points. Slide the dimension from 1 up to 1000 and watch the two histograms transform.

0173350Number of points0.040.931.812.703.59Distance to query
Contrast (max - min) / min 90.648Relative spread 0.524
Distance from each point to the query. As dimension rises, the histogram narrows: the nearest and the farthest draw closer to each other, and the contrast collapses.

Three questions to ask yourself while exploring:

  1. As dimension climbs, does the distance histogram shift to the right, and what happens to its width? What is the contrast at dimension 2, and then at dimension 500?
  2. On the cosine tab, around what value does the bell center? What happens to its spread as dimension increases?
  3. At very high dimension, if all distances are roughly equal and all angles are right angles, how could an algorithm still guess where to look for the nearest neighbor without comparing everything?

Under the hood: why distances concentrate

The component shows a fact, it does not explain it. The explanation is elegant, and it fits in a few lines. It rests on an idea you may already know: when you add many small independent contributions together, the total becomes remarkably stable.

Take two points x\mathbf{x} and y\mathbf{y} whose every coordinate is drawn at random, independently, from the same distribution. Their squared distance, that is the square of the norm Norm The length of a vector, measured as the square root of the sum of its squares. For a vector x = (x₁, ..., xₙ), the norm ‖x‖ = √(x₁² + ... + xₙ²). It is the generalisation of the Pythagorean theorem to n dimensions. of their difference, is, as we saw in chapter 1, a sum over the dd dimensions.

d(x,y)2=i=1d(xiyi)2d(\mathbf{x}, \mathbf{y})^2 = \sum_{i=1}^{d} (x_i - y_i)^2

This equation reads: the squared distance is the sum, over each of the dd dimensions, of the square of the gap between the two points in that dimension. Look carefully at each term in the sum: (xiyi)2(x_i - y_i)^2. Since the coordinates are drawn independently from the same distribution, these dd terms are themselves independent and identically distributed. Call mm the mean of a single term and vv its variance. These are fixed numbers, independent of the dimension.

A sum of dd independent identically distributed terms has an expectation and a variance that add up term by term.

E[d(x,y)2]=d×m\mathbb{E}\big[d(\mathbf{x}, \mathbf{y})^2\big] = d \times m

This line reads: on average, the squared distance equals dd times the mean of one term. It therefore grows proportionally to the dimension. That is why, in the component, the histogram shifts to the right as dd increases: points are on average farther apart.

Var[d(x,y)2]=d×v\mathrm{Var}\big[d(\mathbf{x}, \mathbf{y})^2\big] = d \times v

The variance also grows like dd. But variance is not the right quantity to judge spread: we need the standard deviation, which is its square root.

σ[d(x,y)2]=d×v=vd\sigma\big[d(\mathbf{x}, \mathbf{y})^2\big] = \sqrt{d \times v} = \sqrt{v}\,\sqrt{d}

The standard deviation therefore grows like d\sqrt{d}, not like dd. That is the decisive point. Let us compare the spread to the typical value by forming their ratio, which is called the relative spread.

σmean=vdm×d=vm×1d\frac{\sigma}{\text{mean}} = \frac{\sqrt{v}\,\sqrt{d}}{m \times d} = \frac{\sqrt{v}}{m} \times \frac{1}{\sqrt{d}}

This last equation holds the entire secret. It reads: the relative spread of squared distances is a fixed constant times one over the square root of dd. As dimension explodes, this factor 1/d1/\sqrt{d} crushes everything: the relative spread tends to zero.

relative spread proportional to 1 / sqrt(d), which tends to 0 as d grows
Read it aloud

In other words, all distances grow together, but their gap relative to their size disappears. The cloud of all distances shrinks to a thin shell around a common value. The nearest and the farthest become almost indistinguishable. This is not a flaw in the machines or the data: it is a property of the geometry in high dimension, demonstrated here in a few lines.

Measuring an imperfect search: recall at k

We now have the two reasons to abandon exhaustive search in production: it is too slow, and high dimension makes the terrain hostile. The following chapters will therefore build faster methods that accept occasionally missing a neighbor. But as soon as we accept being wrong, one question becomes unavoidable: wrong by how much?

We need a quality measure. The most common is recall at k Recall@k A measure of the quality of an approximate search: the fraction of the k true nearest neighbors (computed by exhaustive search) that the approximate method recovers in its top k results. A recall@k of 1 means no exact neighbor was missed; a recall@k of 0.8 means one exact neighbor in five escaped the search. . Its principle is direct, and it relies exactly on the oracle we talked about. We ask exhaustive search for the kk true nearest neighbors: that is the truth. We then ask the fast method for its kk best results. Recall at k is the fraction of the truth that the fast method recovered.

recall@k=number of true neighbors found in the top kk\text{recall@}k = \frac{\text{number of true neighbors found in the top } k}{k}

This formula reads: recall at k is the number of true neighbors that the fast method placed in its top kk, divided by kk. A recall at k of one means it missed nothing. A recall at k of 0.9 means one exact neighbor in ten slipped through. That is the unit of account that will follow us to the end of the course: every time we gain in speed, we will ask what it costs in recall.

Exercises

In one sentence

Exhaustive search always returns the exact answer and serves as an oracle, but its O(n×d)O(n \times d) cost and the curse of dimensionality, which concentrates distances and makes vectors quasi-perpendicular, force us to build faster indexes whose error we will measure by recall at k.

Quiz
  1. 1. Why do we say that exhaustive search is exact?

  2. 2. How does the cost of one query grow under exhaustive search?

  3. 3. As dimension grows, what happens to the relative spread of distances between random points?

Towards chapter 3

We now know two uncomfortable things: comparing all vectors is exact but too slow, and high dimension blurs the very notion of proximity. How, then, can we find nearest neighbors without scanning everything? The idea of chapter 3 is surprisingly simple to state: what if vectors were linked to each other by a network of shortcuts, so that starting from any point and always moving towards a neighbor closer to the query, you arrive in a few hops near the goal? That navigable network exists, it is called HNSW, and it turns an O(n)O(n) search into a stroll of a few steps. We enter it in chapter 3.

Sources

  • Bellman, R. (1961). Adaptive Control Processes: A Guided Tour. Princeton University Press. (Origin of the expression “curse of dimensionality”.)
  • Beyer, K., Goldstein, J., Ramakrishnan, R. & Shaft, U. (1999). “When Is ‘Nearest Neighbor’ Meaningful?” ICDT. DOI 10.1007/3-540-49257-7_15
  • Aggarwal, C. C., Hinneburg, A. & Keim, D. A. (2001). “On the Surprising Behavior of Distance Metrics in High Dimensional Space.” ICDT. DOI 10.1007/3-540-44503-X_27

Further reading

  • Indyk, P. & Motwani, R. (1998). “Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality.” STOC. DOI 10.1145/276698.276876
  • Manning, C. D., Raghavan, P. & Schütze, H. (2008). Introduction to Information Retrieval, chap. 7. nlp.stanford.edu/IR-book
  • Bruch, S. (2024). Foundations of Vector Retrieval. Springer. A modern, rigorous synthesis of vector retrieval, from exact scan to approximate indexes. arXiv:2401.09350