04 / 06 The landscape of ANN indexes
  1. ← Searching by meaning: vector databases and retrieval
  2. 00 Foreword
  3. 01 Embeddings and the geometry of similarity
  4. 02 Exact search and the curse of dimensionality
  5. 03 HNSW: navigating a proximity graph
  6. 04 The landscape of ANN indexes
  7. 05 Testing the approximate: the differential oracle
Searching by meaning: vector databases and retrieval · 04 / 06

The landscape of ANN indexes

Four index families, three riches you can never keep all at once: how to choose between recall, speed and memory?

In the previous chapter, HNSW handed us a small miracle: finding a nearest neighbor in about logn\log n hops, instead of scanning the nn vectors one by one. We even closed the chapter on a confession: it is superb, but it is neither free nor unique. The hierarchy of graphs devours memory, updating it or filtering its results is delicate, and above all, other ways to avoid the full scan exist, making entirely different trade-offs.

This chapter takes a step back. Rather than diving into yet another algorithm, we will draw the map of all its cousins, place HNSW where it belongs, and give ourselves a reading grid to choose. A single question guides us, and it is uncomfortable: if no index can be exact, fast and light all at once, which one sacrifices what, and how do we decide?

Three riches, and the law that forbids cumulating them

When you index millions of vectors, you covet three things at once.

Recall: actually retrieving the right neighbors, without missing any. It is the quality of the answer, exactly the measure we laid down in chapter 2 by comparing an index to the exhaustive oracle.

Latency: answering fast, by touching as few vectors as possible. A search that compares the query against the whole database is slow; a search that touches only a handful is fast.

Memory: fitting the index into available RAM. An embedding vector of dimension 1536 in single precision already weighs 6 kilobytes; multiplied by a hundred million documents, you are talking hundreds of gigabytes.

Here is the cruel law of this field: you never maximize these three riches together. It is a triangle whose only one side you can occupy at a time. Gaining speed means giving up touching every vector, hence risking missing some, or else paying more memory for shortcuts. Saving memory means compressing, hence losing precision. Each index family is, at heart, a deliberate choice about which corner of the triangle you accept to lose.

Two questions, two orthogonal axes

To avoid getting lost in the zoology of indexes, a very simple grid suffices. Every ANN index answers, in its own way, two independent questions.

First question: how do you avoid scanning everything? Three possible answers. Do nothing special and go through it all (the flat index). Carve the space into cells and visit only the most promising ones (partitioning). Link the vectors with a graph and navigate it step by step (HNSW’s approach, seen in the previous chapter).

Second question: how do you store and compare the vectors? Two answers. Keep them in full precision, exact but heavy. Or compress them into tiny codes, light but approximate.

These two axes are orthogonal: you can combine any answer to the first with any answer to the second. This is precisely what explains the proliferation of real indexes, which are often just assemblies of these two bricks.

The reading grid: how to avoid the scan (rows) crossed with how to store the vectors (columns). Real indexes are born from the crossing of the two axes.

Let us now tour the four landmark families, one per striking cell of this grid.

Flat: the exact, the slow, the oracle

The flat index does nothing to avoid the scan: it compares the query against all vectors, in full precision. It is the exhaustive search of chapter 2. Its recall is perfect, by construction: it cannot miss anything. But it touches the nn vectors on every query, and stores them all in the clear.

cost=O(n×d)recall=1\text{cost} = O(n \times d) \qquad \text{recall} = 1

This line reads: the cost grows like the product of the number of vectors by their dimension, and the recall equals exactly one. It is the worst of worlds for speed and memory, the best for quality. Hence its true role: we keep it as the oracle, the ground truth against which we measure the recall of all the others.

IVF: partitioning space into districts

The first real idea for going faster: do not search everywhere. We carve the space into cells, like a city into districts, and file each vector into the district of its nearest center. This carving is computed once, by a clustering algorithm (k-means) that places the centers where the points pile up.

At search time, we first compare the query only to the district centers, few in number. We pick the few nearest districts, and scan only their residents. This count of visited districts is called nprobe\mathit{nprobe}. It is IVF IVF (inverted file) Inverted File. An index that partitions the space into cells, computed by k-means, and files each vector into the cell of its nearest centroid. At search time, only the nprobe cells closest to the query are scanned, not the whole database. IVF wins latency without reducing memory, since the vectors are still stored in full. The nprobe number tunes the trade-off between speed and recall. , for Inverted File.

The trade-off is plain. With a small nprobe\mathit{nprobe}, we touch few vectors, so we answer fast, but we risk missing a neighbor lurking in a district we did not visit: recall drops. By raising nprobe\mathit{nprobe}, we visit more districts, recall climbs back, until we visit everything and become exact again. IVF thus buys speed with a little recall. But notice what it does not touch: memory. The vectors stay stored in the clear, exactly as in the flat index. IVF wins latency, not RAM.

PQ: compressing to fit the elephant

The other great idea attacks the orthogonal axis: memory. What if, instead of storing each vector in full, we summarized it with a few bytes?

Product quantization Product quantization Product Quantization (PQ). A vector compression technique: each vector is split into several slices, and within each slice the sub-vector is replaced by the index of the nearest centroid in a small learned dictionary (a codebook). A vector thus becomes a handful of codes, often one byte each, instead of hundreds of reals. Product quantization saves a great deal of memory, at the cost of reduced recall, since distances are only estimated. proceeds like this. We split each vector into several slices. For each slice, we learn ahead of time a small dictionary of typical pieces (a codebook), again by k-means. A vector then becomes just a sequence of indices: for each slice, the number of the most similar typical piece. Where a vector of dimension 1536 weighed 6 kilobytes, its eight to sixteen codes fit in a handful of bytes. We divide memory by a hundred or more.

The price reads in the word quantization: we have replaced each slice with an approximation, so the computed distances are now only estimated. Recall drops. And above all, beware the trap: product quantization, alone, does not gain any speed. We still scan the nn codes, one by one. Each comparison has merely become very cheap. PQ wins memory, not latency.

Seeing the trade-off triangle live

The component below builds a real set of clustered vectors, then genuinely measures the four families on it: their recall against the exact oracle, the number of vectors they compare (the honest proxy for latency) and the memory of their index. Each family is a point in the recall-latency plane, and the size of its bubble tells its memory. Turn the knobs and watch the points slide along their trade-offs, without any of them ever reaching the ideal top-right corner with a tiny bubble.

0.000.250.500.751.000150300450600Recall@10 (rightward is better)Vectors compared (upward is faster)Flat (exact)IVF (partition)HNSW (graph)PQ (compressed)

Bubble size: index memory

recallcomparisonsmemory
Flat (exact)100.0 %60075.0 kB
IVF (partition)100.0 %11080.3 kB
HNSW (graph)100.0 %79118.3 kB
PQ (compressed)35.6 %6008.7 kB
  • Flat: perfect recall, but touches everything and stores everything.
  • IVF: fewer comparisons, full memory.
  • HNSW: very few comparisons, big memory bubble.
  • PQ: small bubble, but comparisons still of order n.

Nobody reaches the perfect corner: high recall, few comparisons, small bubble.

Three questions to ask yourself while playing:

  1. Push IVF’s nprobe\mathit{nprobe} to its maximum. Does its point reach Flat’s recall? What happens then to the number of comparisons?
  2. Compare the bubbles of HNSW and Flat. Which is the bigger, and why does a graph cost more memory than a plain array of vectors?
  3. Reduce PQ’s number of slices. Does its bubble shrink or grow, and what happens to its recall? Does PQ’s horizontal position really move on the comparisons axis?

The families marry

The most beautiful part is that these bricks combine, precisely because the two axes are independent. The most widespread index at very large scale, IVFPQ, partitions the space (for speed) and compresses the vectors (for memory): it wins two sides of the triangle at once, sacrificing more recall. You can likewise graft quantization under a graph. The reading grid is therefore not a shelf of rival products, but a box of Lego: you assemble a routing strategy and a storage strategy according to the rich you are willing to sacrifice.

Exercises

In one sentence

No index maximizes recall, latency and memory all at once: Flat is exact but heavy and slow, IVF buys speed by partitioning, HNSW buys it even better but pays in memory, product quantization buys memory but still scans everything, and real indexes combine these bricks according to the side of the triangle they accept to sacrifice.

Quiz
  1. 1. Why do we speak of a trade-off triangle for ANN indexes?

  2. 2. What does product quantization (PQ) gain, and not gain, when used alone?

  3. 3. On which two orthogonal axes does every ANN index fall?

Towards chapter 5

All along this chapter, one word kept coming back without our daring to look it in the eye: we measure the recall of an approximate index by comparing it to the exact oracle. But that oracle, precisely, who provides it, and at what cost? Honestly measuring the quality of an index supposes knowing the true answer, hence running a reference exhaustive search, and confronting it methodically with the approximate results. Chapter 5 builds this judge: the differential oracle, the test bench that pits the fast index against the slow truth, measures recall and latency side by side, and turns the choice of an index, until now intuitive, into a quantified decision. It is the climax of the course, where we stop taking an index at its word.

Sources

  • Jégou, H., Douze, M. & Schmid, C. (2011). “Product Quantization for Nearest Neighbor Search.” IEEE Transactions on Pattern Analysis and Machine Intelligence 33(1), 117-128. DOI 10.1109/TPAMI.2010.57
  • Johnson, J., Douze, M. & Jégou, H. (2021). “Billion-scale Similarity Search with GPUs.” IEEE Transactions on Big Data 7(3), 535-547. arXiv:1702.08734

Going further

  • Subramanya, S. J. et al. (2019). “DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node.” NeurIPS. Publisher link
  • Malkov, Y. A. & Yashunin, D. A. (2018). “Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs.” IEEE TPAMI. arXiv:1603.09320