Searching by meaning: vector databases and retrieval · 05 / 06

Testing the approximate: the differential oracle

An index can pass every test, two reviews, and still return bad results. How do you catch an algorithm that lies?

In the previous chapter, we placed HNSW right next to the ideal corner of the triangle, and one word came back with every measurement: we evaluate the recall of an approximate index by comparing it to the exact oracle. But let us pause on that word. When your engine reports a recall of 0.95, who guarantees it is not 0.62? An approximate search index does not throw an exception when it is wrong: it always returns k results, neatly sorted, looking perfect. It can pass every unit test and two code reviews, and silently miss a third of the true neighbors.

This chapter is the course’s judge. One question carries it, and it is unsettling: how do you test an algorithm whose right answer you do not know in advance, and which can be structurally flawless yet globally wrong?

The algorithm that passes the tests and lies anyway

Here is a true story, which happened while building the engine that runs as the thread of this course. A freshly written HNSW index passed the entirety of its unit tests. It did return k results. They were sorted from closest to farthest. The obvious neighbor of an easy query was found. Two reviews, one human and one assisted, had read it without flagging anything. Everything was green.

Its real recall was 0.62, where we were aiming for 0.90 and above. It silently threw away nearly four out of ten true neighbors. No test screamed, because no test looked at the only thing that mattered: the global quality of the result, measured against the truth.

Two tiny bugs hid there, invisible on reading. A binary heap built backwards, which on every saturation discarded the best candidate instead of the worst, so the search kept the bad ones and grew poorer as it advanced. And a connectivity bound halved by an extra .min(), which fragmented the graph. Two lines. Forty percent of neighbors lost.

Local property versus global quality

Why did the unit tests see nothing? Because they check local properties, and an approximate algorithm can satisfy all of them while collapsing globally.

A local property is a statement about the SHAPE of the result. The result holds k elements. They are sorted. They are distinct. All true here, in both buggy versions as in the correct one: the result stays structurally impeccable. These tests do not lie, they are simply blind to the question that matters.

Global quality, on the other hand, asks: among the k true nearest neighbors, how many did we retrieve? That is recall, defined in chapter 2. No example-based test reveals “we miss 38 % of the true neighbors”, because that truth is not local: it appears only by aggregating dozens of queries and confronting them with the right answer.

The differential oracle

The idea is as simple as it is powerful. We have two implementations of the same problem. One is fast but approximate, the one we want to ship. The other is slow but exact, and serves only as a reference. We run both on the same dataset, and measure the gap on the metric that truly matters. That is the differential oracle .

quality = gap(fast-approximate output , slow-exact output) on the metric that matters

The principle

For vector search, the exact implementation already exists, and we know it well: it is the exhaustive search of chapter 2, the flat index that compares the query against every vector. Slow, hence unusable in production, but exact by construction. It is the ground truth. The metric that matters is recall@k. The differential oracle of a vector index is therefore: for a sample of queries, compare the k neighbors returned by the fast index to the k true neighbors returned by the exhaustive scan, and average the recall.

This is not one more unit test. It is a test of a different nature, one that measures a continuous quality rather than a binary correctness, and that needs a second implementation to exist.

Watching the tests stay green while recall lies

The component below puts you in the shoes of someone debugging. A small beam-search index runs over a cloud of points, against the exhaustive oracle. On the left, the panel of local unit tests. On the right, the recall measured by the differential oracle, with its contractual threshold. Toggle either of the two real bugs, the inverted heap or the halved connectivity, and observe: the local tests stay stubbornly green while recall collapses under the threshold. This is exactly what you do not see when reading the code.

Beam width (ef) 14

QueryReturned neighborsTrue neighbors (oracle)Visited points

Click in the frame to move the query.

Unit tests (local properties)

Returns k results
Sorted by distance
All distinct

Tests green

Differential oracle (global quality)

Mean recall@k100 %

Recall of this query100 %

Contractual threshold90 %

Contract kept

The local tests stay green. Only the differential oracle sees the drop.

Three questions to ask yourself while playing:

Turn on the inverted heap. Which unit test turns red? None. By how much does the mean recall drop?
Switch off the bugs and move the query into a dense area, then into an isolated corner of the cloud. Would the recall of ONE easy query be enough to reassure you? Why does the oracle average over many queries?
With connectivity halved, look at the number of visited points. Does the fragmented graph visit more or fewer nodes, and what happens to recall?

The reflex generalizes very far

The differential oracle has nothing specific to vector databases. As soon as you have a fast version and a slow-but-safe version of the same computation, you hold an oracle. A cache against its source of truth. Optimized code against its naive implementation. A heuristic against the exact solution on small cases. Each time, the slow reference is the arbiter of the fast one.

When even the exact reference is too costly, there is still a way out: metamorphic testing . Rather than knowing the right answer, you check a RELATION that must hold between several runs. Sorting a list then sorting it again must change nothing. Permuting the order of inputs must not change the set of neighbors found. Doubling every coordinate must leave the ranking intact. These relations catch bugs without ever requiring the full oracle. The differential oracle is its simplest case, where the checked relation is equality to an exact reference.

Exercises

Solution to exercise 1: computing a recall and seeing the blind spot

On a query, the exact index (the oracle) gives as true 5 nearest neighbors the identifiers $\{3, 7, 1, 9, 2\}$ . The fast index returns $\{3, 7, 4, 9, 8\}$ , sorted from closest to farthest. We want recall@5, then to check the local properties.

Step 1. Recall@k is the fraction of true neighbors actually retrieved. We count how many elements of the oracle are present in the fast answer.

\{3, 7, 1, 9, 2\} \cap \{3, 7, 4, 9, 8\} = \{3, 7, 9\}

Step 2. There are 3 common elements out of 5 true neighbors.

\text{recall@5} = \frac{3}{5} = 0.6

Step 3. Now we check the local properties of the fast answer. It does hold 5 identifiers: “returns k results” is green. They are announced sorted by distance: “sorted” is green. They are all different: “distinct” is green.

Result. Recall is only $0.6$ , yet the three local tests pass. An example-based test that merely checked the shape of the result would never see that two true neighbors out of five are missing. Only the differential oracle, which knows $\{3, 7, 1, 9, 2\}$ , catches the drop.

Solution to exercise 2: designing the oracle of a cache

We wrote a cache that memoizes the result of a costly function f(x). We want a differential oracle to make sure it never returns a stale value. We have f itself, slow but correct.

Step 1. We identify the two implementations. The fast one is the cache (cache.get(x)). The slow-but-exact one is the original function (f(x)), which here is our ground truth.

Step 2. We choose the metric that matters. For a cache, it is strict equality: for every drawn x, the cache value must be exactly that of f.

\text{for each } x : \quad \text{cache.get}(x) = f(x)

Step 3. We avoid the self-reference trap. The oracle must NEVER compare the cache to itself, nor fill the expected value with the cache’s output: that would be a tautology freezing the bug. The reference is f, computed independently.

Result. The cache’s differential oracle draws varied x, computes f(x) separately, and demands equality. A stale entry (the cache returns the old value after f changes) makes the two diverge and turns red. The principle is the same as for the vector index: a slow, independent reference arbitrates the fast version.

In one sentence

An approximate algorithm can satisfy all its local properties and pass its reviews while collapsing in global quality: only a differential oracle, which confronts the fast version with a slow-but-exact reference on the metric that matters, catches this silent lie, and the threshold of that metric is a contract you never lower to silence the alarm.

Quiz

1. Why can an approximate search index pass all its unit tests yet be wrong?
2. What is a differential oracle for a vector index?
3. The differential oracle turns red below the promised recall threshold. What do you do?

Towards chapter 6

We now know how to trust an in-memory index: the differential oracle certifies that it really retrieves what it claims. But this index lives in RAM, and RAM is wiped on the slightest restart. Chapter 6 tackles durability: how do you write an HNSW graph to disk without ever corrupting it, even if the machine crashes mid-insertion? We will find there, transposed to persistence, exactly the reflex of this chapter: a round-trip differential oracle, which reloads the index from disk and demands that it be identical, byte for byte, to the one we had in memory. The same idea, from the geometric world to the world of storage.

Sources

Barr, E. T., Harman, M., McMinn, P., Shahbaz, M. & Yoo, S. (2015). “The Oracle Problem in Software Testing: A Survey.” IEEE Transactions on Software Engineering 41(5), 507-525. DOI 10.1109/TSE.2014.2372785
Chen, T. Y., Cheung, S. C. & Yiu, S. M. (2018). “Metamorphic Testing: A Review of Challenges and Opportunities.” ACM Computing Surveys 51(1). DOI 10.1145/3143561

Going further

Claessen, K. & Hughes, J. (2000). “QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs.” ICFP. DOI 10.1145/351240.351266