Foreword
Why search by meaning, what this course covers, and how to read it.
You type “car” into a search engine, and a document that only talks about “automobile” slips right through your fingers. Yet the two words mean the same thing. The engine was not looking for meaning: it was looking for a sequence of characters. The strings “car” and “automobile” do not match character for character, so to the engine they have nothing to do with each other. This course addresses exactly that flaw.
We are going to learn how to search by meaning. To do that, we first need to turn text into a geometric object, a point in a space, in such a way that two texts with similar meaning become two nearby points. This object has a name, the embedding, and it is the first building block of everything else. Chapter 1 defines it in detail.
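To make the idea of "nearby points" concrete before chapter 1, here is a minimal sketch. The two-dimensional vectors are invented by hand purely for illustration (real embeddings come from a trained model and have hundreds of dimensions), and the names `toy_embeddings` and `distance` are hypothetical.

```python
import math

# Hypothetical 2-D "embeddings", hand-picked for illustration only.
# A real model would produce these vectors; here we simply assume them.
toy_embeddings = {
    "car": (0.90, 0.10),
    "automobile": (0.85, 0.15),
    "cat": (0.10, 0.90),
}

def distance(word_a: str, word_b: str) -> float:
    """Euclidean distance between the points assigned to two words."""
    return math.dist(toy_embeddings[word_a], toy_embeddings[word_b])

d_synonyms = distance("car", "automobile")  # similar meaning
d_unrelated = distance("car", "cat")        # unrelated meaning

print(f"car/automobile: {d_synonyms:.3f}")
print(f"car/cat:        {d_unrelated:.3f}")
```

With these assumed coordinates, the two synonyms land much closer together than the unrelated pair, which is the property an embedding is meant to provide.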
Two ways to search
There are two broad families of search, and the entire course lives in the tension between them.
The first is lexical search: you compare words, characters. It is fast, it is exact, and it is unbeatable for finding a precise identifier or a rare word. But it is blind to synonyms: “car” and “automobile” are strangers to it, “cat” and “feline” too.
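The blindness to synonyms can be seen in a few lines. This is a deliberately naive sketch of lexical matching, assuming whitespace tokenization and exact word comparison; the function `lexical_match` and the sample documents are hypothetical, and real lexical engines are far more elaborate.

```python
def lexical_match(query: str, document: str) -> bool:
    """Return True if every query word appears verbatim in the document."""
    doc_words = set(document.lower().split())
    return all(word in doc_words for word in query.lower().split())

docs = [
    "the car is parked outside",
    "the automobile is parked outside",
]

# Only the document containing the literal word "car" is found.
hits = [d for d in docs if lexical_match("car", d)]
print(hits)
```

The second document describes the same situation, yet it is never returned, because no comparison of characters can reveal that the two words are synonyms.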
The second is semantic search: you compare meanings. It finds “automobile” starting from “car”, and even “how to fix an engine that stalls” starting from “my ride sputters at startup”. Its price: you have to represent meaning as numbers, and accept that you are no longer looking for an exact match but for proximity.
This second family has grown rapidly for a very concrete reason. To answer correctly, language models need to retrieve the right passages from a large body of documents. This marriage between meaning-based search and a model that writes has a name, retrieval-augmented generation (RAG), and it is the destination of this course.
The journey
The course follows a simple thread: represent meaning, retrieve it quickly, make it robust, then put it at the service of a model. Each chapter answers a limitation left open by the previous one.
Block 1: representing and measuring meaning
How a text becomes a vector, and how we measure that two vectors are close. Then exact search, its perfect guarantee, and the wall it hits when vectors number in the millions and live in a high-dimensional space.
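As a preview of exact search, here is a minimal sketch that compares the query against every stored vector. It assumes Euclidean distance and tiny hand-written vectors; the name `exact_search` and the data are hypothetical.

```python
import heapq
import math

def exact_search(query, vectors, k=2):
    """Return the indices of the k vectors closest to `query`.

    Every vector is scanned, so the result is guaranteed correct,
    but the cost grows with both the number of vectors and their dimension.
    """
    return heapq.nsmallest(
        k,
        range(len(vectors)),
        key=lambda i: math.dist(query, vectors[i]),
    )

# Illustrative data only.
vectors = [(0.0, 0.0), (1.0, 1.0), (0.9, 1.1), (5.0, 5.0)]
nearest = exact_search((1.0, 1.0), vectors, k=2)
print(nearest)
```

The guarantee comes from the full scan: nothing is skipped. The same scan is also the wall, since one query costs one distance computation per stored vector.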
Block 2: retrieving fast, without comparing everything
We give up exactness to gain speed. The HNSW graph and its neighbor-by-neighbor navigation, the landscape of index families and their trade-offs, and above all the tool that tells you whether an approximate index is good or silently lying.
Block 3: hardening, combining, serving
Making the index durable without corrupting it on the first crash, marrying it to lexical search for the best of both worlds, and finally connecting everything to a language model.
Who this course is for
- You already use a vector database (pgvector, Qdrant, FAISS…) without really knowing what runs underneath. You will finally see inside the box.
- You are building a RAG system and you want to understand why your results are sometimes off target. The answer is almost always in the chain we unpack here.
- You love seeing bridges between an abstract idea (meaning as geometry) and concrete data structures. That is precisely the journey of this course.
This course is not a tutorial for any particular library, nor a survey of the most recent research, which moves too fast for a course.
In one sentence
Searching by meaning means representing each text as a point in a space, measuring proximity between those points, then building all the tooling that allows finding the closest ones at scale, quickly and reliably.
On to chapter 1
Everything starts from an almost philosophical question. If meaning must become a position in space, then what exactly is a position here, and how do we measure that two positions are close? Should a word whose vector points in exactly the same direction but lies far away beat a word whose vector is slightly off in direction but very close? That is where, in the geometry of similarity, chapter 1 begins.
Sources
- Firth, J. R. (1957). “A synopsis of linguistic theory 1930-1955.” In Studies in Linguistic Analysis, 1-32. Blackwell. (Source of the distributional hypothesis.)
- Mikolov, T., Chen, K., Corrado, G. & Dean, J. (2013). “Efficient Estimation of Word Representations in Vector Space.” arXiv:1301.3781
- Lewis, P. et al. (2020). “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” NeurIPS. arXiv:2005.11401
Further reading before chapter 1
- Manning, C. D., Raghavan, P. & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press. The reference in the field, free online. nlp.stanford.edu/IR-book