Foundations
Hybrid Search and Reranking
Finding the right vector is only half the battle. While embeddings provide a powerful way to navigate the "meaning" of data, they are not a universal solution for every retrieval challenge. In production environments, relying solely on semantic similarity often leads to surprising failures, especially when queries involve technical jargon, product codes, or specific names.
To solve this, modern search architecture often applies a two-stage process: Hybrid Search followed by Reranking.
Hybrid Search
Hybrid search is the practice of running two different search methodologies in parallel and merging their results into a single list. It combines the semantic depth of vector search with the literal precision of keyword search.
Even the most advanced embedding models can struggle with "out-of-vocabulary" terms. For example, if a user searches for a specific error code like ERR_90210, a vector model might retrieve documents about "general system errors" because it recognizes the concept of a failure.
A keyword search, however, will find the exact manual entry for that specific code instantly. By using both, the system ensures that it captures both the broad intent of the original query and its specific details.
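The gap described above is easy to demonstrate. The sketch below uses hypothetical documents and a deliberately simple exact-token matcher; the point is only that a literal match retrieves the precise manual entry for a rare token like ERR_90210, which an embedding model may blur into the general concept of "errors".

```python
# Toy illustration of keyword matching for rare, out-of-vocabulary tokens.
# The documents and ids here are invented for the example.

docs = {
    "manual_entry": "How to resolve error ERR_90210 on startup.",
    "overview": "A general guide to diagnosing system errors.",
}

def keyword_match(query, docs):
    """Return ids of documents containing every query token verbatim."""
    tokens = query.lower().split()
    return [doc_id for doc_id, text in docs.items()
            if all(tok in text.lower() for tok in tokens)]

print(keyword_match("ERR_90210", docs))  # → ['manual_entry']
```

A real system would use a scored keyword index such as BM25 rather than a boolean filter, but the exact-match behavior is the same.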
When you run two searches, you end up with two different lists of results, each with its own scoring system. Keyword search typically produces BM25 scores, while vector search produces cosine similarity values. Because these scales are mathematically incomparable, you cannot simply add them together.
The industry standard for merging these lists is Reciprocal Rank Fusion (RRF).
Reciprocal Rank Fusion is a ranking algorithm that merges multiple result lists by using document rank positions instead of raw scores, rewarding items that rank highly across multiple retrieval methods.
Instead of looking at the raw scores, RRF looks at the rank of each document in both lists. A document that appears near the top of both the keyword list and the vector list receives a significantly higher final score than a document that only appears in one.
This approach is favored in production because it is robust, requires no manual tuning, and effectively balances the strengths of both retrieval methods.
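The fusion step can be written in a few lines. The sketch below uses the standard RRF formula, summing 1 / (k + rank) for each list a document appears in, with the conventional constant k = 60; the document ids are invented for illustration.

```python
# Minimal sketch of Reciprocal Rank Fusion (RRF).
# Each input list is assumed to be ordered best-first.

def rrf_merge(result_lists, k=60):
    """Merge ranked lists by summing 1 / (k + rank) per document."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc_err_90210", "doc_errors_overview", "doc_setup"]
vector_hits = ["doc_errors_overview", "doc_faq", "doc_err_90210"]

fused = rrf_merge([keyword_hits, vector_hits])
```

Note that `doc_errors_overview`, which ranks near the top of both lists, ends up ahead of documents that score well in only one list, which is exactly the behavior described above.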
Reranking
The final step in a high-performance retrieval pipeline is the Reranker, also known as a Cross-Encoder.
A Reranker (or Cross-Encoder) is a computationally expensive model that jointly evaluates a query and each candidate result to produce a high-precision relevance score. It is used to reorder a small set of retrieved results so the most contextually correct information appears at the top.
While the initial retrieval stage (a Bi-Encoder) is designed for speed, it sacrifices some nuance to scan millions of documents quickly. A Reranker, by contrast, examines the query and each document together, trading speed for precision.

Because it is slow, we do not use it to search the whole database. Instead, we take the top 50 or 100 results from our hybrid search and pass them to the Reranker.
The Reranker then performs a deep analysis of the relationship between the user's question and the content of each document, reordering them to ensure the most relevant information is at the very top of the list.
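The rerank stage amounts to scoring each (query, candidate) pair jointly and keeping the best few. In the sketch below, `score_pair` is a stand-in for a real cross-encoder model; it is replaced here with a trivial token-overlap stub so the example runs without any model dependency, and the candidate texts are invented.

```python
# Sketch of the reranking stage over a small candidate set.
# score_pair stands in for a cross-encoder; a production system would
# call a trained model here instead of this token-overlap stub.

def score_pair(query, document):
    """Stub relevance score: fraction of query tokens found in the document."""
    q_tokens = set(query.lower().split())
    d_tokens = set(document.lower().split())
    return len(q_tokens & d_tokens) / len(q_tokens)

def rerank(query, candidates, top_n=3):
    """Score every candidate against the query and keep the top_n best."""
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]

candidates = [
    "Billing FAQ and refund policy.",
    "Fixing ERR_90210 requires clearing the cache.",
    "Release notes for version 2.1.",
]
best = rerank("how to fix ERR_90210", candidates, top_n=1)
```

Because every pair requires a full model pass, this loop is only affordable over the 50 to 100 candidates surviving hybrid search, never the whole corpus.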
Context Window Optimization
In the context of an AI agent, every piece of retrieved information occupies space in the model's limited context window. Hybrid search and reranking together serve as a high-fidelity filter: by the time the information reaches the LLM, the "noise" has been stripped away, leaving only the high-signal facts required to generate an accurate response.
