Hybrid Search & Reranking for RAG

When we started building Alex, Vera, and Zia at PrudAI, we thought, like most of the industry, that vector search was the silver bullet for Retrieval-Augmented Generation (RAG). You dump your docs into a vector database, calculate cosine similarity between the query and the chunks, and call it a day.

In practice, we quickly realized that for technical documentation and specific customer inquiries, this approach is fundamentally flawed. In this post, I’ll dive into the technical shortcomings of pure semantic search and explain why we overhauled our architecture to use Hybrid Search (BM25 + Vector) combined with Cross-Encoder Reranking.

The Semantic Trap of Vector Embeddings

Vector embeddings (like OpenAI’s text-embedding-3-small or various HuggingFace models) are excellent at understanding concepts. If a user asks about "login issues," a vector search will effortlessly find documents about "authentication errors." That is the power of semantics.

However, the moment you deal with technical data, this system fails in predictable ways. Suppose a user searches for a specific serial number, an error code like 0x8004210B, or a specific function name in a codebase. An embedding model typically breaks these unique strings into subword fragments that carry little meaning on their own, and the resulting vector often lands too close to other, irrelevant technical terms.

The result? The vector database returns the top-k results that feel similar but miss the exact chunk containing the fix. For our agent Alex, which automates technical support, that is unacceptable. In engineering, "close enough" is still wrong.

Bringing Back BM25: Precision Over Intuition

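To solve this, we reintroduced BM25 (Best Matching 25) into our pipeline. BM25 is an evolution of TF-IDF: a statistical ranking function that weighs how often a query term occurs in a document against how rare that term is across the entire corpus, with term-frequency saturation and document-length normalization on top.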

Unlike vector search, BM25 prioritizes rare tokens. If the term ERR_CONNECTION_RESET only appears in two documents, BM25 will surface them immediately. It doesn't understand the "meaning," but it understands the uniqueness. By running this lexical search in parallel with our vector search, we catch the specific terms that embeddings smooth over.
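To make the "rare tokens win" behavior concrete, here is a minimal sketch of Okapi BM25 scoring in plain Python. It is an illustration, not our production code: the whitespace tokenizer, the parameter values (k1=1.5, b=0.75), and the three sample documents are all assumptions chosen for the example.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive whitespace tokenizer (an assumption for this sketch); it keeps
    # identifiers like ERR_CONNECTION_RESET intact as a single token.
    return text.lower().split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = tokenize(query)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Rare terms get a large IDF; the +1 inside the log keeps it positive.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency saturates (k1) and is normalized by document length (b).
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical sample corpus.
docs = [
    "Restart the router when ERR_CONNECTION_RESET appears in the browser",
    "Authentication errors are usually caused by expired tokens",
    "Connection problems can have many causes in the browser",
]
scores = bm25_scores("ERR_CONNECTION_RESET in the browser", docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

The first document wins because it is the only one containing the rare token, even though the third document shares the common words "in the browser". The second document shares no query terms at all and scores zero.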

Hybrid Search and Reciprocal Rank Fusion (RRF)

Having two sets of search results (one from the vector database and one from BM25) creates a new problem: how do you combine them? You can't simply add the scores because a cosine similarity of 0.82 isn't comparable to a BM25 score of 14.5.

We use Reciprocal Rank Fusion (RRF) for this. RRF is an algorithm that ranks results based on their position in the different lists rather than their raw scores. The formula is straightforward:

score = sum(1 / (k + rank))

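Where k is a constant (typically 60), rank is the document's position in a given result list, and the sum runs over every list in which the document appears. This ensures that documents ranking high in both systems move to the top, while outliers in either system still get a fair shake. This gave us a more stable baseline, but we weren't satisfied yet.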
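The formula translates almost directly into code. The sketch below assumes each retriever returns an ordered list of document IDs, best first, and that ranks start at 1. The function name and the sample IDs are hypothetical.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of document IDs into one ranking."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        # Ranks are 1-based: the top result contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID to keep the order stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical results from the two retrievers.
vector_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_d", "doc_a", "doc_b"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

Here doc_a and doc_b appear in both lists, so they rise above doc_d, even though doc_d was the top BM25 hit. Raw scores never enter the calculation, which is why the 0.82 versus 14.5 problem disappears.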

Why We Needed Cohere Rerank

Even with Hybrid Search, the context sent to an LLM is often still noisy. You might retrieve 20 documents, but you only need the top 3 or 5 for an accurate answer. How do you determine the absolute best?

This is where we introduced Cohere Rerank. In information retrieval, we distinguish between Bi-Encoders and Cross-Encoders.

Bi-Encoders (Vector Search) calculate the embedding of the query and the document independently. This is fast but loses nuance.
Cross-Encoders (Reranking) process the query and the document together in the model. This allows the model to directly weigh the interaction between the words in the question and the text in the document.

As detailed in Cohere’s research, the Reranker acts as a precision instrument. We send the top 25 results from our Hybrid Search to the rerank-v3 model. This model assigns a relevance score between 0 and 1 to each query-document pair.

The difference is night and day. Where a vector search might return a document because the "vibe" matches, the Reranker can notice that a crucial condition from the query is missing in the text. This allows us to send far less context to the LLM, which reduces costs and cuts down on hallucinations. The LLM no longer has to sift through 10 pages of noise.
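Structurally, the rerank stage is a small function: score each query-document pair jointly, sort, and keep the top few. The sketch below shows that shape with a pluggable scorer. Note the assumptions: toy_overlap_score is a hypothetical stand-in that only counts shared terms, not a real cross-encoder and not Cohere's API. In a real pipeline you would swap in a call to your reranking model.

```python
from typing import Callable, List, Tuple

def rerank(query: str,
           candidates: List[str],
           score_pair: Callable[[str, str], float],
           top_n: int = 3) -> List[Tuple[str, float]]:
    """Score each (query, document) pair jointly and keep the best top_n."""
    scored = [(doc, score_pair(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

def toy_overlap_score(query: str, doc: str) -> float:
    # Stand-in scorer (NOT a real cross-encoder): the fraction of query
    # terms that occur in the document, so the score falls in [0, 1].
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0

# Hypothetical candidates coming out of the hybrid search stage.
candidates = [
    "reset your password from the login page",
    "reset your password when two-factor authentication is enabled",
    "billing and invoices overview",
]
top = rerank(
    "reset password with two-factor authentication enabled",
    candidates,
    toy_overlap_score,
    top_n=2,
)
```

The second candidate ranks first because it covers the two-factor condition from the query, while the generic password article falls behind and the billing page is dropped entirely. A real cross-encoder makes the same kind of call on meaning rather than on literal term overlap.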

Impact on Alex and Vera

Since implementing this architecture, we’ve seen a significant increase in "Hit Rate" in our internal benchmarks. For complex queries regarding API specifications, the accuracy of retrieved context jumped from ~65% to over 92%.
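For readers who want to track a similar metric, one common definition of Hit Rate is the fraction of queries whose top-k results contain at least one relevant document. The sketch below uses that definition. The function name, the query IDs, and the tiny dataset are hypothetical and are not our benchmark data.

```python
def hit_rate_at_k(retrieved, relevant, k=5):
    """Fraction of queries whose top-k results contain a relevant document."""
    hits = 0
    for query_id, ranking in retrieved.items():
        # A query counts as a hit if any relevant ID shows up in the top k.
        if set(ranking[:k]) & relevant[query_id]:
            hits += 1
    return hits / len(retrieved)

# Hypothetical evaluation set: ranked results and labeled relevant IDs per query.
retrieved = {
    "q1": ["d3", "d1", "d7"],
    "q2": ["d2", "d9", "d4"],
    "q3": ["d5", "d6", "d8"],
}
relevant = {"q1": {"d1"}, "q2": {"d4"}, "q3": {"d0"}}

rate_at_3 = hit_rate_at_k(retrieved, relevant, k=3)
rate_at_2 = hit_rate_at_k(retrieved, relevant, k=2)
```

With k=3, two of the three queries are hits. Tightening to k=2 drops q2, whose only relevant document sits in third place. That is exactly the kind of case a reranker is meant to fix.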

At PrudAI, we don't believe in AI magic; we believe in robust engineering. Moving from simple vector search to a multi-stage retrieval pipeline is proof that the foundation of a great AI agent isn't just the model (GPT or Claude), but the quality of the data you feed it.

If you're still relying solely on vector_store.search(), you're leaving performance on the table. It’s time to take retrieval seriously.

Beau Jonkhout