At OffplanProperties.ai, I shipped a RAG system that indexed 300,000+ long-form real estate pages. The v1 was naive RAG: embed, top-k, stuff into context, generate. It worked in demos and failed in production.
Over six months, I rebuilt it as Agentic RAG, and the numbers speak for themselves: retrieval precision from 0.62 to 0.91, hallucination rate from 11% to under 2%, p95 latency still under 800ms.
Why Naive RAG Fails in Production
Naive RAG has three silent killers:
- Embedding drift: user queries and document chunks live in different vocabularies ("2-bed apartment near metro" vs. legal descriptions full of "ensuite" and "RERA")
- Chunking blindness: important context gets split across chunks and never retrieved together
- No recovery: when retrieval fails, the LLM hallucinates confidently instead of asking for clarification
Step 1: Hybrid Search (BM25 + Vector)
My first upgrade was combining BM25 keyword search with dense vector search using reciprocal rank fusion. This alone moved precision from 0.62 to 0.74.
# RRF fusion: rank is a document's 1-based position in each result list
from collections import defaultdict
scores = defaultdict(float)
for results in (bm25_results, vector_results):
    for rank, doc in enumerate(results, start=1):
        scores[doc] += 1 / (60 + rank)
Step 2: Cross-Encoder Rerank
After hybrid search gives you the top-50, rerank with a cross-encoder. Cross-encoders are slow at scale but precise, and you're only running them on 50 candidates.
I used BAAI/bge-reranker-large. Added 120ms to p95 but moved precision from 0.74 to 0.85.
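The rerank step is a thin wrapper around any pairwise scorer. A minimal sketch, where `score_fn` stands in for the actual model call (with sentence-transformers, that would be `CrossEncoder("BAAI/bge-reranker-large").predict` over (query, doc) pairs); the `rerank` helper itself is illustrative, not my exact production code:

```python
# Rerank wrapper: score each (query, doc) pair, keep the best top_n.
# score_fn is any pairwise scorer, e.g. a sentence-transformers CrossEncoder.
def rerank(query, candidates, score_fn, top_n=10):
    ranked = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return ranked[:top_n]
```

Because the wrapper is scorer-agnostic, you can swap rerankers (or stub one in tests) without touching the retrieval pipeline.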
Step 3: Query Transformation
Here's where it gets agentic. Before retrieval, a small LLM rewrites the user's query:
- Expansion: "cheap 2BR Marina" → "affordable two-bedroom apartments in Dubai Marina under AED 1.5M with sea view"
- Decomposition: "which new projects have pools and gyms near metro" → 3 sub-queries
- HyDE: hypothetical document generation to bridge the query/document vocabulary gap
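Each transformation is just a prompt template plus one call to a small LLM. A sketch, where `llm` is any prompt-in, completion-out callable; the templates and the `transform_query` helper are illustrative, not my production prompts:

```python
# One template per transformation mode; `llm` is any callable that
# takes a prompt string and returns the model's completion string.
PROMPTS = {
    "expand": "Rewrite this real-estate search query with explicit location, "
              "budget, and amenity terms:\n{query}",
    "decompose": "Split this query into independent sub-queries, one per line:\n{query}",
    "hyde": "Write a short property listing that would answer this query:\n{query}",
}

def transform_query(query, llm, mode="expand"):
    out = llm(PROMPTS[mode].format(query=query))
    if mode == "decompose":
        # Decomposition yields a list of sub-queries, one per output line.
        return [q.strip() for q in out.splitlines() if q.strip()]
    return out.strip()
```

Sub-queries from "decompose" are retrieved independently and their results fused, same as the hybrid-search step.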
Step 4: The Agentic Loop
The retrieval engine itself became an agent. Given a query, it:
- Decides whether to search, reformulate, or answer directly
- Runs retrieval, scores confidence
- If confidence is low, reformulates and retries (max 3 hops)
- If still low, asks the user a clarifying question instead of hallucinating
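The loop above fits in a few lines once the components are pluggable. A sketch with stand-in function parameters; the 0.7 confidence threshold is an illustrative default, not the value we shipped:

```python
def agentic_answer(query, retrieve, confidence, reformulate,
                   answer, ask_clarifying, threshold=0.7, max_hops=3):
    """Retrieve-score-retry loop; gives up and asks the user after max_hops."""
    q = query
    for _ in range(max_hops):
        docs = retrieve(q)
        if confidence(q, docs) >= threshold:
            return answer(q, docs)
        q = reformulate(q)  # low confidence: rewrite the query and retry
    return ask_clarifying(query)  # still low: ask instead of hallucinating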
This is what people mean by "Agentic RAG." It's not a buzzword; it's the difference between a system that is confidently wrong and one that knows when to stop.
Step 5: Evaluation Harness (RAGAS + LangSmith)
None of this matters if you can't measure it. I built an eval harness with RAGAS for retrieval metrics (context precision, recall, faithfulness) and LangSmith for end-to-end traces with human feedback.
Every PR runs 200 golden questions. If faithfulness drops below 0.9 or latency exceeds 1s, the build fails. Saved me from shipping three silent regressions in the first month.
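The gate itself is a small check over per-question results: RAGAS computes the faithfulness scores and LangSmith supplies the traces, but the pass/fail logic is simple. A sketch; the result-dict format is an assumption, and aggregating mean faithfulness with worst-case latency is one plausible reading of the thresholds above:

```python
def eval_gate(results, min_faithfulness=0.9, max_latency_s=1.0):
    """results: one dict per golden question, with 'faithfulness' (0-1,
    e.g. from RAGAS) and 'latency_s'. Returns False to fail the build
    if mean faithfulness drops below the floor or any single question
    blows the latency budget."""
    mean_faith = sum(r["faithfulness"] for r in results) / len(results)
    worst_latency = max(r["latency_s"] for r in results)
    return mean_faith >= min_faithfulness and worst_latency <= max_latency_s
```

Wire this into CI so a failing gate blocks the merge, and regressions become loud instead of silent.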
Takeaways
Naive RAG is a demo. Agentic RAG is production. The gap between them is measurement.
If you're building RAG in 2026, start with hybrid search + rerank + eval harness on day one. Don't ship without them.
