
Retrieval-Augmented Generation (RAG) has become the default architecture for grounding language models in private, up-to-date knowledge. It is the reason a support agent can answer a question about your refund policy without you fine-tuning a model, and the reason an internal assistant can quote last week's design doc instead of inventing one. But anyone who has shipped RAG into production knows the rough edges: the retriever occasionally pulls back the wrong passage, or ranks the right passage at position twelve when the prompt only has room for the top five. The generation looks fine on the surface and quietly misses the point.
The bottleneck is almost never the language model. It is retrieval. Even when the right answer lives in your knowledge base, the first-pass search may not surface it at the top - and the model can only reason over what it sees. This is the gap reranking closes. A reranker takes a coarse list of candidates and reorders them with a much sharper notion of relevance, so that the documents the model finally reads are the ones it actually needs.
In this guide we will walk through how reranking works, which model families lead the space in 2026, how to wire one into a working pipeline, and the trade-offs that actually matter when you are running an AI support agent at scale.
Why Retrieval Alone Falls Short
A standard RAG pipeline embeds your documents into vectors, indexes them in a vector store, embeds the user query at request time, and pulls the top-k nearest neighbors. It is fast, it scales, and for many use cases it is good enough. The problem is the compression step. An embedding crushes a 500-token chunk into a single fixed-length vector. Subtle distinctions - the difference between "how do I cancel a subscription" and "how do I get a refund after canceling" - can collapse into similar coordinates, and the wrong chunk floats to the top.
Keyword retrieval like BM25 has the opposite failure mode. It is precise when terms match exactly, but it fumbles when a customer asks about "billing" and your docs say "invoicing." Hybrid retrieval - combining dense and sparse signals - softens both problems but does not solve them. You are still ranking with shallow features against an open-ended natural-language query.
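One common way to combine dense and sparse results is reciprocal rank fusion (RRF). It is not the only option, and the sketch below is an illustration rather than a description of any particular stack. The document IDs are made up, and `k=60` is the conventional default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list.

    Each document earns 1 / (k + rank) from every list it appears in,
    with rank starting at 1.
    """
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(fused, key=lambda d: (-fused[d], d))


# Hypothetical results from a dense (embedding) and a sparse (BM25) search.
dense_hits = ["refund-policy", "cancel-subscription", "invoice-faq"]
sparse_hits = ["invoice-faq", "refund-policy", "billing-contacts"]
fused_ranking = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Documents that both retrievers agree on rise to the top, but the fused list is still ordered by shallow signals, which is the gap the next stage addresses.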
Reranking adds a second, smarter pass. Instead of comparing two summary vectors, a reranker reads the query and each candidate document together and scores their relevance directly. It is slower per pair, but it only runs over a small, pre-filtered set, and the quality jump is usually large enough to be worth the latency.
How a Reranker Actually Works
The high-level pattern is a two-stage retrieval pipeline:
- Initial retrieval. A fast method - embedding similarity, BM25, or a hybrid of both - pulls a wide candidate set, usually somewhere between 25 and 200 documents. The goal here is recall: get the right answer somewhere into the bucket.
- Reranking. A heavier model scores each (query, candidate) pair and reorders the list. The top few - typically three to ten - are passed to the language model as context.
The split is what makes the architecture practical. You do not run a heavy cross-attention model over your entire corpus on every request; you run it over a few dozen candidates that survived the cheap filter. You spend compute where it actually changes the ranking and almost nowhere else.
The reason this is a real upgrade and not just an extra step is that a reranker can attend to specific tokens in both the query and each document at the same time. A bi-encoder embedding model has to commit to a single representation for a document before it ever sees a query. A cross-encoder reranker gets to look at the query first and then decide what matters in the document. That is a fundamentally richer signal, and it shows up in benchmark numbers and, more importantly, in user-facing answer quality.
The Reranker Landscape in 2026
A handful of distinct architectures dominate production deployments today. They make different bets on the cost-quality trade-off.
Cross-Encoders
Cross-encoders are the workhorse. The query and a candidate document are concatenated and fed through a transformer; the model outputs a single relevance score. Because the two sides interact at every layer, cross-encoders capture nuance that bi-encoders miss.
The trade-off is computational. You cannot precompute a cross-encoder representation of a document, because the score depends on the query. Every request runs the model fresh over each candidate. For a support agent fielding a few hundred queries per second, this matters; for most teams it is fine, especially if you cap the candidate set at 50 or so.
Open-source families like the BGE rerankers and Jina's reranker line are the typical starting point. They are small enough to self-host on a single GPU and accurate enough to deliver double-digit gains in MRR over a pure vector search baseline.
Multi-Vector Models
Multi-vector approaches like ColBERT split the difference. Instead of one embedding per document, a document is represented as a set of contextual token embeddings, and scoring is a late-interaction operation between query tokens and document tokens. Document representations can be precomputed and indexed, so query-time work is bounded by the number of query tokens.
The result is a model that gets close to cross-encoder quality with sub-second latency on large corpora. The complexity is in the index - you are storing many vectors per document - but for high-traffic systems where every millisecond matters, it is a strong fit.
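The late-interaction score can be sketched in a few lines. This is a toy version of the MaxSim idea, assuming token embeddings are plain lists of floats. Real systems use learned, normalized vectors and an approximate index, and the two-dimensional vectors here are invented for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """For each query token, take its best-matching document token, then sum."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy token embeddings: two query tokens, and two documents whose token
# vectors could have been computed and stored ahead of time.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a_tokens = [[0.9, 0.1], [0.1, 0.8], [0.5, 0.5]]
doc_b_tokens = [[0.9, 0.1], [0.7, 0.2]]

score_a = maxsim_score(query_tokens, doc_a_tokens)
score_b = maxsim_score(query_tokens, doc_b_tokens)
```

Document A wins because it has a strong match for both query tokens, while document B only covers the first. The outer loop runs once per query token, which is why query-time work scales with query length.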
LLM-Based Rerankers
The frontier models have changed what is feasible here. Claude Opus 4.7, GPT-5.5, and Gemini 3.1 Pro can be prompted to rank a list of candidates with a few words of instruction, and they will often beat a purpose-built reranker on out-of-domain queries. Open-weight agentic models like Kimi K2.6, GLM-5.1, and Qwen3.6 do the same job at a fraction of the cost.
The case for an LLM reranker is flexibility. You can encode arbitrary business rules in the prompt - "prefer recent KB articles over forum threads for billing questions" - and you can iterate on ranking behavior without retraining anything. The case against it is cost and latency. Even with prompt caching, calling a frontier model on every query to rerank twenty candidates is more expensive than running a fine-tuned cross-encoder, and it eats more of your latency budget.
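A minimal sketch of the prompt-and-parse side of an LLM reranker follows. The prompt wording, the function names, and the reply format are all assumptions made for illustration, and the model call itself is left out: `fake_reply` stands in for whatever text your provider returns.

```python
import re


def build_rerank_prompt(query, candidates, instruction=""):
    """Assemble a listwise ranking prompt; the wording is illustrative only."""
    lines = ["Rank the passages below by how well they answer the query."]
    if instruction:
        lines.append(f"Additional rule: {instruction}")
    lines.append(f"Query: {query}")
    for number, text in enumerate(candidates, start=1):
        lines.append(f"[{number}] {text}")
    lines.append("Reply with passage numbers only, best first.")
    return "\n".join(lines)


def parse_ranking(reply, n_candidates):
    """Turn a reply such as '3, 1' into zero-based indices, tolerating junk."""
    order = []
    for token in re.findall(r"\d+", reply):
        idx = int(token) - 1
        if 0 <= idx < n_candidates and idx not in order:
            order.append(idx)
    # Append anything the model skipped, in original order, so nothing is lost.
    order.extend(i for i in range(n_candidates) if i not in order)
    return order


candidates = ["Forum thread on billing", "KB article: refunds", "KB article: invoices"]
prompt = build_rerank_prompt(
    "how do I get a refund",
    candidates,
    instruction="prefer recent KB articles over forum threads for billing questions",
)
fake_reply = "2, 3"  # stand-in for the model's answer
reranked = [candidates[i] for i in parse_ranking(fake_reply, len(candidates))]
```

The defensive parsing matters more than the prompt: a model can repeat a number, invent one, or drop a candidate, and the pipeline should degrade to the first-pass order rather than fail.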
A reasonable compromise is to use an LLM reranker for hard or ambiguous queries - detected by low first-pass confidence - and a cheap cross-encoder for everything else.
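The routing rule might look something like this. "Low first-pass confidence" is not defined above, so the sketch assumes one possible proxy: the margin between the best and second-best first-pass scores. The 0.05 threshold is a placeholder you would tune on your own traffic.

```python
def first_pass_confidence(scores):
    """Margin between the best and second-best score (a hypothetical proxy)."""
    ranked = sorted(scores, reverse=True)
    if len(ranked) < 2:
        # Nothing to reorder, so treat the result as unambiguous.
        return float("inf")
    return ranked[0] - ranked[1]


def choose_reranker(scores, margin_threshold=0.05):
    """Send near-ties to the LLM reranker and clear winners to the cross-encoder."""
    if first_pass_confidence(scores) < margin_threshold:
        return "llm"
    return "cross_encoder"


ambiguous = choose_reranker([0.82, 0.81, 0.40])
clear_cut = choose_reranker([0.90, 0.60, 0.55])
```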
Hosted APIs
Cohere, Voyage, Jina, and a growing list of providers offer reranking as a managed endpoint. You pay per request, you get continuous model improvements, and you skip the infrastructure work. The trade-offs are familiar: less control, data leaves your environment, and you are betting on a third party's roadmap. For regulated industries where data residency matters, a self-hosted open-weight reranker is usually the safer play.
Building Reranking Into Your Pipeline
The mechanics are not complicated. The judgment calls around them are where most teams trip up.
Pick a Reranker That Matches Your Constraints
Three axes matter:
- Quality vs. latency. Cross-encoders score highest on standard benchmarks. Multi-vector models trade a few points of quality for a large latency win. LLM-based rerankers vary wildly depending on which model you call.
- Domain fit. A reranker trained or fine-tuned on text similar to yours will outperform a general-purpose one. Legal, medical, and code retrieval all benefit from domain-specific training. For mixed-domain support agents, a strong general model usually wins.
- Deployment surface. If you need on-prem or air-gapped, an open-weight reranker is your only real option. If you need to ship in a week, a hosted API is faster.
Wire It Into Your Existing Retriever
Most modern frameworks treat reranking as a pluggable post-retrieval step. The pattern is the same regardless of which library you use:
candidates = vector_store.search(query, top_k=50)
reranked = reranker.score(query, candidates)
context = reranked[:5]
answer = llm.generate(query, context)
If you are running on Berrydesk, this is wired up by default - you choose the model and the sources, and retrieval and reranking are tuned underneath. If you are building a custom stack, the integration is roughly that simple. The harder work is everything around it: chunking, evaluation, monitoring.
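To make the four-line pattern above concrete, here is a runnable toy version with no external dependencies. Everything in it is a stand-in: word overlap plays the role of the vector search, Jaccard similarity plays the role of the reranker, and the three-document corpus is invented. A real pipeline would swap in an actual retriever and reranker behind the same two function shapes.

```python
def tokenize(text):
    """Lowercase whitespace tokenizer; enough for a toy example."""
    return set(text.lower().split())


def first_pass_search(query, corpus, top_k):
    """Cheap stand-in for vector or BM25 search: rank by shared-word count."""
    q = tokenize(query)
    ranked = sorted(corpus, key=lambda doc: len(q & tokenize(doc)), reverse=True)
    return ranked[:top_k]


def toy_relevance(query, doc):
    """Stand-in for a cross-encoder score: Jaccard overlap of the word sets."""
    q, d = tokenize(query), tokenize(doc)
    return len(q & d) / len(q | d) if q | d else 0.0


def rerank(query, candidates, score_fn):
    """Score every (query, candidate) pair and sort best first."""
    return sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)


def retrieve_context(query, corpus, top_k=50, final_k=5, score_fn=toy_relevance):
    """Two-stage retrieval: wide cheap pass, then rerank and cut."""
    candidates = first_pass_search(query, corpus, top_k)
    return rerank(query, candidates, score_fn)[:final_k]


corpus = [
    "how do I cancel a subscription and how do I change a plan or update a card or add a seat",
    "Refund after canceling is available within 30 days",
    "Invoices are emailed on the first of each month",
]
query = "how do I get a refund after canceling"
candidates = first_pass_search(query, corpus, top_k=2)
context = retrieve_context(query, corpus, top_k=2, final_k=1)
```

In this toy run the first pass puts the long cancellation document on top because it shares more words with the query, and the second pass moves the refund document ahead of it. The scoring functions are far too crude for production, but the control flow is the part that carries over.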
Tune the Knobs That Matter
Reranking introduces a few hyperparameters that materially change behavior:
- Initial candidate count (top_k). Larger candidate sets give the reranker more to work with but raise cost and latency. For most support workloads, 30 to 50 is a good starting point. Below 20, the reranker often cannot recover from a poor first-pass; above 100, you stop seeing quality gains.
- Final cutoff. How many reranked documents do you actually pass to the model? With long-context models - Claude Sonnet 4.6 and Opus 4.6 at 1M tokens, Gemini 3.1 Ultra at 2M - you can afford to send more, but more is not always better. Irrelevant context can drown out the right answer. Five to ten well-ranked chunks beat fifty mediocre ones.
- Score threshold. Discard candidates below a relevance floor. This is the cleanest way to handle queries that have no good answer in your knowledge base - instead of feeding the model weak context, hand it nothing and let it say "I don't know."
- Caching. Customer support traffic is heavily repetitive. A simple cache keyed on normalized query text can serve a meaningful share of requests without ever hitting the reranker.
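The score threshold, final cutoff, and cache from the list above can be sketched together. The default values, the normalization rules, and the `RerankCache` class are illustrative assumptions; a production cache would also need eviction and invalidation when the knowledge base changes.

```python
def select_context(scored_docs, score_threshold=0.3, final_k=5):
    """Keep docs at or above the relevance floor, best first, capped at final_k."""
    kept = [(doc, score) for doc, score in scored_docs if score >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in kept[:final_k]]


def normalize_query(query):
    """Cache key: lowercase, collapse whitespace, drop trailing punctuation."""
    return " ".join(query.lower().split()).rstrip("?!. ")


class RerankCache:
    """In-memory cache in front of an expensive rerank function (hypothetical)."""

    def __init__(self, rerank_fn):
        self._rerank_fn = rerank_fn
        self._store = {}
        self.hits = 0

    def get(self, query):
        key = normalize_query(query)
        if key in self._store:
            self.hits += 1
        else:
            self._store[key] = self._rerank_fn(query)
        return self._store[key]


scored = [("refund policy", 0.91), ("holiday hours", 0.12), ("cancellation steps", 0.55)]
context = select_context(scored, score_threshold=0.3, final_k=5)
no_answer = select_context([("holiday hours", 0.05)])
```

An empty list from `select_context` is the signal to let the model say "I don't know" instead of answering from weak context.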
Handle Long Documents the Right Way
Most rerankers have an input limit, often a few thousand tokens. If your knowledge base contains long-form articles, runbooks, or PDFs, you have to chunk before you index. Two practical patterns:
- Chunk and rerank at the chunk level. Standard. Make sure chunks have enough overlap and context that they are interpretable in isolation.
- Aggregate scores per source document. If a single article matters more than a single chunk for your downstream prompt, score chunks independently then roll up to a parent score (max, mean, or top-k mean).
The chunking strategy ends up mattering more than the reranker choice in many deployments. A bad chunking pipeline guarantees bad rankings; a thoughtful one - semantic boundaries, metadata-rich chunks, careful overlap - gives even a mid-tier reranker something good to work with.
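The roll-up from chunk scores to a parent document score is simple to write down. The sketch below covers the three aggregations mentioned above; the function name, the `method` strings, and the sample scores are made up for illustration.

```python
from collections import defaultdict


def aggregate_by_document(chunk_scores, method="max", top_k=2):
    """Roll chunk-level scores up to one score per parent document.

    chunk_scores is an iterable of (doc_id, score) pairs.
    """
    grouped = defaultdict(list)
    for doc_id, score in chunk_scores:
        grouped[doc_id].append(score)

    def rollup(scores):
        if method == "max":
            return max(scores)
        if method == "mean":
            return sum(scores) / len(scores)
        if method == "topk_mean":
            best = sorted(scores, reverse=True)[:top_k]
            return sum(best) / len(best)
        raise ValueError(f"unknown method: {method}")

    return {doc_id: rollup(scores) for doc_id, scores in grouped.items()}


chunks = [
    ("refunds", 0.9), ("refunds", 0.2), ("refunds", 0.1),
    ("billing", 0.6), ("billing", 0.5),
]
by_max = aggregate_by_document(chunks, method="max")
by_mean = aggregate_by_document(chunks, method="mean")
by_topk = aggregate_by_document(chunks, method="topk_mean", top_k=2)
```

The choice changes the winner here: max favors the article with one excellent chunk, while mean favors the article that is consistently relevant. Top-k mean sits between the two and is less sensitive to how many chunks a document was split into.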
Long Context Versus Reranking: Do You Even Need RAG?
A fair question in 2026 is whether you still need a retrieval pipeline at all. Claude Opus 4.6 and Sonnet 4.6 ship with a 1M-token context at no surcharge. Gemini 3.1 Ultra extends that to 2M. DeepSeek V4 Pro and Flash are at 1M. You can stuff a small or mid-sized company's entire support knowledge base into a single prompt and let the model do its own retrieval over what it sees.
For some teams, this is genuinely the right answer. If your corpus is under a few hundred thousand tokens, fairly stable, and your latency budget allows, long-context prompting is simpler than RAG and often produces better answers because the model can reason over the full document set.
For most teams, RAG plus reranking still wins. Three reasons:
- Cost at scale. Sending 1M tokens of context on every request, even with prompt caching, costs more than retrieving and reranking a few thousand tokens. A typical Berrydesk deployment routes routine traffic to DeepSeek V4 Flash at $0.14 / $0.28 per million input/output tokens - but that math only works if you keep the context tight.
- Freshness. Knowledge bases change. Re-uploading an entire corpus on every update is fine for static documentation; it falls apart when product docs, ticket history, and policy pages evolve daily.
- Attribution. When a model cites its sources, customers and auditors want to see exactly which document it pulled from. A retrieval pipeline gives you that for free.
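The cost point is easy to check with back-of-the-envelope arithmetic. The sketch uses the $0.14 per million input tokens quoted above and assumes a 4,000-token reranked context as a stand-in for "a few thousand tokens". It ignores prompt-cache discounts and output tokens, so treat the ratio as a rough upper bound rather than a billing estimate.

```python
INPUT_PRICE_PER_MILLION = 0.14  # USD, the input price quoted above


def input_cost(tokens, price_per_million=INPUT_PRICE_PER_MILLION):
    """Input-side cost in USD for one request of the given size."""
    return tokens / 1_000_000 * price_per_million


full_context_cost = input_cost(1_000_000)  # whole corpus in every prompt
rag_context_cost = input_cost(4_000)       # assumed size of reranked context
cost_ratio = full_context_cost / rag_context_cost
```

Under these assumptions the full-context request costs about 250 times more on the input side, before counting the added latency of processing a million tokens.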
The honest framing in 2026: long context is a tuning lever, not a replacement. Use it to relax your chunk-size constraints, to send more reranked candidates than you used to, and to handle edge cases where a single long document needs to be reasoned over end-to-end. Keep RAG and reranking as the backbone.
Evaluating a Reranker Honestly
It is easy to flip on a reranker and see your demo queries get better. It is harder to know whether the system is actually winning on the long tail. A real evaluation pass needs three layers.
Offline IR Metrics
Build a labeled test set - a few hundred query-document relevance judgments is enough to start - and track standard information retrieval metrics:
- Mean Reciprocal Rank (MRR). How early does the first relevant document appear? A clean signal for whether the reranker is moving the right answer up.
- NDCG@k. Quality of the full ranking, weighted by position, with graded relevance. Better than MRR when more than one document is relevant.
- Recall@k. Did the right document make it into the set at all? Measure it on the first-pass candidates as well as on the final cutoff. If recall is already low on the candidate set, you do not have a reranking problem - you have a first-pass retrieval problem.
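All three metrics fit in a short, dependency-free module. The sketch below uses the linear-gain form of NDCG (gain divided by log2 of rank plus one); some libraries use an exponential gain instead, so numbers may not match across tools. The function names and input shapes are choices made for this example.

```python
import math


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none is found."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """runs is a list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def recall_at_k(ranked_ids, relevant_ids, k):
    """Share of the relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    found = len(set(ranked_ids[:k]) & set(relevant_ids))
    return found / len(relevant_ids)


def dcg_at_k(gains, k):
    """Discounted cumulative gain with linear gains."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids, graded_relevance, k):
    """graded_relevance maps doc_id to a gain; missing IDs count as 0."""
    gains = [graded_relevance.get(doc_id, 0) for doc_id in ranked_ids]
    ideal = sorted(graded_relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


runs = [(["a", "b"], {"a"}), (["a", "b"], {"b"})]
mrr = mean_reciprocal_rank(runs)
recall = recall_at_k(["a", "b", "c"], {"a", "c"}, k=2)
perfect = ndcg_at_k(["a", "b"], {"a": 2, "b": 1}, k=2)
swapped = ndcg_at_k(["b", "a"], {"a": 2, "b": 1}, k=2)
```

Running the same functions on the first-pass candidates and on the reranked list, over the same labeled queries, shows how much of the gain comes from each stage.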
Online Evaluation
Offline metrics are necessary but not sufficient. The system that matters is the end-to-end one. Track:
- Resolution rate. For a support agent, what share of conversations end with the user's problem solved without escalation?
- Citation accuracy. When the model cites a source, is it actually the source that answers the question? Easy to spot-check with a small judge model or sampled human review.
- Escalation patterns. Are tickets escalating because the agent retrieved nothing useful, or because the policy genuinely required a human? The two failure modes need different fixes.
Error Analysis
Sample the queries where the reranker visibly fails and look for patterns. Common ones:
- Queries that span multiple intents ("cancel and refund") where the reranker latches onto one and misses the other.
- Queries phrased in customer language that does not appear in your docs.
- Queries about edge cases that your knowledge base does not cover at all - where the right answer is to escalate, not to retrieve harder.
Each of these has a different fix: better chunking, query rewriting, or fallback handling. Reranking is not a universal solvent.
Common Pitfalls
A few patterns show up repeatedly when teams ship reranking and do not see the wins they expected.
- First-pass recall is too low. If the right document is not in the top-50 candidates, no reranker can save you. Verify recall before tuning ranking.
- Chunks are too small or too big. Tiny chunks lose context; huge ones dilute relevance signals. Aim for chunks that are self-contained answers to plausible questions.
- The reranker was trained on a domain that does not match yours. A general-web reranker will underperform on dense legal text. Either fine-tune or pick a model trained on a closer domain.
- No evaluation set. Without offline metrics, you are flying blind. A few hundred labeled examples is cheap and pays for itself within a week.
- Ignoring latency. Cross-encoders running over 100 candidates can add 300ms or more to a request. For a support widget where the user is watching the typing indicator, that is noticeable. Profile early.
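Profiling early can be as light as wrapping the rerank call in a timer and looking at percentiles rather than averages. The helper names below are hypothetical, and the p95 uses the simple nearest-rank method.

```python
import statistics
import time


def time_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


def latency_summary(samples_ms):
    """Median and nearest-rank 95th percentile of a non-empty sample list."""
    ordered = sorted(samples_ms)
    # Integer ceiling of 0.95 * n, converted to a zero-based index.
    idx = (95 * len(ordered) + 99) // 100 - 1
    return {"p50": statistics.median(ordered), "p95": ordered[idx]}


# Example: time a placeholder workload, then summarize recorded samples.
total, elapsed_ms = time_call(sum, [1, 2, 3])
summary = latency_summary(list(range(1, 101)))
```

Tail latency is what users notice in a support widget, so the p95 is usually the number to budget against when choosing the candidate count.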
Where Reranking Is Headed
Three trends are shaping the next eighteen months.
Multimodal retrieval. Support knowledge increasingly includes screenshots, product photos, and short videos. Rerankers that can score across modalities - query text against an image-and-caption pair, for example - are moving from research demos into hosted APIs. Gemini 3.1's native multimodality and the video-input support shipping in Kimi K2.6 are accelerating this.
Instruction-driven ranking. Instead of a single relevance score, modern LLM rerankers can take a natural-language instruction ("prefer documents from the last 90 days, deprioritize community posts unless no official answer exists") and apply it on the fly. This collapses what used to be retraining work into prompt iteration.
Agentic retrieval loops. Models like Kimi K2.6 and GLM-5.1 are built to plan, retrieve, evaluate, and re-retrieve in a loop. Reranking inside that loop becomes a tool the agent calls on demand rather than a fixed pipeline stage. For complex support questions that span multiple sources, this changes the architecture from "search once, generate" to "search, read, refine, answer."
The teams that benefit most from these shifts are the ones who have already done the unglamorous work - clean chunking, evaluation harnesses, latency budgets - because every new capability slots into that foundation.
Getting It Right Without Building From Scratch
Reranking is one of the highest-leverage upgrades you can make to a RAG system. It is also one of the easiest to get subtly wrong, because the failure modes are quiet: an answer that looks fine but missed the better passage, a citation that points to a near-miss instead of the right doc.
If you are running an AI support agent, you do not have to assemble this stack yourself. Berrydesk lets you pick a model - GPT-5.5, Claude Opus 4.7, Gemini 3.1, DeepSeek V4, Kimi K2.6, GLM-5.1, Qwen3.6, MiniMax M2, and others - point it at your docs, websites, Notion, Google Drive, or YouTube content, and ship a branded agent in a few steps. Retrieval, chunking, and reranking are tuned underneath, and you can layer on AI Actions for booking, refunds, and payments without writing glue code.
If you would rather see it than read about it, start building on Berrydesk - it is free to try, and you can have a working agent in front of real traffic in an afternoon.
Chirag Asarpota is the founder of Strawberry Labs, the team behind Berrydesk - the AI agent platform that helps businesses deploy intelligent customer support, sales and operations agents across web, WhatsApp, Slack, Instagram, Discord and more. Chirag writes about agentic AI, frontier model selection, retrieval and 1M-token context strategy, AI Actions, and the engineering it takes to ship production-grade conversational AI that customers actually trust.



