
Retrieval-Augmented Generation is no longer an exotic architecture - it is the default way to ship a customer-facing AI agent that does not invent policy, misquote prices, or hallucinate features that do not exist. The recipe is conceptually simple: pair a retrieval system that surfaces the right snippets of your knowledge base with a generative model that turns those snippets into a sentence the customer actually wants to read. Getting the recipe right, however, is where most teams stumble.
This guide walks through the moving parts of a production RAG agent in May 2026, the pipeline stages that connect them, the model and infrastructure choices that have shifted with the latest wave of frontier and open-weight releases, and the operational details that separate a demo from a deployment that holds up under real ticket volume.
What RAG actually does - and why it still matters in 2026
A naked language model only knows what was in its training corpus, frozen at some cutoff date. Ask it about your refund policy, your latest pricing tier, or how a customer's specific subscription is configured, and it will either refuse or confabulate. RAG fixes that by injecting fresh, authoritative context into the model's prompt at query time. The model is no longer guessing - it is reading from your knowledge base before it answers.
A reasonable question, given that Gemini 3.1 Ultra now ships with a 2M-token context window and Claude Opus 4.7 / Sonnet 4.6 offer 1M tokens at no surcharge, is whether RAG is still necessary at all. Why not just stuff the entire knowledge base into the prompt? In practice, three reasons keep RAG firmly relevant. First, cost: paying for a million input tokens on every customer message gets expensive fast, even at the deflated 2026 rates of open-weight models like DeepSeek V4 Flash at $0.14 per million input tokens - at that rate, an 800K-token prompt costs about 11 cents per message, while a focused 4K-token RAG prompt costs well under a tenth of a cent. Second, latency: the larger the prompt, the longer the time-to-first-token, and customer support conversations are sensitive to perceived snappiness. Third, recall quality: long-context models are dramatically better than they used to be at attending to needles in a haystack, but 4K tokens of well-chosen context still beats 800K of mostly-irrelevant material on accuracy benchmarks. RAG has shifted from a hard requirement to a tuning lever - but it is a lever most production systems still need to pull.
The other reason RAG remains the right primitive: it is the natural place to enforce access control, freshness, and source attribution. When a billing agent answers a customer about their plan, you need to know which document the answer came from, whether that document is allowed to be shown to that user, and when it was last updated. None of those guarantees come for free from in-context-only setups.
The components of a RAG system
A working RAG pipeline is the sum of about eight parts. Each one has a defensible default, and each one is also a lever you can pull when something is going wrong.
Knowledge base
Your knowledge base is whatever surface area of your business the agent is allowed to read from: help-center articles, product documentation, internal runbooks, policy PDFs, Notion workspaces, Google Drive folders, support macros, FAQ databases, transcripts of past calls. The single most underrated factor in RAG quality is the curation and freshness of this corpus. A perfectly engineered pipeline on top of stale or contradictory documentation will dutifully answer with stale or contradictory information.
Berrydesk lets you skip the plumbing entirely here - point it at files, raw text, a public website, a Q&A list, a Notion workspace, a Google Drive folder, or even YouTube transcripts, and ingestion happens for you. For teams that prefer to roll their own, this is the stage where most of the hidden work lives.
Document loaders
Loaders pull raw bytes from each source and turn them into clean text. PDFs need OCR-aware extraction that preserves table structure. Web pages need a headless-browser fetch and boilerplate stripping so menus and footers do not pollute your index. Notion and Drive need authenticated API calls. Office documents - DOCX, XLSX - often need a normalization pass to strip revision markup, comments, and formatting noise that confuse downstream chunking.
The failure mode here is silent: garbage in the loader stage produces fluent-sounding but subtly wrong answers downstream, and those are notoriously hard to debug.
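To make the loading stage concrete, here is a minimal web-page loader sketch using requests and BeautifulSoup. The tag list stripped below is illustrative; a production loader adds JavaScript rendering, retries, and per-source authentication.

```python
import requests
from bs4 import BeautifulSoup

def load_web_page(url: str) -> str:
    """Fetch a page and return boilerplate-stripped text.

    A minimal sketch: real loaders also need headless-browser rendering
    for dynamic pages, retry logic, and per-source auth.
    """
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    # Drop navigation, footers, scripts, and other page chrome so menu
    # text does not pollute the index. The tag list is illustrative.
    for tag in soup(["nav", "header", "footer", "aside", "script", "style"]):
        tag.decompose()

    # Collapse whitespace into clean lines.
    lines = (line.strip() for line in soup.get_text(separator="\n").splitlines())
    return "\n".join(line for line in lines if line)
```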
Text splitter
LLMs read chunks, not whole documents. Splitting cuts the loaded text into pieces small enough to be precisely retrievable but large enough to carry context around any given fact. The classic strategies - fixed window, recursive character splitting, and structural splitting that respects headings and lists - are still the workhorses. Semantic chunking, which cuts on natural topic boundaries detected by an embedding model, has matured considerably and is now worth trying when straightforward chunking misses related ideas across paragraph breaks.
The chunk size sweet spot for support content is usually somewhere between 300 and 800 tokens, with overlap of 50 to 150 tokens so an answer that straddles a boundary is not severed. Tune it against your own evaluation set rather than copying a default.
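A minimal fixed-window splitter with overlap looks like the sketch below. It counts whitespace-separated words as a stand-in for tokens; swap in a real tokenizer (such as tiktoken) if you need exact token counts.

```python
def split_into_chunks(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Fixed-window splitting with overlap.

    Words are used as a rough proxy for tokens; replace text.split()
    with a real tokenizer to match your model's token counts.
    """
    words = text.split()
    step = chunk_size - overlap  # each window starts `step` words after the last
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break  # the last window already covers the tail of the document
    return chunks
```

The overlap is what saves answers that straddle a boundary: the last 100 words of one chunk reappear at the start of the next, so neither retrieval candidate is missing half the fact.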
Embedding model
Embeddings convert each chunk - and later each user query - into a vector that captures meaning, so that "How do I cancel my plan?" lands near a chunk titled "Subscription cancellation steps" even when there is no surface-word overlap. Strong defaults today include OpenAI's text-embedding-3-large, Cohere's multilingual embedders, Voyage models, and a long bench of open-source options from the BGE, E5, and GTE families on Hugging Face. The Qwen3 embedding family in particular has become a credible open default for multilingual support corpora.
For most support deployments, fine-tuning embeddings is overkill - the gains are real but small relative to the engineering cost. Spend that budget on better chunking, better evaluation, and a reranker before you spend it on training a custom embedder.
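To make the mechanics concrete, here is a sketch using the OpenAI Python SDK and plain cosine similarity. The sample chunks are invented, and any embedding provider slots in the same way.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
    return np.array([item.embedding for item in resp.data])

chunks = [
    "Subscription cancellation steps: go to Billing, then Cancel plan.",
    "Our office dog is named Waffles.",
]
chunk_vecs = embed(chunks)
query_vec = embed(["How do I cancel my plan?"])[0]

# Cosine similarity between the query and every chunk.
sims = chunk_vecs @ query_vec / (
    np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
)
print(chunks[int(np.argmax(sims))])  # the cancellation chunk wins
```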
Vector database
The vector store keeps every chunk's embedding alongside its source text and metadata, and answers the question "give me the top-K chunks closest to this query vector." Specialized stores like Pinecone, Weaviate, Qdrant, Milvus, and Chroma are all production-grade in 2026 and differentiate mostly on operational ergonomics, hybrid-search support, and filter-language richness.
If you already run Postgres, the pgvector extension is a perfectly respectable starting point and saves you a moving part - Supabase ships it natively, and most teams do not outgrow it until well past the million-chunk mark. SQLite users have analogous options like sqlite-vec. The honest answer is that the database is rarely the bottleneck; the retrieval logic on top of it is.
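A pgvector setup can stay remarkably small. The schema and query below are an illustrative sketch - the table and column names are made up - showing cosine-distance search with a tenant filter applied in plain SQL.

```python
import psycopg

# Hypothetical schema - table and column names are illustrative.
SCHEMA = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS chunks (
    id        bigserial PRIMARY KEY,
    tenant_id text NOT NULL,
    body      text NOT NULL,
    embedding vector(1536)
);
CREATE INDEX IF NOT EXISTS chunks_embedding_idx
    ON chunks USING hnsw (embedding vector_cosine_ops);
"""

def top_k(conn: psycopg.Connection, query_vec: list[float],
          tenant_id: str, k: int = 5) -> list[tuple]:
    # pgvector's <=> operator is cosine distance: smaller is closer.
    vec_literal = "[" + ",".join(str(x) for x in query_vec) + "]"
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT body, embedding <=> %s::vector AS distance
            FROM chunks
            WHERE tenant_id = %s            -- metadata filter in plain SQL
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (vec_literal, tenant_id, vec_literal, k),
        )
        return cur.fetchall()
```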
Retriever
The retriever is the orchestrator that takes a user question, embeds it, queries the vector store, and returns ranked chunks. The naive version is a top-K cosine similarity search. A production version layers on more: hybrid retrieval that blends vector similarity with traditional BM25 keyword scoring (so exact product names and SKUs do not get washed out), metadata filtering (so a query in a customer's account context only sees their tenant's data), query rewriting (so an ambiguous "how do I cancel?" gets resolved to "cancel subscription" using conversation history), and reranking with a cross-encoder that re-scores the top 30 candidates to surface the best 5.
For small corpora - say, under a few thousand short articles - pure full-text search can match or beat semantic search and saves you the embedding pipeline entirely. Do not skip this option just because vector search is fashionable.
This stage is where retrieval quality is won or lost, and it is the right place to invest engineering time.
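Reciprocal rank fusion is the usual way to blend the vector and keyword result lists. A self-contained sketch, with hypothetical document IDs:

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists from multiple retrievers (e.g. vector + BM25).

    Standard RRF: score(d) = sum over lists of 1 / (k + rank_d),
    with k=60 as the conventional constant.
    """
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Usage: IDs ranked by each retriever independently.
vector_hits  = ["doc7", "doc2", "doc9"]   # semantic similarity order
keyword_hits = ["doc2", "doc4", "doc7"]   # BM25 order - exact SKUs survive here
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
# doc2 and doc7 rise to the top because both retrievers agree on them.
```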
Generator (the LLM)
The generator takes the user's question and the retrieved chunks and produces the final answer. The 2026 menu is dramatically richer than it was even a year ago, and choosing well is now as much about cost and latency as it is about raw capability.
On the closed-frontier side, Claude Opus 4.7 leads SWE-bench Pro at 64.3% and is the strongest pick when the agent has to reason carefully through a long policy or stitch together multi-step tool calls. GPT-5.5 and GPT-5.5 Pro are the parallel-reasoning workhorses for complex escalations. Gemini 3.1 Ultra with its 2M-token context is the obvious choice when you want to throw an entire knowledge base in and skip retrieval altogether on certain queries; Gemini 3.1 Pro leads GPQA Diamond at 94.3% for technical-domain questions.
On the open-weight side, the cost story is what changes the deployment math. DeepSeek V4 Flash (284B / 13B active, 1M context, $0.14 / $0.28 per million tokens) handles routine support traffic at a fraction of a cent per resolution. MiniMax M2.7 (230B / 10B active) runs at roughly 8% of Claude Sonnet's price and twice its speed, and posts 56.22% on SWE-bench Pro. Z.ai's GLM-5.1 (754B MoE, MIT-licensed) hits 58.4% on SWE-bench Pro and is built for agentic plan-execute-test-fix loops up to eight hours long. Moonshot's Kimi K2.6 can run twelve-hour autonomous coding sessions and coordinate hundreds of sub-agents - overkill for "where's my order?" but a serious option for complex back-office support automation. Alibaba's Qwen3.6-27B is dense, Apache-licensed, and beats far larger MoE rivals on agentic coding benchmarks, which makes it a strong local-deploy pick. Xiaomi's MiMo-V2-Pro (>1T params, 42B active, 1M context, MIT-licensed) is the new entry to watch for reasoning-heavy support flows.
For regulated and air-gapped deployments - health, finance, government - the MIT and Apache licenses on GLM-5.1, Qwen3.6-27B, and MiMo make on-prem setups genuinely viable rather than aspirational.
The pragmatic pattern most production support teams now adopt is routed inference: send routine intents to a fast, cheap open-weight model (V4 Flash or M2.7), and only escalate hard cases - multi-document reasoning, ambiguous tickets, sensitive policy interpretations - to Claude Opus 4.7 or GPT-5.5 Pro. Berrydesk lets you select the model per agent or per intent, so you can tune this tradeoff without leaving the dashboard.
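A router does not have to be sophisticated to pay for itself. A hedged sketch - the model identifiers, intent labels, and threshold below are all placeholders for whatever your stack exposes:

```python
# Placeholder model IDs - substitute your provider's actual identifiers.
CHEAP_MODEL    = "deepseek-v4-flash"   # routine traffic
FRONTIER_MODEL = "claude-opus-4-7"     # hard escalations

# Hypothetical intent labels from your intent classifier.
ROUTINE_INTENTS = {"order_status", "password_reset", "shipping_info"}

def pick_model(intent: str, retrieval_confidence: float) -> str:
    # Escalate when the intent is unusual or retrieval looks shaky -
    # both are signals that the cheap model is likely to guess.
    if intent in ROUTINE_INTENTS and retrieval_confidence >= 0.75:
        return CHEAP_MODEL
    return FRONTIER_MODEL
```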
Orchestration framework
LangChain, LlamaIndex, and Haystack remain the standard toolkits for stitching loaders, splitters, embedders, retrievers, and generators together. LlamaIndex is still the most opinionated about RAG specifically and is the fastest path from zero to a working pipeline. LangChain's strength is the breadth of integrations and its agent abstractions when you want the LLM to do more than answer questions. Haystack is popular in regulated settings where pipeline-level configurability matters.
If you would rather not pick a framework at all, Berrydesk is the layer above all of this - you connect sources, choose a model, and the orchestration is handled for you.
How the pipeline actually runs
A RAG system has two distinct lifecycles: one offline, one online. Confusing them is the most common architectural mistake.
Stage 1: Indexing (offline, runs on ingest and on update)
This is the work that happens whenever a document is added or changes. Loaders pull raw text from each source. The splitter breaks each document into chunks, ideally preserving heading hierarchy and other structural cues as metadata. The embedding model converts each chunk into a vector. The vector store persists the vectors alongside the original text and any metadata you want to filter on later - tenant_id, product_area, last_updated, audience, source_url. A scheduled re-indexing job keeps the corpus current; for high-velocity content like changelogs or status pages, you want this measured in minutes, not days.
Cleanliness matters more than cleverness here. Stripping boilerplate, normalizing inconsistent terminology, deduplicating overlapping articles, and adding rich metadata at index time pay back many times over at retrieval time.
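Here is what a minimal indexing pass can look like with Chroma as the store. The document, metadata fields, and reliance on Chroma's built-in local embedder are illustrative assumptions; in production you would plug in the same embedding model you use at query time.

```python
import chromadb

# Persistent local store; Chroma embeds the documents with its default
# local model when no embeddings are supplied (an assumption here).
client = chromadb.PersistentClient(path="./rag_index")
collection = client.get_or_create_collection("support_docs")

def index_document(doc_id: str, chunks: list[str], meta: dict) -> None:
    collection.add(
        ids=[f"{doc_id}-{i}" for i in range(len(chunks))],
        documents=chunks,
        # The same metadata on every chunk so retrieval-time filters work.
        metadatas=[meta] * len(chunks),
    )

# Hypothetical document and metadata, matching the fields named above.
index_document(
    "refund-policy-v3",
    ["Refunds are available within 30 days of purchase...",
     "For annual plans, refunds are prorated..."],
    {"tenant_id": "acme", "product_area": "billing",
     "last_updated": "2026-05-01", "source_url": "https://example.com/refunds"},
)
```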
Stage 2: Retrieval and generation (online, runs per user message)
A customer message arrives. The system embeds the message - usually with the same model used for chunks, though some setups use a query-specific embedder. Optionally, a query rewriter expands or disambiguates the message using the prior conversation turns ("it" becomes "the Pro plan"). The retriever pulls candidate chunks from the vector store, optionally fuses them with BM25 results, and applies metadata filters so a user only sees content scoped to their tenant, locale, and entitlement.
A reranker (often a smaller cross-encoder like Cohere Rerank or an open-source equivalent) re-scores the top candidates and trims to the few that the LLM will actually see. The system assembles a prompt that combines a system instruction ("answer only from the provided context, cite the source, decline if unsure"), the retrieved chunks, the conversation history, and the user's question. The generator produces the response, ideally streaming tokens to the widget so the customer sees progress immediately. The final answer goes to the UI with citations linking back to the source documents.
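A sketch of the prompt-assembly step, using OpenAI-style chat messages - the chunk fields and company name are assumptions, but the shape is representative:

```python
SYSTEM_PROMPT = """You are a support agent for {company}.
Answer ONLY from the provided context. Cite sources as [n].
If the context does not contain the answer, say you don't know
and offer to connect the customer with a human."""

def build_messages(question: str, chunks: list[dict], history: list[dict]) -> list[dict]:
    # Each chunk dict is assumed to carry 'text', 'source_url', 'last_updated'.
    context = "\n\n".join(
        f"[{i + 1}] (updated {c['last_updated']}) {c['text']}\nSource: {c['source_url']}"
        for i, c in enumerate(chunks)
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT.format(company="Acme")},
        *history,  # prior turns, oldest first, summarized beyond a budget
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
```

Surfacing last_updated inside the context is a cheap way to let the model flag stale sources, a failure mode covered in the tips below.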
Done well, this whole loop returns its first token in under two seconds, even with a reranker in the path.
Tools and technologies, 2026 edition
A working snapshot of the ecosystem you can actually pick from today:
- Frameworks: Berrydesk (managed), LangChain, LlamaIndex, Haystack, Langflow.
- Vector databases: Pinecone, Weaviate, Chroma, Milvus, Qdrant, FAISS, pgvector on Postgres, Vertex AI Vector Search, Turbopuffer.
- Embedding models: OpenAI text-embedding-3-large, Cohere Embed v3, Voyage v3, BGE-M3, E5-Mistral, GTE-Qwen2, Qwen3 embeddings.
- LLMs: GPT-5.5 / 5.5 Pro, Claude Opus 4.7, Claude Sonnet 4.6, Gemini 3.1 Ultra / Pro, DeepSeek V4 Pro / V4 Flash, Kimi K2.6, GLM-5.1, Qwen3.6-Plus / 27B / 35B-A3B, MiniMax M2.7, Xiaomi MiMo-V2-Pro.
- UI layers (if you are rolling your own): Streamlit, Chainlit, Gradio, Vercel AI SDK, or - to skip the UI work - Berrydesk's branded widget.
Tips for shipping a RAG agent that actually works
The components are the easy part. The compounding details are what determine whether the agent gets praised or quietly disabled three weeks after launch.
Data quality is upstream of everything. Spend the first week of any RAG project on the corpus, not the pipeline. Audit for stale articles, contradictions between documents, missing edge-case coverage, and product names that drift across sources. A good rule of thumb: if a new human support hire would be confused by your knowledge base, your AI agent will be too.
Chunking is a hyperparameter, not a default. Test at least three chunking strategies - fixed windows, recursive splitting on document structure, and semantic chunking - against a held-out evaluation set of real support questions. Differences of 5–10 percentage points in retrieval recall are common.
Use hybrid search by default. Pure semantic search misses exact-match cases - order numbers, error codes, product SKUs - and pure keyword search misses paraphrases. Reciprocal rank fusion of the two is cheap and almost always wins.
Add a reranker before you tune anything else. A cross-encoder reranker on the top 30 candidates is the highest-ROI single addition to a basic RAG system. It is the difference between "frustratingly close" and "actually right."
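A sketch using an open-source cross-encoder from sentence-transformers - the checkpoint named here is a widely used public example, not a specific recommendation:

```python
from sentence_transformers import CrossEncoder

# A small public cross-encoder checkpoint, used here for illustration.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], keep: int = 5) -> list[str]:
    # Score every (query, candidate) pair jointly - far more precise than
    # comparing two independently computed embeddings, and cheap at
    # top-30 scale.
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [text for text, _ in ranked[:keep]]
```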
Engineer the prompt like a product surface, not a string. The system prompt is where you encode tone, formatting rules, citation requirements, refusal behavior, and escalation triggers. Version it, test it, and treat changes the way you treat code changes.
Manage conversation history carefully. A bare RAG loop has no memory. Real support conversations have follow-ups, clarifications, and topic shifts. Summarize older turns, keep the most recent verbatim, and re-embed the full intent - not just the last message - when retrieving context.
Build a real evaluation harness. A set of 50 to 200 representative customer questions with reviewed ideal answers, scored on retrieval recall and answer correctness, is enough to catch most regressions. Run it on every prompt and model change. "It seemed better in the demo" is not a release criterion.
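Retrieval recall is the easiest of these metrics to automate. A minimal sketch, assuming each eval item records the gold chunk ID it should retrieve:

```python
from typing import Callable

def retrieval_recall_at_k(eval_set: list[dict],
                          retrieve: Callable[[str], list[str]],
                          k: int = 5) -> float:
    """Fraction of questions whose gold chunk appears in the top-k results.

    Assumes each item looks like {"question": ..., "gold_chunk_id": ...}
    and that `retrieve` is your pipeline's function returning ranked
    chunk IDs - both are conventions, not a fixed schema.
    """
    hits = sum(
        item["gold_chunk_id"] in retrieve(item["question"])[:k]
        for item in eval_set
    )
    return hits / len(eval_set)
```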
Respect access control end to end. Metadata filters at retrieval time must mirror your application's permission model. The fastest way to a security incident is an agent that cheerfully recites internal documents to the wrong tenant because a filter was missing.
Watch for the common failure modes. Hallucination when retrieval comes up empty (fix: detect low retrieval scores and have the model say "I don't know" rather than guess). Confident answers from out-of-date chunks (fix: surface last_updated in the prompt and let the model flag staleness). Over-indexing on the most-retrieved chunks (fix: diversify retrieval with maximal marginal relevance). Each of these has a known mitigation; none of them go away on their own.
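The empty-retrieval guard is only a few lines. A sketch - the threshold is illustrative and must be tuned against your own evaluation set:

```python
MIN_SCORE = 0.35  # illustrative - tune against your evaluation set

def gate_retrieval(scored_chunks: list[tuple[str, float]]) -> list[str]:
    """Return usable chunk texts, or [] to signal 'decline and escalate'."""
    confident = [text for text, score in scored_chunks if score >= MIN_SCORE]
    # An empty list tells the caller to respond "I don't know" (or hand
    # off to a human) instead of letting the model improvise an answer
    # from its training data.
    return confident
```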
Plan for cost from day one. With routed inference between cheap open-weight models for routine traffic and frontier models for escalations, a well-tuned support agent in 2026 costs cents per resolution, not dollars. But that only happens if you measure cost per intent and route accordingly. Logging input/output token counts per turn is non-negotiable.
When to build and when to buy
If your team has ML engineers, an evaluation culture, and a need for unusual control over the pipeline - bespoke retrievers, custom rerankers, complex multi-hop reasoning across structured and unstructured data - building from LlamaIndex or LangChain is a perfectly reasonable path. You will own every component, which is both the cost and the point.
If your goal is to ship a branded support agent that handles the bulk of customer questions accurately, escalates the rest cleanly, takes booking and payment actions through tool use, and is deployed to a website, Slack, Discord, and WhatsApp - the build does not need to be from scratch. That is exactly the territory Berrydesk covers: connect your sources, choose your model from across the closed and open-weight frontier, customize the widget, wire up AI Actions, and ship. The RAG plumbing is already built and tuned underneath.
Either way, the underlying architecture is the same - knowledge base, loader, splitter, embedder, vector store, retriever, generator, orchestrator. The 2026 difference is that the model choices are richer, the costs are lower, the context windows are longer, and the line between "demo" and "production" is much shorter than it used to be.
Want to skip the pipeline work and have a RAG-powered support agent live this afternoon? Start building on Berrydesk - pick your model, point it at your docs, and go.
Skip the plumbing - launch a RAG agent in minutes
- Connect docs, websites, Notion, Drive, or YouTube and Berrydesk handles chunking, embeddings, and retrieval for you.
- Pick from GPT-5.5, Claude Opus 4.7, Gemini 3.1, DeepSeek V4, Kimi K2.6, GLM-5.1, Qwen3.6, MiniMax M2.7, and more.
Set up in minutes
Chirag Asarpota is the founder of Strawberry Labs, the team behind Berrydesk - the AI agent platform that helps businesses deploy intelligent customer support, sales and operations agents across web, WhatsApp, Slack, Instagram, Discord and more. Chirag writes about agentic AI, frontier model selection, retrieval and 1M-token context strategy, AI Actions, and the engineering it takes to ship production-grade conversational AI that customers actually trust.



