RAG from First Principles: Build a Working Retrieval...

Retrieval-augmented generation has become one of those terms that everyone uses and almost no one defines the same way twice. It is also the single most important pattern shaping how teams ship reliable AI features in 2026 - especially in customer support, where the cost of a confidently wrong answer is a refund, a churned account, or a regulatory headache.

This post walks through RAG from the bottom up. We'll define it without jargon, build the smallest version that actually does something useful, see where the naive version breaks, then plug in a modern frontier model to round it off. By the end you'll have a mental model crisp enough to debug a real system, and a tiny working pipeline you can extend.

What RAG actually is

Retrieval-augmented generation is a two-stage trick: first you find relevant text in a corpus you trust, then you hand that text - along with the user's question - to a language model and ask it to write the answer. The "retrieval" half supplies facts. The "generation" half supplies fluency. Neither half is doing the other half's job.

That is the whole idea, and the rest of any RAG conversation is just engineering choices on top of it: how you chunk the documents, how you embed them, which similarity metric you use, how you rank, how you stuff the context window, which model you generate with, and how you guard the seams where things go wrong.

RAG is not a model, and it is not something you train. It is a runtime pattern that wraps a model. You can swap the model, swap the index, swap the retriever, and still call it RAG. That flexibility is the reason it has outlasted every other pattern people tried for "ground my AI in my data."

Why RAG still matters when models have million-token windows

A reasonable question in 2026 is whether RAG is still necessary. Claude Opus 4.6 and Sonnet 4.6 ship with a 1M-token context window at no surcharge. Gemini 3.1 Ultra goes to 2M. DeepSeek V4 Flash and Pro both clock in at 1M. If you can paste your entire knowledge base into the prompt, do you need a retriever at all?

The honest answer: sometimes no, often yes.

Long context windows make small or medium knowledge bases trivially solvable. If your support corpus is 200,000 tokens of help articles, you can plausibly stuff it into Sonnet 4.6 every turn and skip the retriever entirely. Latency goes up. Cost goes up. But it works, and the engineering is dead simple.

For most production support deployments, though, retrieval still wins on three axes. Cost: re-sending a million tokens of policy documentation on every conversation turn is wasteful, even at the open-weight prices DeepSeek V4 Flash offers ($0.14 per million input tokens). Quality: long-context recall is real but uneven, and packing a context with mostly-irrelevant text dilutes attention on the parts that actually matter. Freshness: you want updates to your help center to be visible to the agent in seconds, not after a redeploy.
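To put rough numbers on the cost axis, here is a back-of-the-envelope sketch. The only figure taken from this post is the $0.14 per million input tokens price; the 1M-token stuffed prompt, the 2,000-token retrieved prompt, and the 10,000 turns per day are illustrative assumptions, and output tokens are ignored.

```python
# Back-of-the-envelope input cost: full-corpus stuffing vs. retrieval.
# PRICE_PER_M_INPUT is the price quoted above; the workload numbers are assumed.
PRICE_PER_M_INPUT = 0.14  # USD per million input tokens


def input_cost(tokens_per_turn: int, turns: int) -> float:
    """USD spent on input tokens for `turns` turns of `tokens_per_turn` each."""
    return tokens_per_turn * turns * PRICE_PER_M_INPUT / 1_000_000


stuffed = input_cost(1_000_000, 10_000)   # whole corpus in every prompt
retrieved = input_cost(2_000, 10_000)     # a few retrieved chunks per prompt
print(f"stuffed:   ${stuffed:,.2f} per day")
print(f"retrieved: ${retrieved:,.2f} per day")
```

Under these assumptions the stuffed prompt costs about $1,400 a day in input tokens against about $2.80 for the retrieved one, a 500x gap that comes entirely from prompt size.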

Treat long context as a tuning lever, not a replacement. Use it to be lazy where laziness is fine - small corpora, narrow scopes, prototyping. Use retrieval where scale, cost, or freshness force discipline.

The anatomy of a RAG system

Three pieces. That is genuinely it.

The corpus

The corpus is the set of documents you want your agent to draw from. For a support bot at an e-commerce company, it is the help center, the returns policy, the shipping FAQ, the product manuals, maybe internal Notion runbooks for the team. For a clinical assistant it might be treatment guidelines and a slice of PubMed. For an internal HR bot it is the employee handbook and benefits documents.

The unglamorous truth is that most "RAG quality problems" are corpus quality problems. Stale documentation, conflicting versions of the same policy, three different return windows scattered across four pages - none of these are fixed by a better retriever or a smarter LLM. They are fixed by an editor. Treat your corpus the way a librarian treats their collection: cull, version, cross-reference. The agent can only be as accurate as the source.

A reasonable starting corpus for a returns bot might look like a dozen short policy snippets - return window, refund timing, exchange rules, packaging requirements, shipping responsibility, clearance items, and so on. A reasonable production corpus for the same bot is a hundred to a few thousand chunks pulled from your help center, refreshed nightly.

The retriever

The retriever's job is, given a user query, to fetch the small handful of chunks from the corpus most likely to contain the answer. There is a long ladder of techniques here.

At the bottom is lexical overlap - count how many words the query and the document share. The classical version is Jaccard similarity: take the set of words in each, divide the size of the intersection by the size of the union. It is fast, it is interpretable, and it falls apart the moment the user uses a synonym you didn't anticipate. "Refund schedule" and "when do I get my money back" share almost no tokens, but they mean the same thing.

A step up is TF-IDF or BM25, which weights rarer words more heavily and handles term frequency more gracefully. This is what classical search engines used for decades, and it is genuinely strong for keyword-heavy queries.
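A minimal sketch of BM25 scoring in plain Python, assuming whitespace tokenization and the commonly used defaults k1=1.5 and b=0.75. The three-document corpus is made up for illustration, and a production system would normally lean on a search library rather than hand-rolled scoring.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    return text.lower().split()


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every doc against the query with the Okapi BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many docs does each term appear?
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Rare terms get a larger IDF weight.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates (k1) and is normalized by doc length (b).
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "refunds are processed within 5 business days",
    "returns are accepted within 30 days of purchase",
    "clearance items are final sale",
]
scores = bm25_scores("are refunds", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

The word "are" appears in every document, so it contributes very little; "refunds" appears in only one, so it dominates the score. That is the rare-word weighting described above.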

The modern default is dense vector embeddings: pass each chunk through an embedding model, store the resulting vectors in an index, and at query time embed the user's question and find the nearest neighbors by cosine similarity. This is what powers most production RAG today because it captures meaning rather than literal token overlap. "Refund schedule" and "when do I get my money back" land near each other in vector space.
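The nearest-neighbor step can be sketched without any embedding model. In the example below the three-dimensional vectors are hand-made stand-ins for real embeddings (real ones have hundreds or thousands of dimensions), and the chunk texts and the `nearest` helper are illustrative assumptions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical vectors; a real index would store embedding-model output.
index = {
    "Refunds are processed within 5-7 business days.": [0.9, 0.1, 0.0],
    "Clearance items are final sale.": [0.1, 0.9, 0.1],
    "Customers pay return shipping unless the item is defective.": [0.2, 0.2, 0.9],
}


def nearest(query_vec: list[float], k: int = 1) -> list[str]:
    """Return the k chunks whose vectors are most similar to the query vector."""
    ranked = sorted(index, key=lambda chunk: cosine(query_vec, index[chunk]), reverse=True)
    return ranked[:k]


# Pretend this is the embedding of "when do I get my money back".
query_vec = [0.8, 0.2, 0.1]
print(nearest(query_vec))
```

The query shares no words with the refund chunk, but its vector points in nearly the same direction, so that chunk ranks first.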

The serious systems use hybrid retrieval - run BM25 and dense retrieval in parallel, then re-rank the union with a cross-encoder. This catches the cases where keyword search shines (rare product names, error codes, SKUs) and the cases where semantic search shines (paraphrased questions).
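One simple way to merge the two result lists is reciprocal rank fusion (RRF). To be clear, this is a lightweight stand-in for illustration, not the cross-encoder re-ranking described above; the chunk ids are hypothetical, and k=60 is a commonly used constant assumed here.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk ids; earlier ranks contribute more."""
    fused: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            fused[chunk_id] = fused.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=lambda chunk_id: fused[chunk_id], reverse=True)


bm25_ranking = ["sku-faq", "returns", "shipping"]         # hypothetical keyword results
dense_ranking = ["returns", "refund-timing", "sku-faq"]   # hypothetical embedding results
print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))
```

Chunks that both retrievers agree on float to the top, and the fused list is what you would hand to a re-ranker.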

For a first build, lexical similarity is fine. You will feel its limits within an hour, and that is the point - the limits will tell you what to build next.

The generator

Once the retriever has the top few chunks, you hand them to a language model along with the user's question and ask for an answer. The model's job is synthesis: read the chunks, identify what addresses the user's intent, write a response in your brand's voice, and cite or quote where appropriate.

The 2026 model landscape gives you a wide menu here. For high-stakes or genuinely complex synthesis - multi-document reasoning, edge cases, anything where a wrong answer is expensive - Claude Opus 4.7 leads SWE-bench Pro at 64.3% and is exceptionally good at staying grounded in provided context. GPT-5.5 Pro brings parallel reasoning and is a strong fit for tickets that need step-by-step diagnosis. Gemini 3.1 Ultra is the right pick when the user's input includes a screenshot, video, or audio clip.

For routine traffic - the 80% of tickets that are "where is my order" and "how do I reset my password" - the open-weight frontier has reset the cost equation. DeepSeek V4 Flash at $0.14 / $0.28 per million input/output tokens makes it cheaper to answer a ticket than to log it. MiniMax M2.7 hits 56.22% on SWE-bench Pro at roughly 8% of the price of Claude Sonnet, at twice the speed. GLM-5.1 from Z.ai actually beats GPT-5.4 and Claude Opus 4.6 on SWE-bench Pro and ships under the MIT license, which means you can run it on-prem if you're regulated.

The right architecture is rarely a single model. Route routine traffic to a cheap open-weight model, reserve the frontier for escalations and edge cases, and you get Opus-level quality on the hard tickets while paying DeepSeek prices for the routine majority.
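A router can start as a few lines of heuristics. The sketch below is one possible shape, not a recommended policy: the `Ticket` fields, the escalation keywords, the 0.5 threshold, and the two tier labels are all assumptions to replace with your own signals and model ids.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    text: str
    top_score: float  # retriever's best similarity score, assumed to be in 0..1


# Assumed tier labels; map them to whatever model ids your provider exposes.
CHEAP_MODEL = "cheap-open-weight"
FRONTIER_MODEL = "frontier"

# Illustrative keywords that suggest a ticket is high-stakes.
ESCALATION_HINTS = ("chargeback", "legal", "lawyer", "complaint")


def pick_model(ticket: Ticket, min_score: float = 0.5) -> str:
    """Send routine tickets to the cheap tier and everything else to the frontier tier."""
    text = ticket.text.lower()
    if any(hint in text for hint in ESCALATION_HINTS):
        return FRONTIER_MODEL
    if ticket.top_score < min_score:
        # Weak retrieval means the answer likely needs more careful synthesis.
        return FRONTIER_MODEL
    return CHEAP_MODEL


print(pick_model(Ticket("Where is my order?", top_score=0.8)))
print(pick_model(Ticket("I will file a chargeback", top_score=0.9)))
```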

Building a minimal RAG in Python

Here is a working, deliberately simple implementation. No vector store. No embedding model. No LLM (yet). The goal is to feel the bones of the pattern.

Step 1: define the corpus

We will build a returns-policy bot. The corpus is a list of policy statements, one per chunk.

return_policy_docs = [
    "Our standard return policy allows returns within 30 days of purchase with proof of receipt.",
    "Refunds are processed within 5–7 business days after the returned item is received.",
    "We offer exchanges on defective or damaged products within 14 days of receipt.",
    "For any return, the product must be in its original packaging and unused.",
    "Customers are responsible for return shipping fees unless the product is defective.",
    "Clearance items are final sale and not eligible for return or exchange.",
    "You can initiate a return by contacting our customer support team.",
    "Returns due to change of mind are accepted only if the item is unopened.",
    "Please include the original receipt and a brief reason for return in your package.",
    "Gift returns can be exchanged for store credit within 30 days of purchase.",
]

Real corpora are larger and chunked from longer documents, but the principle is identical: each chunk is small, self-contained, and meaningful on its own.

Step 2: a retrieval function

We'll start with Jaccard similarity. Lowercase, split on whitespace, compare sets.

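def jaccard(query: str, doc: str) -> float:
    """Jaccard similarity of two texts' word sets: |intersection| / |union|."""
    # Naive tokenizer: punctuation stays attached, so "defective?" != "defective".
    q = set(query.lower().split())
    d = set(doc.lower().split())
    if not q or not d:
        return 0.0  # empty input scores zero and avoids 0/0
    return len(q & d) / len(q | d)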

Then a function that scores every chunk and returns the best match:

def best_match(query: str, docs: list[str]) -> str:
    # Pick the chunk with the highest score; ties go to the earliest chunk.
    return max(docs, key=lambda d: jaccard(query, d))

Step 3: ask a question

user_query = "Can I return a product if it's defective?"
print(best_match(user_query, return_policy_docs))

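Output:

You can initiate a return by contacting our customer support team.

That is the retrieval half of RAG, in roughly fifteen lines. No infrastructure. No API key. Notice which chunk won, though. It is on-topic, but it is not the best answer: it scored highest because it shares "can", "a", and "return" with the query. The chunks that actually mention defects lost out because whitespace splitting leaves punctuation attached, so the query token "defective?" never matches "defective" or "defective." in the corpus. You could show this to a user as a serviceable pointer, but the limits of lexical matching are already visible.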

Where naive retrieval falls apart

Try this query:

print(best_match("I do NOT want to keep my unopened item", return_policy_docs))

With this corpus you get back the change-of-mind policy, but for a thin reason: the only shared tokens are "to" and "item" (even "unopened" misses, because the chunk's token is "unopened." with a trailing period). Change the query to "I do want to keep my unopened item" and the same chunk still wins. The deeper issue is that Jaccard barely distinguishes "I want to return X" from "I do not want to return X." It counts overlap, not meaning.

A few specific failure modes worth knowing:

Synonyms. "Refund timing" never matches "when will I get my money back" because the words don't intersect.
Negation. "Can I exchange a clearance item?" and "Clearance items are not eligible for exchange" share a lot of tokens, including the answer-relevant ones, but the negation flips the meaning.
Multi-step questions. "I bought a defective gift more than 30 days ago" needs three policy snippets stitched together. A single best-match retriever returns one.
Long-tail vocabulary. The user types "RMA" or a product SKU. If the document uses "return merchandise authorization" or the product name spelled out, lexical match misses.

The fixes are well understood. Switch to embeddings to handle synonyms. Add a re-ranker to handle negation and intent. Retrieve the top k chunks rather than the top one, so the generator can stitch. Combine BM25 with dense retrieval to catch the long tail.
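Retrieving the top k chunks instead of the single best one is a small change to the earlier retrieval function. A minimal sketch, where `top_k` and the three-chunk corpus are illustrative and the scorer is the same naive Jaccard (restated so the block runs on its own):

```python
def jaccard(query: str, doc: str) -> float:
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q and d else 0.0


def top_k(query: str, docs: list[str], k: int = 3) -> list[str]:
    """Return the k highest-scoring chunks, best first."""
    return sorted(docs, key=lambda d: jaccard(query, d), reverse=True)[:k]


docs = [
    "gift returns can be exchanged for store credit",
    "we offer exchanges on defective products",
    "clearance items are final sale",
]
for chunk in top_k("defective gift", docs, k=2):
    print(chunk)
```

Both the defect chunk and the gift chunk come back, which is what a generator needs to answer a question that spans two policies.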

But the most impactful upgrade is usually the next one - adding a generation step.

Adding the generation step

Once the retriever returns chunks, you pass them to a language model with a prompt that tells the model what to do. This is where a brittle exact-match system turns into something that feels like a conversation.

Picking a model

For a support agent in 2026, the practical choices look like this:

Routine traffic. DeepSeek V4 Flash, MiniMax M2.7, or Qwen3.6-27B are all strong, cheap, and fast. For a typical support ticket - order status, password reset, simple policy lookup - any of them produces an answer indistinguishable from a frontier model's at a fraction of the cost.
Complex synthesis. Claude Opus 4.7 is the gold standard when you need the model to reason carefully across multiple retrieved chunks, follow nuanced instructions, and stay tightly grounded. Its lead on SWE-bench Pro (64.3%) reflects exactly the kind of careful, long-horizon reasoning that translates to multi-document support questions.
Multimodal tickets. Gemini 3.1 Ultra (2M context, native multimodal across text, image, audio, video) is the right pick when users send screenshots or short videos showing what's wrong.
Agentic actions. If the agent has to do more than answer - book a slot, issue a refund, look up an order in your back-office system - Kimi K2.6, GLM-5.1, and Claude Opus 4.7 are the most reliable tool-callers. K2.6 in particular is purpose-built for long agentic sessions, with native video input and the ability to coordinate hundreds of sub-agents on a single task.
Air-gapped / regulated. GLM-5.1 (MIT license), Qwen3.6-27B (Apache 2.0), and Xiaomi MiMo-V2-Pro (open weights, MIT) are the standout open-weight options for industries that need on-prem deployment.

Prompt engineering for RAG

The prompt you wrap around the retrieved chunks does more for quality than people expect. A reasonable starting template:

You are a customer support assistant for {company}.
Answer the user's question using ONLY the information in the policy
excerpts below. If the excerpts don't contain the answer, say you
don't know and offer to connect them to a human.

POLICY EXCERPTS:
{retrieved_chunks}

USER QUESTION:
{query}

Write your answer in 2–4 sentences, in a friendly but precise tone.
Do not invent details not present in the excerpts.

Three things this prompt does that a sloppy version doesn't. It scopes the model to the retrieved evidence ("ONLY"). It gives the model a clean escape hatch when the corpus doesn't cover the question, which dramatically reduces hallucinations. And it constrains tone and length so output stays consistent across thousands of tickets.
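Filling the template is plain string formatting. A minimal sketch, where `build_prompt`, the numbered-excerpt format, and the "Acme" company name are illustrative choices rather than part of the template itself:

```python
PROMPT_TEMPLATE = """You are a customer support assistant for {company}.
Answer the user's question using ONLY the information in the policy
excerpts below. If the excerpts don't contain the answer, say you
don't know and offer to connect them to a human.

POLICY EXCERPTS:
{retrieved_chunks}

USER QUESTION:
{query}

Write your answer in 2–4 sentences, in a friendly but precise tone.
Do not invent details not present in the excerpts."""


def build_prompt(company: str, chunks: list[str], query: str) -> str:
    """Fill the template with numbered excerpts and the user's question."""
    # Numbering the excerpts makes it easier to trace an answer back to a chunk.
    numbered = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return PROMPT_TEMPLATE.format(company=company, retrieved_chunks=numbered, query=query)


prompt = build_prompt(
    "Acme",
    ["Clearance items are final sale and not eligible for return or exchange."],
    "Can I exchange a clearance item?",
)
print(prompt)
```

The resulting string is what you send to whichever model you picked; the call itself depends on your provider's SDK and is left out here.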

Handling multi-document synthesis

Retrieve the top k chunks, not just the top one - three to five is a good default. Pass all of them in. Modern frontier models are excellent at reading several short passages and producing a coherent answer that draws from each.

For trickier compound questions ("I bought a defective gift more than 30 days ago, what are my options?"), you can prompt the model to think step-by-step before answering, or split the question into sub-questions and retrieve for each. This is where agentic models like Claude Opus 4.7 or Kimi K2.6 start to pay for themselves - they will plan their own retrieval calls if you give them the tools.

Common pitfalls in production RAG

Things that don't show up in tutorials but do show up in incident reviews.

Stale corpora. The most common production failure is that someone updated the help center, but the embedding index didn't get rebuilt. The agent confidently quotes last quarter's return window. Fix: trigger re-indexing from your CMS publish flow instead of waiting for a nightly cron.
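One way to make a publish hook cheap is to re-embed only the documents whose content changed. A sketch under stated assumptions: documents are keyed by an id, the index keeps a content hash per id, and the `stale_docs` helper is hypothetical, not part of any particular CMS or vector store.

```python
import hashlib


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def stale_docs(published: dict[str, str], indexed_hashes: dict[str, str]) -> list[str]:
    """Ids of docs whose current text differs from what was last indexed."""
    return sorted(
        doc_id
        for doc_id, text in published.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )


published = {
    "returns": "Returns are accepted within 30 days.",
    "shipping": "Orders ship within 2 business days.",
}
indexed_hashes = {
    "returns": content_hash("Returns are accepted within 14 days."),  # old version
    "shipping": content_hash("Orders ship within 2 business days."),
}
print(stale_docs(published, indexed_hashes))
```

Only the returns page needs re-embedding here; a document missing from the index entirely would also be reported.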

Chunk boundaries that split the answer in half. If your chunker breaks a paragraph mid-sentence, neither half scores well, and the retriever misses both. Fix: chunk on semantic boundaries (sections, paragraphs) and use overlap.
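A minimal sketch of paragraph-aware chunking with overlap. It assumes paragraphs are separated by blank lines and treats `max_chars` as a soft limit (a single oversized paragraph, or a carried-over paragraph plus a new one, can exceed it); the function name and defaults are illustrative.

```python
def chunk_paragraphs(text: str, max_chars: int = 500, overlap: int = 1) -> list[str]:
    """Group whole paragraphs into chunks, repeating the last `overlap` paragraphs."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for para in paragraphs:
        candidate = current + [para]
        if current and len("\n\n".join(candidate)) > max_chars:
            chunks.append("\n\n".join(current))
            # Carry the tail of the finished chunk forward as overlap.
            current = current[-overlap:] if overlap else []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


text = "A" * 10 + "\n\n" + "B" * 10 + "\n\n" + "C" * 10
for chunk in chunk_paragraphs(text, max_chars=25, overlap=1):
    print(repr(chunk))
```

The middle paragraph appears in both chunks, so an answer that straddles the boundary is still retrievable from at least one of them.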

The corpus contradicts itself. The shipping FAQ says 30 days, the returns policy page says 14, and a buried Notion doc says "we'll handle case by case." The retriever will find all three. The generator will pick one - usually the first - and you have a bug. Fix: an editorial pass before the corpus ever hits production.

Over-trusting similarity scores. A high Jaccard or cosine score doesn't mean the chunk answers the question. Always include an "if the excerpts don't contain the answer, say so" instruction in the prompt and measure how often the model uses it. If it never abstains, it is probably answering beyond the excerpts; if it abstains too often, your retriever is probably missing the relevant chunks.
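Measuring how often the model takes the escape hatch can be as simple as scanning logged answers. The marker phrases below are assumptions about how your model words an abstention; a sturdier setup would have the model emit an explicit flag instead of relying on string matching.

```python
# Assumed phrasings; adjust to match how your prompt asks the model to abstain.
ABSTAIN_MARKERS = ("i don't know", "connect you to a human")


def abstention_rate(answers: list[str]) -> float:
    """Fraction of answers that contain one of the abstention markers."""
    if not answers:
        return 0.0
    abstained = sum(
        any(marker in answer.lower() for marker in ABSTAIN_MARKERS)
        for answer in answers
    )
    return abstained / len(answers)


logged_answers = [
    "Refunds are processed within 5-7 business days.",
    "I don't know, but I can connect you to a human.",
    "Clearance items are final sale.",
    "Returns are accepted within 30 days with proof of receipt.",
]
print(abstention_rate(logged_answers))
```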

No human escape hatch. Even a perfect RAG system will eventually hit a ticket it can't handle. Build the handoff path before you launch, not after the first angry tweet.

Long context vs RAG: when to pick which

A practical decision matrix for 2026:

Corpus under ~200K tokens, low traffic, prototype phase: skip retrieval. Stuff it all into Sonnet 4.6's context. Move on.
Corpus over a million tokens, or any corpus where you want sub-second latency: classical RAG with embeddings and a vector store.
Anywhere in between: hybrid. Retrieve aggressively (top 20–30 chunks), then let a long-context model do the final synthesis without aggressive trimming.
Regulated, air-gapped, or sovereignty-sensitive deployments: RAG with on-prem dense retrieval and an open-weight generator. GLM-5.1 or Qwen3.6-27B as the generator, your own embedding model, your own vector store. The MIT/Apache licensing on the Chinese open-weight frontier makes this viable for the first time at frontier quality.

The right answer is usually less interesting than the question. Most teams pick one and stick with it. The teams that win pick per-route.
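Picking per-route can start as a small function. The sketch below encodes the matrix above literally; the 200K and 1M thresholds come from it, while the function name, parameters, and returned labels are illustrative.

```python
def pick_strategy(
    corpus_tokens: int,
    needs_subsecond: bool = False,
    regulated: bool = False,
    prototype: bool = False,
) -> str:
    """Map a route's constraints to one of the four options in the matrix."""
    if regulated:
        return "on-prem RAG with an open-weight generator"
    if corpus_tokens > 1_000_000 or needs_subsecond:
        return "classical RAG"
    if corpus_tokens < 200_000 and prototype:
        return "long context, no retrieval"
    return "hybrid: wide retrieval, long-context synthesis"


print(pick_strategy(150_000, prototype=True))
print(pick_strategy(600_000))
print(pick_strategy(5_000_000))
```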

From toy pipeline to production agent

The fifteen-line example above is not a product. It's a teaching tool. A real support agent - the kind you'd actually deploy in front of customers - needs an embedding model and a vector index; chunked and versioned ingestion from your help center, Notion, and Drive; a re-ranker for quality; prompt templates per ticket type; tool-calling for actions like refunds and order lookups; a confidence threshold and a human-handoff path; conversation memory; branding; channel deployment to your website, Slack, Discord, and WhatsApp; and observability so you can see where the agent is failing.

That is the part Berrydesk handles. You pick a model: GPT-5.5, Claude Opus 4.7, Gemini 3.1, DeepSeek V4, Kimi K2.6, GLM-5.1, Qwen, MiniMax, and others are all available. You point us at your docs, websites, Notion, Google Drive, or YouTube and we handle ingestion, chunking, indexing, and retrieval. You brand the chat widget. You add AI Actions for booking, payments, and back-office lookups. You deploy to your website, Slack, Discord, WhatsApp, and other channels.

Build the toy version yourself first. It's the fastest way to understand what's actually happening when a question comes in. Then, when you're ready to ship something customers will rely on, start your Berrydesk agent - we've already built the parts you don't want to.
