
The word "chatbot" gets used for almost every piece of software that talks back to a customer, which is part of the problem. A scripted reservation flow on a restaurant website and a fully agentic support copilot running on Claude Opus 4.7 are both technically chatbots, but they live on opposite ends of a very long spectrum. Treating them as the same category is how teams end up spending months on the wrong tool.
The category has also shifted faster than most buyers' mental models. The "AI chatbot" that felt cutting-edge two years ago - a GPT-3.5 wrapper with a vector store - is closer to the bottom of today's market than the top. The 2026 frontier is something else: million-token context windows, agentic tool use across hundreds of steps, and open-weight models from DeepSeek, Z.ai, Moonshot, MiniMax and Alibaba that have collapsed the cost of running serious automation.
So before you commit a quarter to "rolling out a chatbot," it is worth being precise about which kind. This guide walks through the seven main archetypes, what each is genuinely good at in 2026, where each one breaks, and how to match the right pattern to the work you are actually trying to automate.
1. AI Agents (the modern frontier)
Sitting at the top of the spectrum are the systems most people now mean when they say "AI chatbot," though the more honest label is AI agent. These run on a frontier large language model - GPT-5.5 or 5.5 Pro from OpenAI, Claude Opus 4.7 or Sonnet 4.6 from Anthropic, Gemini 3.1 Ultra or Pro from Google, or one of the open-weight leaders like DeepSeek V4, Kimi K2.6, GLM-5.1, Qwen 3.6, MiMo-V2-Pro or MiniMax M2.7. They reason over context, draw on a knowledge base, and increasingly take actions on behalf of the user.
What distinguishes them from earlier "AI chatbots" is not just better language. It comes down to two structural changes. First, context windows have ballooned: Claude Opus 4.6 ships 1M tokens at no surcharge, Gemini 3.1 Ultra holds 2M, and DeepSeek V4 Flash offers a million-token window at $0.14 / $0.28 per million input/output tokens. For many businesses, the entire help center, return policy and last six months of conversation history can now sit inside a single prompt. Second, agentic tool use has crossed from demoware into production. Kimi K2.6 runs autonomous coding sessions that span 12 hours and 4,000 coordinated steps; GLM-5.1 sustains an 8-hour plan-execute-test-fix loop; Claude Opus 4.7 leads SWE-bench Pro at 64.3%. That same reliability on long, multi-step tasks is what makes booking flows, refund pipelines, and order-lookup actions feel safe to put in front of customers.
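To make the pricing concrete, here is a minimal Python sketch of per-call cost at the DeepSeek V4 Flash rates quoted above. The request sizes are hypothetical, and the function ignores any prompt-caching discounts a provider may offer.

```python
# Per-million-token prices quoted above for DeepSeek V4 Flash (USD).
INPUT_PRICE_PER_M = 0.14
OUTPUT_PRICE_PER_M = 0.28


def query_cost_usd(input_tokens: int, output_tokens: int,
                   input_price_per_m: float = INPUT_PRICE_PER_M,
                   output_price_per_m: float = OUTPUT_PRICE_PER_M) -> float:
    """Cost of one model call; input and output tokens are billed at separate rates."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# Hypothetical request sizes, for illustration only.
short_lookup = query_cost_usd(input_tokens=3_000, output_tokens=300)
full_context = query_cost_usd(input_tokens=900_000, output_tokens=300)

print(f"short lookup: ${short_lookup:.6f}")  # $0.000504
print(f"full context: ${full_context:.6f}")  # $0.126084
```

At these rates a short, retrieval-sized lookup costs about five hundredths of a cent, while filling most of a million-token window on every turn costs about 12.6 cents, roughly 250 times more. That gap is why the RAG-versus-long-context question later in this guide is a cost decision as well as a quality one.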
What they do well
- Understand how real customers phrase things, including typos, multi-intent questions, and conversations that pivot mid-thread.
- Hold context across turns, escalations, and even sessions, so the customer never has to repeat themselves.
- Train on your own corpus - product docs, Notion, Drive, websites, YouTube transcripts - and reason over it instead of just retrieving it.
- Take real actions: pull up an order, refund a charge, book a demo, change a subscription, escalate to a human with a full transcript and a one-paragraph summary.
Best for: companies serious about deflection rates, with enough volume that scripting individual flows is no longer practical, and a service surface that includes both information requests and transactional asks.
Concrete example: A 200-seat B2B SaaS company deploys a Berrydesk agent trained on its docs, billing pages and changelog, and routed across three models: DeepSeek V4 Flash for "how do I" questions, Claude Opus 4.7 for billing edge cases, and Gemini 3.1 Pro for anything multimodal (screenshots, exported PDFs, video walkthroughs of a bug). Tier-1 ticket volume drops 62% in the first quarter, and the team redeploys two support engineers onto onboarding.
Watch out for: training data hygiene. A frontier model is only as accurate as the source material it sees. Stale FAQs, contradictory policies and abandoned help articles will show up as confident, fluent answers that are wrong.
2. Rule-Based Chatbots
At the other end of the spectrum, rule-based bots are still everywhere, and for the right job they are genuinely the right tool. They follow a fixed decision tree - if the user clicks "Track an order," ask for the order number, then look it up, then return the status. There is no model, no inference, no surprises.
The strength of a rule-based bot is also its limit: it does exactly what you scripted, and nothing else. That is perfect when the task is bounded and the customer just needs you to collect three pieces of structured information in a known order. It is a disaster the moment a real customer types a sentence you didn't anticipate.
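A rule-based flow like the order-tracking example above can be sketched as a small state machine. This is an illustrative Python sketch, not any particular vendor's flow builder; the order data is a hypothetical stand-in for a real order API.

```python
class OrderTrackingBot:
    """A fixed decision tree: every branch is written out by hand."""

    def __init__(self, orders: dict[str, str]):
        self.orders = orders  # order number -> status (stand-in for an order API)
        self.state = "menu"

    def handle(self, message: str) -> str:
        if self.state == "menu":
            if message == "Track an order":
                self.state = "awaiting_order_number"
                return "What is your order number?"
            # Anything the tree did not anticipate gets the same canned reply.
            return "Please choose an option from the menu."

        if self.state == "awaiting_order_number":
            order_number = message.strip().upper()
            status = self.orders.get(order_number)
            if status is None:
                return "I couldn't find that order. Please check the number and try again."
            self.state = "menu"
            return f"Order {order_number} is {status}."

        raise RuntimeError(f"unknown state: {self.state}")


bot = OrderTrackingBot({"A1001": "shipped"})
print(bot.handle("Track an order"))  # What is your order number?
print(bot.handle(" a1001 "))         # Order A1001 is shipped.
```

Note the canned reply in the menu state: whatever the customer types that the tree did not anticipate gets exactly that answer, every time.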
What they do well
- Cheap to build, cheap to run, easy to audit.
- Perfectly consistent - the same input always produces the same output, which compliance teams appreciate.
- Predictable cost: no per-token billing, no model drift between versions.
Best for: highly structured, single-purpose flows like reservation booking, return label generation, or appointment scheduling where the conversation truly is a form in disguise.
Concrete example: A regional restaurant group uses a rule-based widget on its booking page to take party size, date, time and contact details. There is nothing to "understand," and an LLM would be overkill.
Watch out for: the moment your scope grows beyond two or three flows, rule-based bots start to feel like a maze. Customers go off-script, the tree branches multiply, and every change requires a developer or a flow-builder session. Most teams that start here outgrow it inside a year.
3. Keyword-Recognition Chatbots
Keyword bots are a half-step up from pure rule-based: they let the user type freely, but the engine is still essentially pattern matching. If "shipping" and "time" appear together, route to the delivery FAQ; if "refund" appears, route to the returns flow.
In 2026 this category is mostly legacy. The reason is straightforward: a frontier-model agent costs roughly the same per resolution as a keyword bot - sometimes less, once you route routine traffic to DeepSeek V4 Flash or MiniMax M2 - and is dramatically more accurate. A typical Berrydesk deployment running open-weight models on commodity inference can serve a routine query for a fraction of a cent. There is little economic argument left for shipping a keyword matcher in front of customers.
What they do well
- More flexible than menu-based rule bots, since users can type freely.
- Easy to extend with new keywords as new product questions emerge.
Best for: very small, very stable knowledge surfaces - a single product, a small FAQ - where investing in real AI is genuinely overkill.
Concrete example: A solo creator selling one digital product uses a keyword bot to answer "where is my download," "I lost my license key," and "how do I get a refund." Three keywords, three responses, and that is the whole product surface.
Watch out for: misspellings, synonyms, and sentences that contain a keyword but mean something else ("I don't want a refund, I just want to know your refund policy"). Each of these is a quietly broken interaction the bot will never tell you about.
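The two routing rules described at the top of this section, and the failure modes listed just above, fit in a few lines of Python. The route names are hypothetical; the point of this sketch is what it gets wrong.

```python
import re


def route(message: str) -> str:
    """Keyword routing: look for trigger words, ignore meaning."""
    words = set(re.findall(r"[a-z']+", message.lower()))
    if "shipping" in words and "time" in words:
        return "delivery_faq"
    if "refund" in words:
        return "returns_flow"
    return "fallback"


print(route("What is the shipping time to Canada?"))  # delivery_faq
print(route("I don't want a refund, I just want to know your refund policy"))  # returns_flow (misrouted)
print(route("How long does shipping take?"))  # fallback (synonym missed)
print(route("I need a refnud"))  # fallback (typo missed)
```

The second call sends a policy question straight into the returns flow, and the last two fall through to the fallback on a synonym and a typo, with nothing in the output to flag that anything went wrong.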
4. Machine-Learning Chatbots (the pre-LLM generation)
This category is what dominated customer support automation from roughly 2018 through 2023: intent classifiers trained on tagged conversation data, paired with a response retrieval engine. Architecturally these are a step above keyword matching - they understand intent, not just words - but they are firmly below modern LLM-based agents.
In a 2026 stack, pure intent-classifier bots survive mainly inside large enterprise platforms that haven't migrated yet. The pattern still matters, though, because a lot of modern AI agents quietly use intent classification under the hood - for example, to decide whether to route a request to a cheap open-weight model like DeepSeek V4 Flash or escalate it to Claude Opus 4.7.
What they do well
- Genuinely improve with usage data, particularly around your most frequent intents.
- Can be tuned to a specific industry vocabulary (insurance, finance, healthcare).
- Predictable behavior - they classify into a fixed set of intents, then retrieve a fixed response.
Best for: legacy estates that already have years of labeled support data and a team comfortable maintaining ML pipelines.
Concrete example: A bank's contact center runs an intent classifier across 80 supported intents - card lost, dispute charge, change PIN, statement copy. Volume is high, the language is bounded, and accuracy on the top 20 intents matters more than open-ended conversation.
Watch out for: retraining cost and the long tail. Anything outside your top intents collapses into a generic fallback. Modern LLM agents handle the long tail natively, which is why most ML-only deployments are gradually being replaced or wrapped.
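For readers who have not worked with this generation, the core idea is a classifier that maps each message to one of a fixed set of intents and drops everything else into a fallback. The sketch below uses a bag-of-words nearest-centroid matcher as a toy stand-in for a trained model (a real deployment would train a proper classifier on far more tagged data); the intents and example phrases are hypothetical, borrowed from the bank scenario above.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class IntentClassifier:
    """Nearest-centroid intent matcher with a confidence floor."""

    def __init__(self, labeled_examples, threshold: float = 0.4):
        self.threshold = threshold
        self.centroids: dict[str, Counter] = {}
        for text, intent in labeled_examples:
            # Each intent's centroid is the summed word counts of its examples.
            self.centroids.setdefault(intent, Counter()).update(bag_of_words(text))

    def classify(self, text: str) -> tuple[str, float]:
        vec = bag_of_words(text)
        best_intent, best_score = "fallback", 0.0
        for intent, centroid in self.centroids.items():
            score = cosine(vec, centroid)
            if score > best_score:
                best_intent, best_score = intent, score
        # Anything below the floor lands in the generic fallback: the long-tail problem.
        if best_score < self.threshold:
            return "fallback", best_score
        return best_intent, best_score


EXAMPLES = [
    ("I lost my card", "card_lost"),
    ("my card was stolen", "card_lost"),
    ("I want to dispute a charge", "dispute_charge"),
    ("there is a charge I don't recognise", "dispute_charge"),
    ("how do I change my PIN", "change_pin"),
    ("reset my pin number", "change_pin"),
    ("can I get a copy of my statement", "statement_copy"),
    ("send me last month's statement", "statement_copy"),
]

classifier = IntentClassifier(EXAMPLES)
print(classifier.classify("I think my card is lost")[0])                        # card_lost
print(classifier.classify("What are your branch opening hours on weekends?")[0])  # fallback
```

The threshold is the long-tail trade-off in miniature: raise it and more messages fall back; lower it and more get forced into the wrong intent.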
5. Hybrid Chatbots
Hybrid bots are the practical answer most enterprise teams actually ship. They combine deterministic flows for the parts of the conversation that genuinely are forms - collect a policy number, confirm an email, capture an address - with an LLM-driven layer for the parts that are real conversation.
Done well, this gives you the auditability of a scripted flow where regulators, legal or finance need it, and the flexibility of a frontier model everywhere else. Done poorly, it produces visible seams: the bot suddenly stops sounding fluent because it has dropped into a rigid step.
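Here is a minimal Python sketch of that split, assuming two hypothetical scripted fields (a policy number and an email, with made-up formats) and an llm_reply callable standing in for whatever model layer you use. The scripted branch validates each field against a fixed pattern and writes an audit record; only once every field is captured does the conversation reach the model.

```python
import re
from datetime import datetime, timezone
from typing import Callable

# Hypothetical formats for the scripted fields, collected in this order.
FIELD_RULES = {
    "policy_number": re.compile(r"[A-Z]{2}\d{6}"),
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
}


def handle_turn(state: dict, message: str, audit_log: list,
                llm_reply: Callable[[str, dict], str]) -> str:
    missing = [field for field in FIELD_RULES if field not in state]
    if missing:
        # Deterministic branch: validate one field and log the outcome for compliance.
        field = missing[0]
        value = message.strip()
        valid = FIELD_RULES[field].fullmatch(value) is not None
        audit_log.append({"field": field, "valid": valid,
                          "at": datetime.now(timezone.utc).isoformat()})
        if not valid:
            return f"That doesn't look like a valid {field.replace('_', ' ')}. Please try again."
        state[field] = value
        if len(missing) > 1:
            return f"Thanks. Please enter your {missing[1].replace('_', ' ')}."
        return "Thanks, you're verified. How can I help?"
    # Every structured field is captured: the open-ended part goes to the LLM layer.
    return llm_reply(message, state)
```

Logging pass/fail rather than the raw value is one way to keep personal data out of the audit trail while still showing that each scripted step ran.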
What they do well
- Predictable for structured data capture; flexible for everything else.
- Easier to satisfy compliance, since the regulated parts of the flow are deterministic and loggable.
- Natural migration path: you can start fully scripted and gradually replace branches with LLM-driven handling.
Best for: mid-market and enterprise companies in regulated verticals - insurance, finance, healthcare, travel - where some interactions must follow a fixed script and others must feel human.
Concrete example: A health insurer uses a hybrid agent to handle pre-authorization questions. A scripted module collects member ID, date of birth and procedure code; an LLM layer running on Qwen 3.6-Plus interprets the member's plain-language description of their condition and surfaces the right policy section, with citations. The scripted layer keeps audit logs the compliance team can defend; the LLM layer keeps the conversation from feeling like a 1998 IVR.
Watch out for: the handoff. Customers notice when the bot's voice changes mid-conversation. Spend real design time on the transitions - even using the same model to "soften" the wording of the scripted prompts goes a long way.
6. Voice Bots
Voice bots are conversational AI on a phone line, kiosk or in-car system. Architecturally, they wrap the chat archetypes described above in a speech layer: speech-to-text feeds a language model, the model produces a response, and text-to-speech speaks it back, often against sub-second latency targets. Behind the scenes, the language layer can be any of the same models used for text - though latency budgets push most teams toward smaller, faster variants.
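The cascaded pipeline described above can be sketched as three swappable stages with a per-turn latency check. The stage functions are placeholders for your speech-to-text, model and text-to-speech providers, and the 1,000 ms default is only an illustrative reading of "sub-second".

```python
import time
from typing import Callable


def voice_turn(audio: bytes,
               transcribe: Callable[[bytes], str],
               respond: Callable[[str], str],
               synthesize: Callable[[str], bytes],
               budget_ms: float = 1000.0):
    """One cascaded voice turn (STT -> LLM -> TTS) with per-stage timings in ms."""
    timings: dict[str, float] = {}

    def timed(stage: str, fn: Callable, arg):
        start = time.perf_counter()
        result = fn(arg)
        timings[stage] = (time.perf_counter() - start) * 1000.0
        return result

    text = timed("stt_ms", transcribe, audio)
    reply = timed("llm_ms", respond, text)
    speech = timed("tts_ms", synthesize, reply)
    within_budget = sum(timings.values()) <= budget_ms
    return speech, timings, within_budget
```

Real deployments usually stream all three stages so the caller hears the first words before the full reply exists; this sequential version simply makes it easy to see which stage is eating the budget.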
Two things have changed in 2026 that matter. First, latency on frontier-class reasoning has dropped enough that real-time voice with an open-weight model like MiniMax M2 (8% of the price of Claude Sonnet at twice the speed) is finally practical. Second, native multimodality in models like Gemini 3.1 Ultra means voice no longer has to be transcribed into text and back; the model reasons over the audio directly, which preserves prosody, hesitation and intent in ways pure transcription can't.
What they do well
- Hands-free interaction in cars, warehouses, clinics and contact centers.
- High accessibility for visually impaired users and customers who simply prefer speaking.
- Direct replacement for legacy IVR menus, with real comprehension instead of "press 2 for billing."
Best for: healthcare clinics, automotive, logistics, hospitality, and any business already paying for a phone-based contact center.
Concrete example: A multi-location dental group replaces its phone tree with a voice bot that handles appointment booking, prescription refills and insurance verification. The bot answers in under a second, books into the practice management system via an AI Action, and only escalates to a human when a clinical question arises.
Watch out for: noise, accents, and barge-in handling. Voice systems that sound great in a demo room frequently fall apart in a moving car or a busy clinic lobby. Test in the actual environment before you scale.
7. Messaging-Channel Bots (SMS, WhatsApp, Slack, Discord, and beyond)
The seventh archetype is less about the underlying engine and more about the surface: the bot lives where the customer already is. SMS still has the broadest reach - open rates above 95%, no app install, no internet required - but in 2026 the more interesting channels for most businesses are WhatsApp (over 2B users), Slack and Discord for B2B and community-driven products, and increasingly platform-native chat inside iMessage and RCS.
The pattern that wins on these channels is the same frontier-model agent from category 1, deployed as a messaging endpoint. A Berrydesk agent, for example, can be configured once and surfaced simultaneously on a website widget, a WhatsApp number, a Slack workspace, and a Discord server, all sharing the same training data and the same AI Actions.
What they do well
- Meet customers on the channel they already use, instead of asking them to come to your site.
- High engagement: messaging-channel deflection rates routinely beat website widgets, because the message thread persists and the customer doesn't have to repeat themselves on return.
- Native to async support - a customer can ask a question Friday night and finish the conversation Monday morning without losing context.
Best for: consumer brands with global reach (WhatsApp), software companies with active communities (Discord, Slack), and any business already running marketing or transactional messaging through SMS.
Concrete example: A direct-to-consumer skincare brand runs a WhatsApp agent for order tracking, exchanges and routine questions, and a separate Discord agent for its community of power users. Both share a single knowledge base in Berrydesk; the WhatsApp agent uses DeepSeek V4 Flash for cost, the Discord agent uses Claude Opus 4.7 for nuanced product chemistry questions.
Watch out for: channel-specific format limits (SMS character caps, WhatsApp template approvals, Slack message formatting) and per-message cost. A high-volume SMS deployment can quietly become expensive if you don't model the unit economics first.
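SMS unit economics are mostly segment arithmetic. In the GSM-7 encoding a single message holds 160 characters, but once a message is split each segment carries only 153, because part of the payload goes to a concatenation header; Unicode (UCS-2) messages drop to 70 and 67. The sketch below uses those limits, ignores GSM-7 extension characters (which count double), and treats the per-segment price as a placeholder rather than a real carrier rate.

```python
import math

# (limit for a single message, per-segment limit once a message is split)
SEGMENT_LIMITS = {
    "gsm7": (160, 153),
    "ucs2": (70, 67),
}


def sms_segments(text: str, encoding: str = "gsm7") -> int:
    single, per_segment = SEGMENT_LIMITS[encoding]
    if encoding == "ucs2":
        length = len(text.encode("utf-16-le")) // 2  # UTF-16 code units
    else:
        length = len(text)  # simplification: ignores GSM-7 extension characters
    if length <= single:
        return 1
    return math.ceil(length / per_segment)


def monthly_sms_cost(messages: int, segments_per_message: int,
                     price_per_segment: float) -> float:
    return messages * segments_per_message * price_per_segment


# Hypothetical volume and placeholder rate, not a quoted carrier price.
reply = "x" * 400
print(sms_segments(reply))  # 3
print(monthly_sms_cost(100_000, sms_segments(reply), price_per_segment=0.0079))
```

A 400-character plain-text answer is three segments, so trimming replies or linking out to a longer answer cuts the bill directly.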
How to choose: a practical framework
In 2026 the choice is rarely "AI or not." It is "which AI, on which channels, with how much determinism layered in?" A few questions to anchor the decision:
Start from the work, not the technology. Pull a representative sample of 200 real conversations from your support inbox. Tag each one as either "structured form" (looks like collecting fields), "knowledge lookup" (the answer is in your docs), or "action" (the customer needs you to do something - refund, reschedule, escalate). The mix tells you what kind of bot you actually need. Heavy form weight pushes you toward rule-based or hybrid; heavy knowledge and action weight pushes you toward an AI agent.
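The tagging exercise reduces to a little counting. This Python sketch assumes the three tag names from the paragraph above and reads "heavy form weight" as form conversations outnumbering knowledge and action conversations combined; that cutoff is an assumption, not a rule.

```python
from collections import Counter

TAGS = ("structured form", "knowledge lookup", "action")


def conversation_mix(tags: list[str]) -> dict[str, float]:
    """Share of each tag across a hand-labeled sample of conversations."""
    counts = Counter(tags)
    unknown = set(counts) - set(TAGS)
    if unknown:
        raise ValueError(f"unexpected tags: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no conversations tagged")
    return {tag: counts[tag] / total for tag in TAGS}


def suggest_archetype(mix: dict[str, float]) -> str:
    # Assumed cutoff: forms outweigh knowledge and action conversations combined.
    if mix["structured form"] > mix["knowledge lookup"] + mix["action"]:
        return "rule-based or hybrid"
    return "AI agent"


sample = ["structured form"] * 40 + ["knowledge lookup"] * 110 + ["action"] * 50
print(suggest_archetype(conversation_mix(sample)))  # AI agent
```

A mix that lands close to the cutoff is usually the signal to look at a hybrid design rather than either extreme.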
Match the channel to the customer. If your customers are global consumers, WhatsApp is non-negotiable. If they are developers, Discord and Slack matter more than a website widget. If they are clinical staff with their hands full, voice beats anything else. The channel decision often constrains the engine decision more than people expect.
Pick a model strategy, not a model. A single-model deployment in 2026 is usually a mistake. The more economical pattern is to route routine traffic to an open-weight model (DeepSeek V4 Flash, MiniMax M2, Qwen 3.6) and reserve a frontier model (Claude Opus 4.7, GPT-5.5 Pro, Gemini 3.1 Ultra) for hard escalations. Berrydesk supports this routing natively, which means you can tune cost per resolution without shipping a worse experience.
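One common way to implement this is a cascade: known-hard topics go straight to the frontier tier, and everything else tries the open-weight model first, escalating only when its answer looks unreliable. The Python sketch below is generic, not Berrydesk's routing API. The topic labels and threshold are hypothetical, and the confidence score is an assumption: in practice it might come from a separate classifier, a self-grading step, or token log-probabilities where a provider exposes them.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ModelReply:
    text: str
    confidence: float  # assumed 0-1 score; its source depends on your stack


HARD_TOPICS = {"billing_dispute", "legal", "outage"}  # hypothetical topic labels


def route_and_answer(question: str, topic: str,
                     cheap_model: Callable[[str], ModelReply],
                     frontier_model: Callable[[str], ModelReply],
                     min_confidence: float = 0.7) -> tuple[str, ModelReply]:
    # Known-hard topics skip the cheap tier entirely.
    if topic in HARD_TOPICS:
        return "frontier", frontier_model(question)
    # Everything else tries the open-weight tier first...
    reply = cheap_model(question)
    if reply.confidence >= min_confidence:
        return "cheap", reply
    # ...and escalates only when that answer looks shaky.
    return "frontier", frontier_model(question)
```

The min_confidence knob is where cost per resolution gets tuned: every point of threshold you raise sends more traffic to the expensive tier.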
Decide how much determinism you actually need. Regulated industries genuinely benefit from hybrid designs. Most B2C and B2B SaaS use cases do not - the determinism people ask for at the start often turns out to be a comfort blanket that hurts deflection.
Plan for AI Actions from day one. The biggest jump in deflection in 2026 is not better answers, it is bots that can actually do the thing - refund the charge, change the shipping address, book the demo. If your platform doesn't have a clean way to wire actions into the conversation, you are building a glorified FAQ. Make sure the platform you pick treats actions as first-class, not as a roadmap promise.
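At its core, an action is a named function the model can ask to call with structured arguments, and the runtime's job is to refuse anything malformed before it touches a real system. Here is a minimal, framework-agnostic Python sketch; it is not Berrydesk's AI Actions API, and the two actions and their argument names are hypothetical.

```python
from typing import Any, Callable

# name -> (handler, required argument names)
ACTIONS: dict[str, tuple[Callable[..., dict], frozenset[str]]] = {}


def action(name: str, *required: str):
    """Register a function as an action the model is allowed to request."""
    def register(fn: Callable[..., dict]) -> Callable[..., dict]:
        ACTIONS[name] = (fn, frozenset(required))
        return fn
    return register


@action("refund_charge", "charge_id", "amount_cents")
def refund_charge(charge_id: str, amount_cents: int) -> dict:
    # Placeholder: a real handler would call your billing provider here.
    return {"status": "refunded", "charge_id": charge_id, "amount_cents": amount_cents}


@action("change_shipping_address", "order_id", "address")
def change_shipping_address(order_id: str, address: str) -> dict:
    # Placeholder: a real handler would update the order system here.
    return {"status": "updated", "order_id": order_id, "address": address}


def execute(call: dict[str, Any]) -> dict:
    """Validate a model-proposed call such as {"name": ..., "arguments": {...}}, then run it."""
    entry = ACTIONS.get(call.get("name", ""))
    if entry is None:
        return {"error": f"unknown action: {call.get('name')!r}"}
    handler, required = entry
    args = call.get("arguments", {})
    arg_names = set(args)
    missing, extra = required - arg_names, arg_names - required
    if missing or extra:
        return {"error": f"bad arguments: missing={sorted(missing)}, extra={sorted(extra)}"}
    return handler(**args)
```

Validating against a registry means a hallucinated action name or a missing argument produces an error the model can recover from, rather than a half-executed refund.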
Measure relentlessly. Resolution rate, escalation rate, customer satisfaction post-conversation, and cost per resolution are the four numbers that matter. Any platform that cannot show you all four in a dashboard is a platform you will outgrow.
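One way to compute all four from conversation records is sketched below in Python. The field names are hypothetical, and the definitions are choices you may want to adjust: here cost per resolution divides total spend, including escalated conversations, by the number of resolved ones, so a rising escalation rate shows up as a rising unit cost.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Conversation:
    resolved: bool        # closed by the bot, no human needed
    escalated: bool       # handed off to a human agent
    csat: Optional[int]   # post-conversation rating (1-5), None if the survey was skipped
    cost_usd: float       # model + channel spend for this conversation


def support_metrics(convs: list[Conversation]) -> dict:
    if not convs:
        raise ValueError("need at least one conversation")
    n = len(convs)
    resolved = sum(c.resolved for c in convs)
    escalated = sum(c.escalated for c in convs)
    ratings = [c.csat for c in convs if c.csat is not None]
    total_cost = sum(c.cost_usd for c in convs)
    return {
        "resolution_rate": resolved / n,
        "escalation_rate": escalated / n,
        "csat_avg": sum(ratings) / len(ratings) if ratings else None,
        # Total spend, escalations included, divided by resolved conversations.
        "cost_per_resolution": total_cost / resolved if resolved else float("inf"),
    }
```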
Open-weight or closed frontier? A short trade-off
One question that comes up in almost every deployment: should you build on Claude Opus 4.7 / GPT-5.5 / Gemini 3.1, or on the open-weight frontier - DeepSeek V4, GLM-5.1, Kimi K2.6, Qwen 3.6, MiMo-V2-Pro?
The honest answer is "both, and route between them." Closed frontier models still have an edge on the hardest reasoning, the most ambiguous requests, and anything multimodal at scale. Open-weight models - particularly those released under MIT or Apache 2.0 licenses, such as GLM-5.1 and Qwen 3.6-27B - have an edge on cost, on data residency, and on being deployable on-prem or air-gapped for regulated industries. A customer-support stack that runs 70% of its traffic on a $0.14-per-million-token open-weight model and reserves 30% for a frontier model usually beats either extreme on combined cost and quality.
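The 70/30 arithmetic is easy to check against your own prices. This Python sketch uses the $0.14 input rate quoted above for the open-weight tier and a placeholder frontier price, and it assumes requests in both tiers use similar token counts.

```python
def blended_price_per_m(cheap_share: float, cheap_price: float, frontier_price: float) -> float:
    """Average per-million-token price when cheap_share of traffic goes to the cheap tier."""
    if not 0.0 <= cheap_share <= 1.0:
        raise ValueError("cheap_share must be between 0 and 1")
    return cheap_share * cheap_price + (1.0 - cheap_share) * frontier_price


CHEAP_INPUT = 0.14     # open-weight input price quoted above (USD per million tokens)
FRONTIER_INPUT = 5.00  # placeholder, not a quoted rate for any specific model

mixed = blended_price_per_m(0.7, CHEAP_INPUT, FRONTIER_INPUT)
all_frontier = blended_price_per_m(0.0, CHEAP_INPUT, FRONTIER_INPUT)
print(f"70/30 mix: ${mixed:.3f} per million input tokens")          # $1.598
print(f"all frontier: ${all_frontier:.3f} per million input tokens")  # $5.000
```

With that placeholder, the 30% frontier slice accounts for roughly 94% of the blended price, so the escalation rate tends to matter more to the bill than the cheap model's rate.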
RAG or long context?
A second 2026 trade-off worth naming: with 1M-token context on Claude Opus 4.6 and Sonnet 4.6, and 2M on Gemini 3.1 Ultra, many businesses can simply put their entire knowledge base in the prompt. RAG (retrieval-augmented generation) becomes a tuning lever rather than a hard requirement. The right answer depends on your corpus size and update cadence: a 50-page product doc fits comfortably in long context; a 50,000-article help center still needs RAG. Most teams end up using both - RAG for breadth, long context for the recent conversation and the most relevant policies.
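A rough first-pass check for which side of that line your corpus falls on is sketched below in Python. It assumes about 1.3 tokens per English word (a common rule of thumb; real counts depend on the tokenizer) and reserves a quarter of the window for conversation history and the reply; the page and article sizes in the example are hypothetical.

```python
def estimated_tokens(word_count: int, tokens_per_word: float = 1.3) -> int:
    # Rough rule of thumb for English prose; the real count depends on the tokenizer.
    return round(word_count * tokens_per_word)


def context_strategy(corpus_words: int, context_window_tokens: int,
                     reserve_fraction: float = 0.25) -> str:
    """First-pass check: does the whole corpus fit with room left for chat and reply?"""
    budget = context_window_tokens * (1.0 - reserve_fraction)
    if estimated_tokens(corpus_words) <= budget:
        return "long context"
    return "RAG"


# Hypothetical sizes: a 50-page doc at ~500 words per page, and a
# 50,000-article help center at ~600 words per article.
print(context_strategy(50 * 500, 1_000_000))      # long context
print(context_strategy(50_000 * 600, 2_000_000))  # RAG
```

Passing this check only says the corpus fits; whether re-sending it on every turn is affordable, and how often it changes, are separate questions.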
The shortest version
If you remember nothing else: the 2026 default for customer support is an AI agent built on a frontier or open-weight LLM, with action-taking wired in from day one, deployed across the channels your customers already use. Rule-based, keyword and pure ML bots still have legitimate niches, but they are niches, not defaults. Voice and messaging-channel bots are surfaces for the agent, not separate categories underneath.
Berrydesk is built for exactly this shape of deployment. Pick a model - GPT-5.5, Claude Opus 4.7, Gemini 3.1, DeepSeek V4, Kimi K2.6, GLM-5.1, Qwen, MiniMax, and more. Train it on your docs, websites, Notion and Drive. Brand the chat widget. Wire in AI Actions for the bookings, refunds and lookups that drive real deflection. Then deploy to your website, Slack, Discord, WhatsApp and beyond, all from one place.
If you are in the middle of choosing a chatbot type for your team, the fastest way to decide is to build one. Spin up a Berrydesk agent for free at berrydesk.com, point it at your existing knowledge base, and watch which conversations it resolves on its own. The right archetype usually becomes obvious within a week of real traffic.
Chirag Asarpota is the founder of Strawberry Labs, the team behind Berrydesk - the AI agent platform that helps businesses deploy intelligent customer support, sales and operations agents across web, WhatsApp, Slack, Instagram, Discord and more. Chirag writes about agentic AI, frontier model selection, retrieval and 1M-token context strategy, AI Actions, and the engineering it takes to ship production-grade conversational AI that customers actually trust.



