
For most of the last three years, the phrase "AI chatbot" has been doing too much work. It has been used to describe everything from a thin FAQ widget pasted onto a marketing site, to a retrieval system that quotes a help center back at customers, to a fully agentic system that can change a subscription plan, refund an order, and schedule a follow-up call without a human ever touching the queue.
In 2026 those things are no longer the same product. The gap between a chatbot and a real support agent is now a chasm: a different model class, a different architecture, a different operating cost, and a different resolution rate. And it is the second category that customers now expect when they open a support widget.
This post is about that shift: what actually changed under the hood, what an agent can do that a chatbot cannot, and how a support team should think about deploying one today. We build Berrydesk around this thesis, so we have skin in the game, but the argument here is independent of any one platform.
The chatbot era was a stepping stone
Early support bots were retrieval engines wearing a chat costume. You pointed them at a help center, embedded the pages into a vector store, and at runtime the system pulled the closest passages and asked a small language model to rephrase them. The model's job was almost cosmetic. The substance came from whatever the search layer happened to retrieve.
This pattern worked for a narrow band of tickets - the ones whose answer was already a single, well-written paragraph somewhere in the docs. For everything else it failed in predictable ways. It hallucinated when the retrieval was thin. It looped when a customer rephrased the same question. It could not reason across two documents at once. It had no notion of the customer's actual account state, so it could only describe what should happen, never check whether it did. And critically, it could not act. The bot could tell you that you were eligible for a refund. It could not issue one.
Support leaders learned to live with this. The metric that mattered was deflection - how many tickets the bot took off the human queue - and a 20–30% deflection rate was considered a strong result. The other 70%+ went to humans, because the bot had no way to handle anything that required state, context, or a tool call.
That ceiling is gone. The combination of frontier reasoning models, million-token context windows, and reliable tool-use has pushed agentic resolution into the realm of routine engineering. The interesting question is no longer whether an AI can resolve a real ticket. It is which model you should route the ticket to, and which tools the agent needs access to.
What actually changed in the model layer
Three things shifted in the last twelve months, and together they explain why agents finally work.
Reasoning got serious
The frontier closed models - OpenAI's GPT-5.5 and GPT-5.5 Pro with parallel reasoning, Anthropic's Claude Opus 4.7 leading SWE-bench Pro at 64.3%, Google's Gemini 3.1 Pro topping GPQA Diamond at 94.3% - are not marginally better than the GPT-4-class models support teams started with. They handle multi-step problems where the agent has to read a policy document, check the customer's account, decide whether they qualify, and then call the right tool with the right arguments. That is the unit of work a support ticket actually consists of, and the previous generation of models could not close it without a hand-built decision tree wrapping every conversation.
Agentic-first open-weight models pushed this further. Moonshot's Kimi K2.6 runs autonomous coding sessions for twelve hours with swarms of up to three hundred sub-agents and four thousand coordinated steps. Z.ai's GLM-5.1 - a 754B-parameter MoE released under the MIT license - runs an eight-hour plan-execute-test-fix loop and posts 58.4 on SWE-bench Pro, beating both GPT-5.4 and Claude Opus 4.6 on that benchmark. Xiaomi's MiMo-V2-Pro, with weights open-sourced under MIT, pushes past a trillion total parameters with 42B active and a million-token context. Support tickets are not coding tasks, but the underlying capability - long-horizon reasoning with reliable tool calls - is exactly what an agent needs to chain "look up the order, check the policy, refund the line item, send a confirmation email" without falling off the rails.
Context windows stopped being a constraint
A year ago, support architects spent most of their time tuning retrieval. They worried about chunk size, embedding choice, hybrid search, re-ranker latency, top-K cutoffs. The reason was simple: the model could only see a few thousand tokens at a time, so getting the right few thousand tokens in front of it was the whole game.
That changed. Anthropic's Claude Opus 4.6 and Sonnet 4.6 ship with a 1M-token context window at no surcharge. Gemini 3.1 Ultra goes to 2M tokens, natively multimodal across text, image, audio, and video. Even on the open side, DeepSeek V4 Pro and V4 Flash both ship with 1M context, and MiMo-V2-Pro matches that. A typical SaaS company's entire help center, runbook library, refund policy, and the full conversation history of an individual customer comfortably fits in-context on any of these models.
This does not kill RAG - for very large knowledge bases or freshness-sensitive data, retrieval is still the right tool. But it turns RAG from a hard architectural requirement into a tuning lever. You can ship a working agent that just stuffs the relevant docs into the prompt, watch where it breaks, and only then invest in a retrieval layer to fix the bottleneck. The order of operations flipped.
The cost floor collapsed
The third change is the one most teams are sleeping on. Open-weight frontier models from DeepSeek, Z.ai, Moonshot, MiniMax, Alibaba, and Xiaomi have collapsed the cost of running a production agent.
DeepSeek V4 Flash - a 284B-parameter MoE with 13B active - is priced at $0.14 per million input tokens and $0.28 per million output tokens. MiniMax M2 and M2.7 - open-weight, self-evolving agent models - clock in at roughly 8% of the price of Claude Sonnet and twice the speed, with M2.7 hitting 56.22% on SWE-Pro. Alibaba's Qwen3.6-27B is dense, Apache-licensed, and beats some 397B-parameter MoE rivals on agentic coding benchmarks. None of these were realistic to operate a year ago. All of them are now.
What this unlocks is routing. A support agent does not need a frontier model on every turn. It needs the cheap, fast, capable open-weight models on the long tail of routine intent - order lookups, password resets, shipping status - and a Claude Opus 4.7, GPT-5.5, or Gemini 3.1 Ultra on the hard escalations where reasoning quality is load-bearing. A well-routed deployment can cut per-resolution cost by an order of magnitude versus a single-frontier-model setup, without giving up the quality on the calls that matter.
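To make the routing argument concrete, here is a minimal sketch of intent-based tier selection with a per-turn cost estimate. The DeepSeek V4 Flash prices are the ones quoted above; the frontier prices, the tier names, the intent labels, and the functions `pick_tier` and `turn_cost` are illustrative assumptions, not any provider's or platform's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str
    input_usd_per_mtok: float   # USD per million input tokens
    output_usd_per_mtok: float  # USD per million output tokens


# Prices for the cheap tier are the DeepSeek V4 Flash figures quoted in the text.
CHEAP = ModelTier("deepseek-v4-flash", 0.14, 0.28)
# Hypothetical placeholder prices; substitute your provider's real rate card.
FRONTIER = ModelTier("frontier-placeholder", 5.00, 25.00)

# Assumed intent labels produced by an upstream classifier (not shown here).
ROUTINE_INTENTS = {"order_lookup", "password_reset", "shipping_status"}


def pick_tier(intent: str) -> ModelTier:
    """Send routine intents to the cheap tier, everything else to the frontier."""
    return CHEAP if intent in ROUTINE_INTENTS else FRONTIER


def turn_cost(tier: ModelTier, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one model call on the given tier."""
    return (
        input_tokens * tier.input_usd_per_mtok
        + output_tokens * tier.output_usd_per_mtok
    ) / 1_000_000


if __name__ == "__main__":
    for intent in ("shipping_status", "billing_dispute"):
        tier = pick_tier(intent)
        print(intent, tier.name, turn_cost(tier, 20_000, 500))
```

The size of the saving depends entirely on your real traffic mix and the real frontier price, so treat the numbers this prints as a template for your own arithmetic rather than a benchmark.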
What "taking action" actually looks like
The defining feature of an agent - versus a chatbot - is that it does not stop at the answer. It executes.
Concretely, on Berrydesk, an AI Action is a typed function the agent can call mid-conversation. The most common ones we see in production support deployments are:
- Order lookup and modification. The agent reads the customer's order from your commerce backend, checks fulfillment status, and offers options scoped to the actual state of the order, not a generic policy page.
- Refunds and payment changes. Wired to Stripe (or whatever the payment processor is), the agent calculates the refund, applies the right policy carve-outs, and issues it inside the conversation. The customer never gets a "we'll get back to you within 24 hours" reply.
- Booking and rescheduling. Calendar integrations let the agent see real availability, propose specific slots, and confirm. This single action collapses what used to be a four-message email thread into one chat turn.
- Subscription changes. Upgrade, downgrade, pause, cancel, switch billing period - handled in-line, with the agent narrating what it just did and why.
- Account-level operations. Reset password, change email, transfer ownership, add a seat, revoke an API key. The agent authenticates the customer, performs the action, and confirms the result.
Each of these is the difference between a 30% deflection rate and a 70%+ resolution rate. A chatbot that explains how to do a thing is fundamentally different from an agent that does the thing.
What makes this work in 2026, where it broke in 2024, is the reliability of structured tool-calling on agentic-first models. Kimi K2.6, GLM-5.1, Claude Opus 4.7, Qwen3.6, and MiMo-V2-Pro all treat tool use as a first-class output, not a fragile prompt-engineering trick. The agent picks the right tool, fills the right arguments, handles the response, and decides what to do next. AI Actions stop being demoware and start being the default operating mode.
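The "typed function" idea can be sketched in a few lines. This is not Berrydesk's actual API; the `Action` class, the `REGISTRY` dictionary, the `dispatch` function, and the shape of the tool-call dictionary are all assumptions made for illustration. The point is that the model's structured output is validated against a declared schema before anything touches a real backend.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Action:
    name: str
    params: dict[str, type]  # argument name -> expected Python type
    handler: Callable[..., dict[str, Any]]


def lookup_order(order_id: str) -> dict[str, Any]:
    # Stand-in for a call to a commerce backend; returns canned data.
    return {"order_id": order_id, "status": "shipped"}


REGISTRY = {
    action.name: action
    for action in [Action("lookup_order", {"order_id": str}, lookup_order)]
}


def dispatch(tool_call: dict[str, Any]) -> dict[str, Any]:
    """Validate a model-emitted tool call, then run the matching handler."""
    name = tool_call.get("name")
    action = REGISTRY.get(name)
    if action is None:
        return {"error": f"unknown action: {name}"}

    args = tool_call.get("arguments", {})
    if set(args) != set(action.params):
        return {"error": f"expected arguments {sorted(action.params)}"}
    for key, expected in action.params.items():
        if not isinstance(args[key], expected):
            return {"error": f"argument {key!r} must be {expected.__name__}"}

    return {"result": action.handler(**args)}


if __name__ == "__main__":
    print(dispatch({"name": "lookup_order", "arguments": {"order_id": "A1"}}))
    print(dispatch({"name": "lookup_order", "arguments": {"order_id": 5}}))
```

Returning errors as data rather than raising lets the agent see the failure and retry with corrected arguments, which is one reasonable design choice among several.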
How a real deployment ships
Berrydesk is built around the idea that you should not have to assemble this stack yourself. The five-step path looks like this in practice.
1. Pick the model - or let the platform route
You can pin the agent to a single provider - GPT-5.5, Claude Opus 4.7, Gemini 3.1 Ultra - or pick from the open-weight roster: DeepSeek V4, Kimi K2.6, GLM-5.1, Qwen3.6, MiniMax M2.7. Most production deployments use a routing setup: a fast, cheap open-weight model for the high-volume routine intents, with escalations to a frontier closed model when the conversation crosses a complexity or confidence threshold. The MIT/Apache-licensed open weights - GLM-5.1, Qwen3.6-27B, MiMo - make on-prem and air-gapped deploys realistic, which matters in regulated industries.
2. Train on the sources you already have
Point the agent at your help center, marketing site, product docs, Notion workspace, Google Drive folder, or YouTube channel. Long-context models mean you can be aggressive about how much you ingest - there is little penalty for over-including context that the agent only sometimes needs. Update the sources and the agent stays in sync; this is the part of the workflow that used to require a custom data pipeline and now does not.
3. Brand the widget
The agent is your brand's support surface, not a third-party chatbot. Match the tone, the visual style, and the personality of the rest of your product. This sounds cosmetic and is not - customers behave differently in a chat that feels like part of the product than in one that feels bolted on.
4. Wire up the AI Actions
This is where the agent stops being a smarter FAQ and starts being a labor replacement. Connect to Stripe for payments and refunds, Cal.com or Calendly for scheduling, your CRM for account state, your commerce platform for orders, your internal API for anything custom. Every action you wire up is one more category of ticket that resolves without a human.
5. Deploy where customers already are
A site widget is the obvious surface, but the more interesting deploys are the ones that meet customers in their existing channels: Slack, Discord, WhatsApp, email. The agent's behavior is consistent across surfaces, the integrations are the same, and the conversation context follows the customer.
The trade-offs that actually matter
A few decisions are worth thinking through up front rather than discovering at scale.
Open-weight versus closed frontier. Open-weight models are dramatically cheaper, can be self-hosted, and on agentic benchmarks the leaders are now competitive with closed models. Closed frontier still wins on the hardest reasoning and on multimodal breadth. The right answer for most support deployments is "both, routed by intent."
Long context versus retrieval. With 1M–2M-token windows, RAG is no longer the only option. For knowledge bases under a few hundred thousand tokens, in-context is simpler and often more accurate. For very large or freshness-sensitive corpora, RAG still wins. The mistake is treating one as the default; pick based on the corpus.
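The "pick based on the corpus" rule can be written down as a small decision function. This is a sketch under stated assumptions: the 300,000-token ceiling is one reading of "a few hundred thousand tokens", the four-characters-per-token estimate is a rough heuristic rather than a real tokenizer, and `estimate_tokens` and `choose_strategy` are hypothetical names.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: about 4 characters per token for English prose."""
    return len(text) // 4


def choose_strategy(
    docs: list[str],
    context_window: int = 1_000_000,
    in_context_ceiling: int = 300_000,  # assumed cutoff, tune for your corpus
    freshness_sensitive: bool = False,
) -> str:
    """Return "in_context" or "retrieval" for a given knowledge base."""
    if freshness_sensitive:
        # Data that changes faster than you re-ingest it favors retrieval.
        return "retrieval"

    total_tokens = sum(estimate_tokens(doc) for doc in docs)
    # Leave at least half the window for conversation history and tool output.
    budget = min(in_context_ceiling, context_window // 2)
    return "in_context" if total_tokens <= budget else "retrieval"


if __name__ == "__main__":
    small_help_center = ["How to reset your password. " * 50] * 20
    print(choose_strategy(small_help_center))
```

In practice you would replace the character heuristic with the tokenizer of the model you actually deploy, since token counts vary across model families.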
Agent autonomy versus guardrails. An agent that can issue refunds is also an agent that can issue wrong refunds. The right pattern is bounded autonomy: full automation under a value or risk threshold, human-in-the-loop above it. Define the bounds explicitly, in code or configuration, so they can be enforced and audited; leave them implicit and the effective policy is whatever the model happens to decide.
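Bounded autonomy is easiest to reason about when the bounds live outside the model. A minimal sketch, assuming a refund action and two hypothetical limits (`auto_approve_limit` and `hard_limit` are illustrative names and values, not a recommended policy):

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class RefundPolicy:
    auto_approve_limit: Decimal  # at or below this, the agent acts alone
    hard_limit: Decimal          # above this, the request is rejected outright


def decide_refund(amount: Decimal, policy: RefundPolicy) -> str:
    """Return "auto_approve", "needs_human", or "reject" for a refund request."""
    if amount <= 0 or amount > policy.hard_limit:
        return "reject"
    if amount <= policy.auto_approve_limit:
        return "auto_approve"
    return "needs_human"


if __name__ == "__main__":
    policy = RefundPolicy(Decimal("50.00"), Decimal("500.00"))
    for amount in ("19.99", "120.00", "900.00"):
        print(amount, decide_refund(Decimal(amount), policy))
```

Because the check runs in ordinary code after the model proposes the action, the threshold holds regardless of what the model was prompted or persuaded to do.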
Common pitfalls when moving from a chatbot to an agent
Three failure modes are worth flagging because they are common and avoidable.
The first is shipping a chatbot pretending to be an agent. If your "AI agent" cannot call any tools, it is a chatbot. Wire up at least the AI Actions for your top three ticket categories before you launch, not after.
The second is single-model lock-in. Choosing one provider for cost reasons, then watching that provider lag the frontier on a critical capability, is a story that has played out repeatedly since 2023. Build assuming you will swap models, and test new releases against your real ticket distribution rather than benchmark scores.
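One way to keep models swappable is to hide each provider behind the same small interface and score every candidate against a fixed set of your own tickets. The sketch below assumes a hypothetical `SupportModel` protocol and a toy `EchoModel`; a real adapter would wrap a provider SDK, and real checks would be richer than substring matches.

```python
from typing import Callable, Protocol


class SupportModel(Protocol):
    def reply(self, ticket: str) -> str: ...


# Each case pairs a real ticket with a predicate over the model's reply.
Case = tuple[str, Callable[[str], bool]]


def pass_rate(model: SupportModel, cases: list[Case]) -> float:
    """Fraction of ticket cases whose reply satisfies its check."""
    if not cases:
        return 0.0
    passed = sum(1 for ticket, check in cases if check(model.reply(ticket)))
    return passed / len(cases)


class EchoModel:
    """Toy stand-in for a provider adapter; it just echoes the ticket."""

    def reply(self, ticket: str) -> str:
        return f"Re: {ticket}"


if __name__ == "__main__":
    cases: list[Case] = [
        ("Where is my refund?", lambda r: "refund" in r.lower()),
        ("Cancel my plan", lambda r: "cancelled" in r.lower()),
    ]
    print(pass_rate(EchoModel(), cases))
```

Running the same cases against each new release gives you a like-for-like number on your own ticket distribution, which is what the benchmark leaderboards cannot provide.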
The third is measuring deflection instead of resolution. A deflected ticket that the customer immediately reopens is worse than no automation at all. Track end-to-end resolution, customer satisfaction on AI-handled tickets, and reopen rate. Those numbers tell you whether the agent is actually working.
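The difference between deflection and resolution shows up as soon as you compute them from ticket records. A minimal sketch, assuming a hypothetical `Ticket` record with three boolean fields; your helpdesk export will have different field names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    handled_by_ai: bool
    resolved: bool   # marked resolved without a human touching it
    reopened: bool   # customer came back on the same issue


def resolution_metrics(tickets: list[Ticket]) -> dict[str, float]:
    """End-to-end resolution rate and reopen rate for AI-handled tickets."""
    ai_tickets = [t for t in tickets if t.handled_by_ai]
    if not ai_tickets:
        return {"resolution_rate": 0.0, "reopen_rate": 0.0}

    resolved = [t for t in ai_tickets if t.resolved]
    stayed_resolved = [t for t in resolved if not t.reopened]
    reopen_rate = (
        (len(resolved) - len(stayed_resolved)) / len(resolved) if resolved else 0.0
    )
    return {
        # Only tickets that were resolved and stayed closed count as resolved.
        "resolution_rate": len(stayed_resolved) / len(ai_tickets),
        "reopen_rate": reopen_rate,
    }


if __name__ == "__main__":
    sample = [
        Ticket(True, True, False),
        Ticket(True, True, True),
        Ticket(True, False, False),
        Ticket(False, True, False),
    ]
    print(resolution_metrics(sample))
```

A pure deflection count would score the reopened ticket in this sample as a win; counting only tickets that stay closed does not.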
What this unlocks for support teams
The honest version of the pitch is that this changes the unit economics of running a support function. A team that was previously sized to handle X tickets per agent per day now has an AI layer doing the routine work, with humans focused on the calls where empathy, judgment, or business context is irreducible. Cost per resolution drops. Time-to-first-response goes from minutes or hours to seconds. Coverage extends to off-hours and additional languages without headcount.
The less obvious part is that the customer experience changes shape. A customer who can resolve a complex issue in chat, end-to-end, in under a minute, comes away with a different impression of your product than one who waited two hours for a templated email. Support stops being a cost center to be minimized and starts being a surface to be invested in.
If you are running a support team and you are still on a 2024-era chatbot, the right question is not whether to upgrade. It is which agent to ship, on which model, with which actions wired up first.
You can build one on Berrydesk in an afternoon: pick the model, train it on your sources, brand the widget, wire up the AI Actions, deploy. Routine traffic routes to the cheap open-weight models, escalations go to the frontier, and you keep both the cost curve and the quality curve.
Chirag Asarpota is the founder of Strawberry Labs, the team behind Berrydesk - the AI agent platform that helps businesses deploy intelligent customer support, sales and operations agents across web, WhatsApp, Slack, Instagram, Discord and more. Chirag writes about agentic AI, frontier model selection, retrieval and 1M-token context strategy, AI Actions, and the engineering it takes to ship production-grade conversational AI that customers actually trust.



