
The AI customer service market is on track to pass the mid-teens in billions of dollars this year, and almost every support team has bought into the category in some form. The problem is that buying AI and running AI well are two completely different exercises, and most companies are stuck between them.
I spend most of my week at Berrydesk talking to support leaders who are either evaluating an AI agent for the first time or already running one and quietly disappointed in the results. The story rarely changes. They picked a tool, pointed it at their FAQ page, dropped the widget on the homepage, and then waited for ticket volume to fall. It didn't. Their inbox still looks the same. Their customers still escalate. Their team is still drowning on Mondays.
The frustrating truth is that the underlying technology is no longer the bottleneck. In 2026 the models are dramatically better than what most teams started planning around 18 months ago. The bottleneck is the strategy wrapped around the model - what data you train on, what actions the agent can take, when it hands off to a human, and what you measure once it's live. This piece is about getting that strategy right.
The "chatbot" frame is holding you back
Before anything else, the language has to change. When a stakeholder hears "chatbot," they picture one of those 2019-era widgets that could barely recognize "what are your hours" before collapsing into a fallback menu. That mental model isn't just outdated - it's actively shaping budgets, expectations, and roadmaps in the wrong direction.
What's actually shipping today are AI agents, and the difference matters because the capability gap is enormous. A modern AI agent reasons over context, remembers what was said three turns ago, pulls live data from your systems, executes multi-step actions like issuing a refund or rescheduling a booking, and decides on its own when a human needs to step in. None of that was reliably possible at consumer scale even a year ago.
Part of what changed is the model layer underneath. Frontier closed models like GPT-5.5, Claude Opus 4.7, and Gemini 3.1 Ultra are now genuinely good at multi-step reasoning, with Claude Opus 4.7 leading SWE-bench Pro at 64.3% and Gemini 3.1 Pro topping GPQA Diamond at 94.3%. Open-weight frontier releases - DeepSeek V4, Moonshot Kimi K2.6, Z.ai's GLM-5.1, Alibaba's Qwen 3.6 family, MiniMax M2.7, Xiaomi's MiMo-V2-Pro - have collapsed the cost of running production agents to fractions of a cent per resolution. If you are still shopping for a "chatbot," you are solving a 2026 problem with a 2020 lens, and you are almost certainly going to underbuy.
The handful of numbers that actually matter
I will spare you the 60-stat infographic. There are really four data points worth anchoring on when you are sizing this opportunity for your team.
The first is that roughly 88% of contact centers are using some form of AI, but only about 25% have fully integrated it. That gap is the entire story. Almost everyone has dipped a toe in. Almost nobody has wired AI deeply into their workflow, their backend systems, and their measurement stack. The companies winning are the ones who closed that gap intentionally instead of declaring victory after a pilot.
The second is the unit economics. Self-service interactions land somewhere around $1.84 per contact compared to roughly $13.50 for an agent-assisted ticket. That math is seductive - until you realize traditional self-service only resolves about 14% of issues end to end. The cost savings show up only when the AI can finish the job, which is where modern agentic models pull ahead of older deflection-style bots.
The third is preference. Around 79% of consumers say they prefer humans in the abstract, but 51% prefer a bot when they want immediate service. Read that twice. Customers don't hate AI. They hate waiting and being handed around. If your AI is fast, accurate, and can actually do something, the preference flips in your favor in a hurry.
The fourth is the quiet one: support agents using AI tools handle roughly 14% more inquiries per hour. Even when the AI never touches a customer directly, it makes the humans behind the scenes faster - drafting replies, summarizing threads, pulling the right knowledge article, flagging the sentiment of a long email. For a lot of teams, the "AI copilot" use case is more valuable than the "AI agent" use case. The best deployments do both.
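To see why the unit economics in the second point depend so heavily on resolution, here is a back-of-the-envelope sketch. It uses the two cost figures quoted above and one simplifying assumption of my own, not something from the underlying data: every contact pays the self-service cost, and any contact that self-service fails to resolve then also pays the full agent-assisted cost. The function name is purely illustrative.

```python
SELF_SERVICE_COST = 1.84   # dollars per self-service contact (figure quoted above)
AGENT_COST = 13.50         # dollars per agent-assisted ticket (figure quoted above)


def blended_cost_per_contact(resolution_rate: float) -> float:
    """Average cost per contact when unresolved contacts escalate to a human.

    Assumption (illustrative only): every contact pays the self-service cost,
    and the unresolved share additionally pays the agent-assisted cost.
    """
    if not 0.0 <= resolution_rate <= 1.0:
        raise ValueError("resolution_rate must be between 0 and 1")
    return SELF_SERVICE_COST + (1.0 - resolution_rate) * AGENT_COST


for rate in (0.14, 0.40, 0.60):
    print(f"resolution {rate:.0%}: ${blended_cost_per_contact(rate):.2f} per contact")
```

Under that assumption, a 14% resolution rate works out to about $13.45 per contact, barely below the $13.50 agent-only figure, while 60% brings it to roughly $7.24. Real escalation costs will differ, but the shape of the curve is the point: the savings only appear once the AI finishes the job.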
Engagement isn't a personality, it's a resolution
Most "AI improves customer engagement" content reads like a feature checklist dressed up as advice. 24/7 availability. Faster responses. Personalization. Multilingual coverage. Yes, all true, all important - and all table stakes in 2026 rather than differentiators. None of those are why customers come back.
What actually moves engagement is something far less glamorous: resolution. Customers do not develop affection for your brand because the bot greeted them in their native language at 2 a.m. They develop affection because when their package was lost, the agent looked up the order, issued the replacement, sent a tracking link, and offered them a credit - without making them repeat their email address three times.
This is why the teams winning with AI in support are obsessing over resolution rate, not conversation volume. There is a meaningful difference between "our agent handled 10,000 conversations last month" and "our agent resolved 6,500 tickets without any human involvement last month." The first is a vanity metric. The second is a business metric, and it is the only one your CFO cares about.
A useful test: if you can't tell me what percentage of incoming tickets your agent fully closes - not just touches, not just routes, closes - you don't yet have an AI strategy. You have a deflection widget.
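If it helps to pin down the difference between "touched" and "closed," here is a minimal sketch of the two calculations. The `Ticket` fields are hypothetical; your helpdesk export will name them differently, and you may define "human involvement" more strictly than I do here.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    ticket_id: str
    touched_by_ai: bool    # the agent replied at least once
    closed: bool           # the ticket reached a resolved state
    human_involved: bool   # a human replied or took over at any point


def touch_rate(tickets: list[Ticket]) -> float:
    """Share of tickets the agent touched at all (the vanity number)."""
    if not tickets:
        return 0.0
    return sum(t.touched_by_ai for t in tickets) / len(tickets)


def resolution_rate(tickets: list[Ticket]) -> float:
    """Share of all incoming tickets the agent closed with no human involved."""
    if not tickets:
        return 0.0
    resolved = sum(
        1 for t in tickets
        if t.touched_by_ai and t.closed and not t.human_involved
    )
    return resolved / len(tickets)


sample = [
    Ticket("A", touched_by_ai=True, closed=True, human_involved=False),
    Ticket("B", touched_by_ai=True, closed=True, human_involved=True),
    Ticket("C", touched_by_ai=True, closed=False, human_involved=False),
    Ticket("D", touched_by_ai=False, closed=True, human_involved=True),
]
print(f"touched: {touch_rate(sample):.0%}, fully resolved: {resolution_rate(sample):.0%}")
```

In this toy sample the agent touched 75% of tickets but fully resolved only 25%. Note that the denominator is all incoming tickets, not just the ones the agent saw, which is the number a finance team will ask about.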
The implementation mistakes I see every single week
1. Training on the wrong data
Most teams point their agent at their public help center and call training done. The trouble is that help center articles are usually written for SEO, not for answering a real customer question. They are long, hedged, full of "depending on your plan, you may be eligible for…" disclaimers, and structured for search crawlers rather than human comprehension.
A modern AI agent will dutifully ingest all of that and start parroting the same hedging back to your customers. You need clean, direct, internal-quality answers - runbooks, decision trees, the actual policies your senior agents apply, not the customer-facing legalese. Berrydesk lets you train on docs, websites, Notion, Google Drive, and YouTube, but the source matters far more than the channel. Garbage in, polite-and-confidently-wrong garbage out.
A good exercise: pull the last 200 escalated tickets, write a one-paragraph internal answer to each one, and feed those in alongside your help center. The lift in resolution rate from this single change is usually larger than any model swap.
2. No real escalation path
Nothing destroys trust faster than a customer hitting a dead end with your AI and having no clear way to reach a human. Around 89% of consumers say they should always have the option to talk to a person - and they are right. The fix isn't to weaken your AI; it is to make handoff a first-class experience.
The strongest deployments treat the agent as a confident first responder that knows when it doesn't know. The moment confidence drops, the customer sees a clear, fast handoff path - into a live agent inside the same widget, into Slack or Discord for an internal team, or into WhatsApp for a sales rep. Crucially, the human picks up with the entire conversation context already in front of them. No "hi, can you tell me your order number again" after the customer just typed it three times.
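As a rough illustration of that handoff logic, here is a minimal sketch. Everything in it is an assumption for the sake of the example: the confidence score is presumed to come from your platform or your own scoring step, the 0.7 cutoff is arbitrary, and the `Conversation` and `Handoff` shapes are hypothetical rather than any vendor's API.

```python
from dataclasses import dataclass, field
from typing import Union

CONFIDENCE_THRESHOLD = 0.7  # hypothetical cutoff; tune it against real transcripts


@dataclass
class Conversation:
    customer_id: str
    channel: str                                   # e.g. "widget", "whatsapp"
    transcript: list[str] = field(default_factory=list)


@dataclass
class Handoff:
    customer_id: str
    channel: str                                   # human takes over in the same channel
    reason: str
    transcript: list[str]                          # full history travels with the handoff


def next_step(
    conv: Conversation,
    draft_reply: str,
    confidence: float,
    customer_asked_for_human: bool = False,
) -> Union[Handoff, str]:
    """Return the reply to send, or a Handoff when a human should take over."""
    if customer_asked_for_human:
        # The option to reach a person is always honored, regardless of confidence.
        return Handoff(conv.customer_id, conv.channel, "customer request", list(conv.transcript))
    if confidence < CONFIDENCE_THRESHOLD:
        return Handoff(conv.customer_id, conv.channel, "low confidence", list(conv.transcript))
    return draft_reply
```

The design choice worth copying is that the handoff object carries the whole transcript, so the human never has to ask for the order number again.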
3. Treating AI as a cost play instead of an experience play
If your entire business case is "we can fire three support reps," you have already lost. The teams seeing the strongest returns are framing AI as a way to deliver better, faster service at scale - and reinvesting some of the savings into letting their human team focus on the hard, high-value work that drives loyalty and expansion.
Better service compounds. Customers come back, they spend more, they tell people. That outcome is several multiples larger than the headcount line you were trying to trim. When leaders walk into AI as an experience program rather than a cost-cutting exercise, the political resistance from the support team also evaporates, because it stops being a layoff conversation.
4. Set it and forget it
AI agents need ongoing tuning, and most teams skip this entirely. Which queries are silently being answered wrong? Which topics escalate at twice the rate of the rest? Where is the agent confidently making things up? If you launched six months ago and haven't touched it since, your resolution rate has almost certainly drifted down as your product, pricing, and policy evolved out from under the training data.
Treat the agent like a new hire who needs a weekly review, not a vending machine you plug in once. A 30-minute Friday session looking at the worst conversations of the week, retraining on the gaps, and adjusting routing will outperform almost any feature upgrade.
5. Picking one model when you should be routing
This one is newer, and it's where 2026 changes the playbook. With open-weight frontier models like DeepSeek V4 Flash priced around $0.14 per million input tokens, MiniMax M2 running at roughly 8% of the cost of Claude Sonnet at twice the speed, and agentic open-weight models like Kimi K2.6 and GLM-5.1 hitting north of 58% on SWE-bench Pro, there is no reason to send every conversation to the same model.
A smart deployment routes routine, high-volume traffic - order status, password resets, simple policy questions - to a fast, cheap open-weight model, and reserves Claude Opus 4.7, GPT-5.5, or Gemini 3.1 Ultra for the hard escalations, ambiguous cases, and anything involving a refund or a regulated decision. Berrydesk supports GPT, Claude, Gemini, DeepSeek, Kimi, GLM, Qwen, MiniMax, and others precisely because the right answer is "use several, route intelligently," not "pick a winner."
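A routing rule like this can be surprisingly small. The sketch below is one possible shape, not how Berrydesk or any other platform implements it: the tier labels and intent names are hypothetical, and it assumes an upstream classifier has already tagged each request with an intent.

```python
from dataclasses import dataclass

# Hypothetical tier labels; map them to whichever models your platform exposes.
FAST_TIER = "fast-open-weight"
FRONTIER_TIER = "frontier-closed"

# Hypothetical intent names, taken from the examples in the text above.
ROUTINE_INTENTS = {"order_status", "password_reset", "simple_policy"}
SENSITIVE_INTENTS = {"refund", "regulated_decision"}


@dataclass
class Request:
    intent: str                 # assumed to come from an upstream intent classifier
    is_escalation: bool = False
    is_ambiguous: bool = False


def choose_tier(req: Request) -> str:
    """Send routine traffic to the cheap tier and everything risky to the frontier tier."""
    if req.intent in SENSITIVE_INTENTS or req.is_escalation or req.is_ambiguous:
        return FRONTIER_TIER
    if req.intent in ROUTINE_INTENTS:
        return FAST_TIER
    # Unknown intents default to the stronger model; an assumption that favors quality over cost.
    return FRONTIER_TIER


print(choose_tier(Request("order_status")))
print(choose_tier(Request("refund")))
```

Defaulting unknown intents to the frontier tier costs more at first, but it gives you a natural review queue: anything that lands there often enough is a candidate for a new routine intent.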
Long context vs. RAG: a quick aside
A common question right now is whether retrieval-augmented generation still matters when the frontier models ship with million-token context windows. Claude Opus 4.6 and Sonnet 4.6 carry a 1M-token window at no surcharge. Gemini 3.1 Ultra goes to 2M. DeepSeek V4 and Kimi K2.6 sit at 1M.
In practice, the answer for support is "use both." Long context lets the agent hold the entire conversation history, the customer's account record, and your top-level policy documents in-window without any retrieval gymnastics. RAG is still useful for surfacing the right snippet from a 500-article help center, maintaining audit trails, and keeping costs predictable. The shift is that RAG has moved from a hard architectural requirement to a tuning lever you reach for when it makes sense - not the default scaffolding around every query.
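One way to picture "use both" is a context builder that pins the always-relevant material in-window and spends whatever budget remains on retrieved help articles. The sketch below is deliberately naive and rests on several assumptions: the token estimate is a crude characters-divided-by-four heuristic rather than a real tokenizer, retrieval is plain keyword overlap rather than embeddings, and all function names are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic (about 4 characters per token); not a real tokenizer.
    return max(1, len(text) // 4)


def keyword_overlap(query: str, doc: str) -> int:
    # Naive relevance score: number of shared lowercase words.
    return len(set(query.lower().split()) & set(doc.lower().split()))


def build_context(
    query: str,
    history: list[str],
    account_record: str,
    policies: list[str],
    help_articles: list[str],
    budget_tokens: int,
    top_k: int = 3,
) -> list[str]:
    """Pin history, account record, and policies; retrieve articles into the leftover budget."""
    pinned = [*history, account_record, *policies]
    used = sum(estimate_tokens(p) for p in pinned)

    ranked = sorted(help_articles, key=lambda a: keyword_overlap(query, a), reverse=True)
    retrieved = []
    for article in ranked[:top_k]:
        if keyword_overlap(query, article) == 0:
            break  # remaining articles share no words with the query
        cost = estimate_tokens(article)
        if used + cost > budget_tokens:
            break  # retrieval is the lever that keeps cost predictable
        retrieved.append(article)
        used += cost
    return pinned + retrieved
```

The pinned material is always included, even if it alone exceeds the budget; only the retrieved portion is trimmed. The list of retrieved articles also doubles as a simple record of which sources were put in front of the model for a given answer.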
What to look for in a platform (skip the feature matrix)
Comparison matrices full of green checkmarks are nearly useless. Here are the questions that actually separate viable platforms from theater.
Can it take actions, or only answer questions? A support agent that can look up an order, process a return, update an address, check shipping status, book an appointment, or take a payment is an order of magnitude more useful than one that can only chat. If your agent cannot reach into your backend, you have built a fancier FAQ page. Berrydesk's AI Actions are designed for exactly this - bookings, payments, lookups, and custom workflows wired into your existing systems.
How many channels does it cover, natively? Customers don't live in your website widget. They are on WhatsApp, Slack, Discord, email, and inside your product. An agent that only ships as a website embed leaves most of your engagement opportunities on the table. Look for a platform where the same agent, with the same training and the same actions, deploys cleanly across all of them.
How fast can you actually ship? Some platforms take a quarter of pilot time before you go live. Others take a long afternoon. Speed matters because the only way you learn what works is by watching real customers use it. The faster you launch a v1, the faster you start the iteration loop that ultimately drives resolution rate up.
What does the escalation flow look like? Get specific with the vendor: when the agent isn't confident, what happens next? How fast does a human pick up? Is the entire conversation history in front of them automatically? Can the human take over inside the same channel? Most platforms wave their hands here, and customers feel it.
Can you actually see what's happening? Conversation count is a useless top-line. You need full-resolution analytics: resolution rate, escalation rate, CSAT split by AI vs. human, topic clustering for the conversations that went poorly, and the ability to read individual transcripts to audit quality. If you can't audit quality, you can't improve it.
Does it handle data residency and on-prem if you need it? For regulated industries, the rise of MIT- and Apache-licensed open-weight frontier models - GLM-5.1, Qwen3.6-27B, Xiaomi's MiMo - has finally made on-prem and air-gapped support deployments viable without giving up frontier-class quality. If that's on your roadmap, ask about it now, not after you sign.
Common pitfalls that don't get talked about enough
A few things I see go wrong even at teams that are otherwise doing this well.
The first is over-personality. Some teams spend weeks tuning the agent's voice, emoji usage, and signature sign-off, and zero weeks tuning what it can actually do. Personality is the icing. Resolution is the cake. Get the cake right first.
The second is hiding the AI. There is a temptation to disguise the agent as a human to avoid friction. Don't. Customers can tell, and when they figure it out - and they always figure it out - the trust loss is far worse than if you had simply said "Hi, I'm an AI agent, here's what I can do, and here's how to reach a human." Transparency is a feature.
The third is launching without a measurement plan. Decide in advance what success looks like - resolution rate north of some threshold, CSAT within a few points of human-only baseline, escalation rate dropping over time, cost per resolution under a target - and instrument for it from day one. Without that, every internal conversation about the AI becomes a vibes argument.
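A measurement plan can be as simple as a handful of targets written down before launch and checked every week. Here is a minimal sketch of that idea. The threshold values are placeholders I made up for illustration, not recommendations or benchmarks, and the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Targets:
    # Placeholder values for illustration only; set your own before launch.
    min_resolution_rate: float = 0.50
    max_csat_gap: float = 3.0             # points below the human-only baseline
    max_cost_per_resolution: float = 3.00


@dataclass
class WeeklyMetrics:
    resolution_rate: float
    ai_csat: float
    human_csat: float
    cost_per_resolution: float


def evaluate(metrics: WeeklyMetrics, targets: Targets) -> dict[str, bool]:
    """Return a pass/fail flag for each target agreed in advance."""
    return {
        "resolution_rate": metrics.resolution_rate >= targets.min_resolution_rate,
        "csat_gap": (metrics.human_csat - metrics.ai_csat) <= targets.max_csat_gap,
        "cost_per_resolution": metrics.cost_per_resolution <= targets.max_cost_per_resolution,
    }


def escalation_rate_dropping(weekly_rates: list[float]) -> bool:
    """True when the latest escalation rate is below the first one recorded."""
    return len(weekly_rates) >= 2 and weekly_rates[-1] < weekly_rates[0]


week = WeeklyMetrics(resolution_rate=0.62, ai_csat=86.0, human_csat=88.0, cost_per_resolution=2.10)
print(evaluate(week, Targets()))
print(escalation_rate_dropping([0.40, 0.36, 0.31]))
```

Once the flags exist, the weekly conversation about the agent becomes "which target failed and why" rather than a debate about impressions.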
The bottom line
AI for customer support works. The data is overwhelming. But "works" has a precise meaning: it works when you build it as a resolution engine, not a deflection tool. It works when you train it on real, internal-quality answers and keep retraining it. It works when you let it take real actions in your backend. It works when you pair it with humans rather than trying to replace the team outright. And in 2026, it works far better when you route across multiple models - open-weight for volume, frontier closed models for the hard calls - instead of betting the deployment on a single LLM.
Industry analysts are forecasting that organizations will replace 20–30% of service agent capacity with generative AI over the next couple of years, though notably about half of the companies that planned aggressive workforce reductions are now walking those plans back. The teams that win this decade will not be the ones who fired the most reps fastest. They will be the ones who used the technology to deliver service their competitors literally cannot match at any price.
If you're evaluating AI for your support team right now, start with one concrete question: what are the five most common reasons customers contact us, and can this platform resolve at least three of them end to end without a human? If the answer is yes, you've found something worth piloting. If the answer is "well, it can suggest some help articles," keep looking.
Berrydesk is built for exactly this kind of deployment. You pick the model - or several - train the agent on your own data in a few minutes, brand the widget, wire up AI Actions for the workflows that matter, and ship it to your website, Slack, Discord, WhatsApp, and the rest. You can build an agent for free at berrydesk.com and watch your resolution rate before you commit to anything bigger. No sales call required.
Chirag Asarpota is the founder of Strawberry Labs, the team behind Berrydesk - the AI agent platform that helps businesses deploy intelligent customer support, sales and operations agents across web, WhatsApp, Slack, Instagram, Discord and more. Chirag writes about agentic AI, frontier model selection, retrieval and 1M-token context strategy, AI Actions, and the engineering it takes to ship production-grade conversational AI that customers actually trust.



