
An AI agent that answers questions is a feature. An AI agent you can actually measure is an operation. The gap between those two states is where most support teams stall in 2026 - they ship the bot, traffic flows in, and three months later nobody can answer whether it is making customers happier, saving real money, or quietly leaking trust.
This guide is the playbook we hand to teams who want to close that gap. It walks through the analytics that matter for an AI support agent today, why each metric is worth a slot on your dashboard, and the traps that catch teams who optimize the wrong number. The vocabulary has not changed much in the last few years - engagement, resolution, satisfaction, cost - but the mechanics absolutely have, because the model layer underneath has changed beyond recognition.
A 2024-era chatbot built on GPT-4 had different failure modes than a 2026 agent built on GPT-5.5 Pro, Claude Opus 4.7, or DeepSeek V4. Long-context windows, native tool use, and dramatically cheaper inference shift what you should expect from your agent and what you should be watching. We will weave that context throughout the guide.
Why analytics decide whether your AI agent earns its keep
Plenty of teams treat analytics as a reporting chore - a screenshot for the QBR, a number to put in the board deck. That is a mistake. For an AI support agent, the analytics layer is the only honest signal you have about three things: whether your customers are getting helped, whether your business is paying a fair price for that help, and whether the agent is drifting in ways your team cannot otherwise see.
A few reasons to take it seriously from day one:
Effectiveness is not self-evident. Modern models are confident, fluent, and sometimes wrong. A poorly trained agent on Claude Opus 4.7 will sound exactly as polished as a well-trained one. Without resolution and fallback metrics, you will not catch the difference until customers are already frustrated.
Experience is a moving target. What "good" looks like in your support conversations evolves with your product, your seasonality, and your customer base. Tracking satisfaction and conversation patterns over time is how you notice that the agent that worked great in February has started leaking goodwill by May.
Unit economics are now a real lever. Open-weight frontier models from DeepSeek, Z.ai, Moonshot, and MiniMax have collapsed the cost floor for production support agents. DeepSeek V4 Flash sits at roughly $0.14 per million input tokens; MiniMax M2 runs at about 8% of the price of Claude Sonnet at twice the throughput. The cost difference between a well-routed agent and a lazily routed one can be 10x or more. If you are not measuring cost per resolution, you are not pulling that lever.
Drift is the silent killer. Knowledge bases change. Pricing pages update. Refund policies shift. An agent that was 92% correct in January is not automatically 92% correct in May. Continuous measurement is how you catch the slope before it becomes a cliff.
With that motivation set, here are the metrics worth a permanent home on your dashboard, organized into four buckets: engagement, conversation quality, channel, and cost.
Engagement metrics: is anyone actually using this thing
Engagement metrics tell you about reach and adoption. They do not tell you whether the agent is good - but they are how you spot whether it is invisible, ignored, or overwhelmed.
Total conversations
Count every distinct conversation the agent has handled in a given window: day, week, month. This is the most basic adoption signal. Plot it as a trendline rather than a single number, because the shape matters more than the absolute value. A flat line on a growing site means the widget is not being discovered. A spike correlated with a marketing campaign tells you about funnel impact. A weekend trough on a B2B account tells you who your real users are.
When you split this by entry point - homepage widget, in-product help, Slack channel, WhatsApp number - you also get a sense of which surfaces deserve more product investment.
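If your conversation exports land in anything tabular, the split is a short script. A minimal sketch in Python with pandas, assuming one row per conversation with started_at and entry_point columns - both names are placeholders for whatever your export actually uses:

```python
import pandas as pd

# Assumed schema: one row per conversation, with a start timestamp and the surface it began on.
conversations = pd.read_csv("conversations.csv", parse_dates=["started_at"])

# Weekly trendline of total conversations - the shape matters more than any single number.
weekly_volume = conversations.resample("W", on="started_at").size()

# The same trendline split by entry point (widget, in-product help, Slack, WhatsApp, ...).
by_entry_point = (
    conversations
    .groupby([pd.Grouper(key="started_at", freq="W"), "entry_point"])
    .size()
    .unstack(fill_value=0)
)

print(weekly_volume.tail())
print(by_entry_point.tail())
```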
Engagement rate
Of the people who open the chat, what percentage actually send a message and continue past the first turn. This filters out tire-kickers and gives you a cleaner read on whether your widget copy, default greeting, and suggested prompts are pulling their weight. A low engagement rate is rarely a model problem; it is almost always a UX or positioning problem on the launcher itself.
A good baseline to chase is 40–60% for a public-facing widget; signed-in product surfaces tend to run higher because intent is stronger.
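The formula itself is just opens versus continued conversations, but it is worth pinning down what counts as continued. A tiny sketch, with the event counts as illustrative placeholders:

```python
def engagement_rate(chat_opens: int, engaged_conversations: int) -> float:
    """Share of chat opens where the user sent a message and continued past the first turn."""
    return engaged_conversations / chat_opens if chat_opens else 0.0

# Illustrative numbers: 5,200 widget opens, 2,340 of which became real conversations.
print(f"{engagement_rate(5200, 2340):.0%}")  # 45% - inside the 40-60% band for a public widget
```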
Average conversation length
Track the mean and the distribution of messages per conversation. Length is a double-edged signal. A two-turn conversation might mean the agent answered perfectly on the first try, or it might mean the user gave up. A twelve-turn conversation might mean a rich, helpful interaction, or it might mean the agent kept missing the point and the user kept rephrasing.
Read this metric alongside resolution rate and CSAT. The combination tells you the story: short and resolved is great, short and unresolved is abandonment, long and resolved is acceptable, long and unresolved is the bug you need to fix.
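One way to make that combined read concrete is to drop every conversation into one of the four quadrants and watch how the mix shifts over time. A sketch, assuming each conversation record carries a message count and a resolved flag - both field names are illustrative:

```python
from collections import Counter

# Assumed shape: (message_count, resolved) per conversation.
conversations = [(2, True), (2, False), (14, True), (11, False), (3, True)]

SHORT_THRESHOLD = 4  # tune this to your own median conversation length

def quadrant(message_count: int, resolved: bool) -> str:
    length = "short" if message_count <= SHORT_THRESHOLD else "long"
    outcome = "resolved" if resolved else "unresolved"
    return f"{length}/{outcome}"

# short/resolved is great, short/unresolved is abandonment,
# long/resolved is acceptable, long/unresolved is the bug to fix.
print(Counter(quadrant(n, r) for n, r in conversations))
```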
Interaction frequency and recurrence
How often do the same users come back. For a support agent on a B2B SaaS product, recurrence is healthy - it means people trust the agent enough to make it a habit. For a one-time-purchase ecommerce flow, recurrence might mean the agent failed the first time and the user is trying again. Segment your recurrence numbers by user type and by use case before you read into them.
You should also watch the time-of-day and day-of-week distribution. Most teams discover the agent's biggest value zone is the after-hours window where human agents are offline. If 60% of your AI traffic is between 6pm and 8am local time, that is a number worth showing to the executive who wrote the budget.
Conversation quality metrics: is the agent actually good
Engagement tells you about the funnel. Conversation quality metrics tell you whether the agent earns the trust the funnel is sending its way.
Resolution rate
The single most important number on the dashboard. Resolution rate is the percentage of conversations where the user got what they came for, without needing a human. There are several ways to measure it:
- Self-reported: ask the user at the end of the conversation whether their issue was resolved. Cheap to implement, but biased - happy users answer more often than frustrated ones.
- Behavioral: infer resolution from outcome signals. Did the user stop messaging? Did they not open a ticket in the next 24 hours? Did they not return with the same intent?
- LLM-judged: have a separate model read the transcript and score whether the user's stated goal was met. With Claude Opus 4.7 or GPT-5.5 in the loop, this has gotten remarkably reliable as long as you tune the rubric.
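For the LLM-judged variant, the rubric does most of the work. A minimal sketch of a judging loop, assuming an OpenAI-compatible chat completions client; the rubric wording and the judge-model name are placeholders you would tune for your own traffic:

```python
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint for whichever judge model you use

RUBRIC = """You are grading a customer support conversation.
Answer RESOLVED only if the user's stated goal was met without needing a human.
Answer UNRESOLVED if the user gave up, was escalated, or got a wrong or partial answer.
Reply with exactly one word: RESOLVED or UNRESOLVED."""

def judge_resolution(transcript: str, model: str = "judge-model") -> bool:
    # model is a placeholder - point it at the judge model you actually trust
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": transcript},
        ],
    )
    return response.choices[0].message.content.strip().upper() == "RESOLVED"

transcripts = ["..."]  # load real transcripts from your conversation store
resolved = sum(judge_resolution(t) for t in transcripts)
print(f"LLM-judged resolution: {resolved / len(transcripts):.0%}")
```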
A solid production AI agent in 2026, built on a frontier or strong open-weight model and trained on a real knowledge base, should be hitting 70–85% resolution on routine support traffic. Below 60%, you have a content or routing problem. Above 90%, you are either truly excellent or your scoring is too generous - go double-check.
Fallback and escalation rate
The fallback rate is the percentage of user messages the agent could not handle and either punted with a generic response or escalated to a human. Treat fallbacks as a gift: each one is a labeled gap in your knowledge base or an AI Action you have not yet wired up to a real workflow.
Bucket fallbacks by intent. The same five categories usually account for 80% of them - unsupported language, edge-case product questions, account-specific lookups the agent cannot perform, billing disputes, and policy exceptions. Each bucket has a different fix. Language gaps are a model or prompt issue. Account lookups are an AI Actions issue. Policy exceptions are a content issue. Treating "fallback rate" as one number hides where the real work is.
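The bucketing itself is just tagging and counting. A sketch, assuming each fallback event already carries an intent label from your router or a cheap classifier - the category names below mirror the five above:

```python
from collections import Counter

# Assumed input: one intent label per fallback event.
fallback_intents = [
    "unsupported_language", "billing_dispute", "account_lookup",
    "account_lookup", "policy_exception", "edge_case_product", "account_lookup",
]

counts = Counter(fallback_intents)
total = sum(counts.values())

# Each bucket has a different owner: prompts, AI Actions, or knowledge base content.
for intent, n in counts.most_common():
    print(f"{intent:<22} {n:>3}  ({n / total:.0%} of fallbacks)")
```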
CSAT and conversation-level satisfaction
A simple post-conversation thumbs up / thumbs down, optionally followed by a one-line "what went wrong," will get you 80% of the value of any sophisticated CSAT system. Track the rolling thirty-day score and watch the trend, not the daily number. Daily noise in CSAT is mostly composition - a few angry users on a slow day swing the percentage hard.
The signal you really care about is the qualitative one underneath the score. Pipe negative-feedback transcripts into a weekly review session. The patterns you see - "the agent kept telling me to email support@" - are usually fixable in an afternoon.
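For the rolling score itself, a time-based rolling mean keeps you from reacting to daily composition noise. A sketch with pandas, assuming one row per rated conversation with a timestamp and the thumbs up/down stored as 1/0 - the column names are placeholders:

```python
import pandas as pd

# Assumed columns: rated_at (timestamp), thumbs_up (1 or 0).
ratings = pd.read_csv("csat.csv", parse_dates=["rated_at"])
ratings = ratings.sort_values("rated_at").set_index("rated_at")

# The daily score is noisy; the 30-day rolling mean is the trend worth watching.
daily = ratings["thumbs_up"].resample("D").mean()
rolling_30d = daily.rolling("30D", min_periods=7).mean()

print(rolling_30d.tail())
```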
Conversion or task completion
If your agent is doing more than answering questions - booking a demo, processing a refund, scheduling an appointment, taking a payment - measure how often the action actually completes. This is where 2026's agentic models earn their premium. Tool-use accuracy on Claude Opus 4.7, Kimi K2.6, GLM-5.1, and Qwen3.6 has moved AI Actions from demoware into production, but you only know it is working in your context if you measure end-to-end completion rather than just intent capture.
Track this funnel: intent recognized → tool called → tool succeeded → user confirmed outcome. Drop-offs at each stage point at different problems: a recognition gap means the prompt or routing layer needs work, a tool-call failure means the integration is brittle, a confirmation failure means the user did not understand what just happened.
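Keeping the funnel honest is easiest when each stage is an explicit flag on the conversation record. A sketch, assuming you log the four stage booleans per agentic conversation - the field names are illustrative:

```python
# Assumed per-conversation flags for an agentic task (booking a demo, issuing a refund, ...).
conversations = [
    {"intent_recognized": True, "tool_called": True, "tool_succeeded": True, "user_confirmed": True},
    {"intent_recognized": True, "tool_called": True, "tool_succeeded": False, "user_confirmed": False},
    {"intent_recognized": False, "tool_called": False, "tool_succeeded": False, "user_confirmed": False},
]

STAGES = ["intent_recognized", "tool_called", "tool_succeeded", "user_confirmed"]

# A drop between adjacent stages points at a specific fix:
# recognition gap -> prompt/routing, tool failure -> integration, confirmation gap -> UX copy.
total = len(conversations)
for stage in STAGES:
    reached = sum(c[stage] for c in conversations)
    print(f"{stage:<18} {reached}/{total}  ({reached / total:.0%})")
```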
Containment rate
The percentage of conversations that ended without ever being handed off to a human. Containment is not the same as resolution - a contained conversation might end because the user gave up. Read it alongside CSAT and resolution to make sure you are not just suppressing escalations. The right containment number for most teams is 60–80%; chasing higher than that usually trades quality for the metric.
Channel metrics: where does it land best
Berrydesk agents deploy to a website widget, in-product surfaces, Slack, Discord, WhatsApp, email, and more. Once you have more than one channel live, you need per-channel versions of every metric above, because the failure modes differ.
Channel engagement and volume
Some channels are inherently asynchronous. A WhatsApp conversation might span three days; a website widget conversation rarely lasts five minutes. Comparing raw volume across channels is misleading without normalization. Compare engagement rate, resolution rate, and CSAT on a per-channel basis, and let raw volume tell you about reach.
Channel-specific resolution
The same agent backed by the same knowledge base will often resolve at different rates on different channels. Slack users typically write longer, more technical messages and get higher resolution rates. WhatsApp users send shorter, more colloquial messages and the agent has to do more work to disambiguate. If your resolution rate on WhatsApp is twenty points below your widget number, the fix is usually a channel-specific system prompt that handles the messier input style - not a model change.
Channel-specific satisfaction
A satisfaction score on a public-facing widget is a different beast from one inside a Slack workspace where every user is signed in and accountable. Read each channel's satisfaction against itself over time, not against the others.
Handoff quality by channel
If the agent escalates, how clean is the handoff. Does the human get the full transcript, the customer's identity, the relevant order or account, and a one-line summary of what was tried. Channels vary in how easy this is to do well. A poor handoff is a worse experience than no AI at all, so this is worth measuring explicitly: the percentage of escalations where the human's first reply repeats something the agent already asked.
Cost metrics: the unit economics nobody wanted to think about until they had to
This is the bucket that has changed the most since 2024. The cost story for AI support has flipped from "is this affordable" to "where do we want to spend." Here are the metrics that let you make that choice deliberately.
Cost per conversation and cost per resolution
Compute the all-in cost of a conversation: model inference, retrieval, any tool calls, and platform overhead. Then divide by total conversations for cost per conversation, and by resolved conversations for cost per resolution. The latter is the honest unit economic number, because an unresolved conversation that still costs you tokens is pure waste.
For a typical Berrydesk deployment routing routine traffic to DeepSeek V4 Flash or MiniMax M2, cost per resolution lands in the single-digit cents. For a deployment that defaults every conversation to Claude Opus 4.7 or GPT-5.5 Pro, it can be 20–50x higher. Both can be the right answer depending on your traffic - but only if you are looking at the number.
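The arithmetic is simple; the discipline is making sure every line item is in it. A sketch of the all-in calculation - every number here is an illustrative placeholder, not a quote of any provider's pricing:

```python
# Illustrative monthly inputs - swap in your own numbers.
input_tokens = 180_000_000        # tokens sent to the model(s)
output_tokens = 25_000_000        # tokens generated
input_price_per_m = 0.14          # $ per 1M input tokens (placeholder)
output_price_per_m = 0.28         # $ per 1M output tokens (placeholder)
tool_and_retrieval_cost = 220.0   # embeddings, search, tool-call side effects
platform_overhead = 300.0         # hosting, observability, platform fees

total_conversations = 42_000
resolved_conversations = 31_500

inference_cost = (input_tokens / 1e6) * input_price_per_m + (output_tokens / 1e6) * output_price_per_m
all_in_cost = inference_cost + tool_and_retrieval_cost + platform_overhead

print(f"cost per conversation: ${all_in_cost / total_conversations:.4f}")
print(f"cost per resolution:   ${all_in_cost / resolved_conversations:.4f}")
```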
Model routing efficiency
Once you accept that different conversations deserve different models, the next question is whether your routing is working. Track what percentage of conversations are handled by your low-cost default model, what percentage escalate to the frontier model, and how the resolution rate compares between the two tiers. The shape you want is a high default-share with similar resolution rates across tiers - that means your router is sending only the genuinely hard conversations up the cost curve. If your frontier-tier resolution is dramatically higher, your default model needs better training or your router is too conservative; if it is dramatically lower, you are spending money escalating things that did not need it.
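In practice this is a per-tier group-by over the same conversation records. A sketch, assuming each record notes which tier ultimately answered and whether the conversation resolved - field names are illustrative:

```python
from collections import defaultdict

# Assumed records: (tier, resolved) per conversation.
records = [("default", True), ("default", True), ("default", False),
           ("frontier", True), ("default", True), ("frontier", False)]

stats = defaultdict(lambda: {"n": 0, "resolved": 0})
for tier, resolved in records:
    stats[tier]["n"] += 1
    stats[tier]["resolved"] += resolved

# Healthy shape: a high default share with resolution rates that stay close across tiers.
total = len(records)
for tier, s in stats.items():
    print(f"{tier:<9} share={s['n'] / total:.0%}  resolution={s['resolved'] / s['n']:.0%}")
```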
Deflection value
Estimate how much the agent saved you by handling conversations a human otherwise would have. The naive math is resolved conversations × loaded cost per human-handled conversation. The honest math discounts heavily for conversations a human would never have seen anyway - late-night curiosity questions, FAQ lookups, product research from anonymous visitors. Be generous on the discount. A defensible deflection number is one your CFO will not pull apart.
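A defensible version of that math makes the discount explicit rather than burying it. A sketch with illustrative numbers, assuming you can estimate what share of resolved conversations a human realistically would have handled:

```python
# Illustrative inputs - replace with your own support ops numbers.
resolved_conversations = 31_500
loaded_cost_per_human_conversation = 6.50   # $ fully loaded per human-handled conversation
human_would_have_seen = 0.45                # the discount: share a human realistically would have handled

naive_deflection = resolved_conversations * loaded_cost_per_human_conversation
honest_deflection = naive_deflection * human_would_have_seen

print(f"naive deflection:  ${naive_deflection:,.0f}")
print(f"honest deflection: ${honest_deflection:,.0f}")  # the number to put in front of the CFO
```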
Knowledge base efficiency
How many fallbacks and unresolved conversations point at the same handful of missing knowledge base entries. With a 1M-token context window now standard on Claude Opus 4.6, Sonnet 4.6, and DeepSeek V4, RAG has shifted from a hard requirement to a tuning lever. You can stuff more of the knowledge base directly into context, which makes "where did the agent miss" a more direct question: it usually means the source content is wrong, missing, or contradictory, not that retrieval failed.
Common pitfalls to watch for
A few traps to avoid as you build out the dashboard:
Optimizing for containment over resolution. It is easy to push containment up by making the agent more reluctant to escalate. That makes the number look great and the customers angrier. Always tie containment to a CSAT and resolution check.
Reading CSAT in isolation. Daily CSAT is noisy. Weekly or monthly is more meaningful. And always read it alongside qualitative feedback - a 4.2 average with one rage-quit per day is a different problem than a 4.2 average with mild lukewarm feedback across the board.
Trusting LLM-judged metrics blindly. Even the best judge models hallucinate. Spot-check 2% of conversations manually every week, especially the ones the judge marked as "resolved." It is a small habit that catches scoring drift before it becomes embarrassing.
Ignoring the long tail of fallbacks. Most teams fix the top three fallback intents and stop. The next twenty intents in aggregate often dwarf the top three. Set a recurring review cadence for the fallback log.
Comparing models without holding the prompt steady. A common mistake is "we tried GPT-5.5 and Claude Opus 4.7 and Claude won." Often, what actually happened is that the prompt happened to be better tuned for one. Hold the prompt, the knowledge base, and the routing constant when you A/B models, and run for at least a week of real traffic before drawing a conclusion.
Forgetting the evaluation set. Build a stable set of 100–300 representative conversations with known-good answers. Run it against your agent on every meaningful change - model swap, prompt update, knowledge base refresh. This catches regressions that production metrics will only show you days later.
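The evaluation set can be as plain as a JSONL file of question and expected-answer pairs plus the same judge you already use for resolution scoring. A minimal harness sketch, where ask_agent and judge_answer are hypothetical caller-supplied functions wrapping your agent endpoint and your judge model:

```python
import json

def run_eval(eval_path: str, ask_agent, judge_answer) -> float:
    """Replay a fixed evaluation set against the current agent and return the pass rate.

    Assumes a JSONL file of {"question": ..., "expected": ...} records, plus two
    caller-supplied functions: ask_agent(question) -> answer and
    judge_answer(question, expected, answer) -> bool.
    """
    with open(eval_path) as f:
        cases = [json.loads(line) for line in f]
    passed = sum(judge_answer(c["question"], c["expected"], ask_agent(c["question"])) for c in cases)
    return passed / len(cases)

# Run on every meaningful change - model swap, prompt update, knowledge base refresh -
# and compare against the last known-good pass rate before shipping.
```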
Open-weight vs frontier vs routed: a measurement-driven choice
One reason cost metrics matter so much in 2026 is that you now have three meaningfully different deployment shapes, and your analytics are how you choose between them.
Frontier-default. Every conversation goes to GPT-5.5 Pro, Claude Opus 4.7, or Gemini 3.1 Ultra. Highest quality ceiling, highest cost, simplest to operate. Right for low-volume, high-stakes traffic - enterprise sales, complex billing, anything where one bad answer is expensive.
Open-weight-default. Every conversation goes to DeepSeek V4, GLM-5.1, Qwen3.6, or MiniMax M2. Lowest cost, often very strong on routine support, and Apache- or MIT-licensed variants make on-prem and air-gapped deploys viable for regulated industries. Right for high-volume, mostly-routine traffic.
Routed. A cheap, fast model handles the first turn or two; a frontier model is invoked when the conversation looks hard, when an AI Action is being executed, or when the cheap model is uncertain. Most production deployments end up here. The metric that tells you whether your routing is good is the comparison of resolution rate and CSAT across tiers - they should be close, with the frontier tier handling a small but consequential slice.
You cannot pick between these shapes on intuition. You pick between them by looking at the numbers - cost per resolution, resolution rate by tier, CSAT delta - and letting the data tell you what your traffic actually wants.
Choosing analytics tooling
The right analytics platform for an AI support agent in 2026 looks fairly different from the chatbot dashboards of a few years ago. A few features to insist on:
Conversation-level visibility, not just aggregates. You need to be able to open a specific conversation, read the full transcript, see which model answered each turn, see which tools were called and what they returned, and see the resolution and satisfaction signals attached. Aggregates without drill-down are a dead end.
Per-model and per-route breakdowns. If you are routing across DeepSeek V4 Flash, Claude Opus 4.7, GPT-5.5, and Kimi K2.6, your analytics need to show you each leg's behavior independently. A single "the agent did X" view is not enough.
Real-time and retrospective both. Real-time matters for alerting on regressions - fallback spike, latency outage, integration broken. Retrospective matters for the weekly review where you actually improve things.
Customer-grain joins. You should be able to join conversation analytics to customer attributes - plan, tenure, NPS, ARR - to see which segments are getting served well and which are not.
Action-readiness. The best analytics tools surface specific recommendations: "these five intents account for 40% of unresolved conversations, and these source documents do not cover them." Berrydesk is built around this loop, but the principle holds whatever you use.
Putting it into practice
Tracking metrics is the easy part. Building a habit of acting on them is what separates AI support deployments that compound from ones that plateau.
A simple cadence that works for most teams:
Daily: glance at volume, fallback rate, CSAT, and any error alerts. Five minutes. The point is to notice anomalies, not to fix anything.
Weekly: review the fallback log, sample fifteen to twenty unresolved conversations, update the knowledge base or prompts, and look at cost per resolution against last week. Forty-five minutes for one person.
Monthly: rerun your evaluation set against the current agent, look at the trendlines on every metric in this guide, and have a real conversation about whether the routing setup still matches the traffic shape. Two hours for the team.
Quarterly: take a hard look at your model lineup. The frontier moves fast - at the time of writing, DeepSeek V4 and Kimi K2.6 are weeks old, GLM-5.1 is barely two months out, GPT-5.5 launched in April. Whatever your routing decisions were six months ago, they probably want a refresh.
Done consistently, this loop turns an AI agent from a thing you launched into a thing that improves every month - measurably, defensibly, and at a unit cost that gets harder to argue with as the open-weight ecosystem keeps pushing prices down.
If you want a platform that ships these analytics natively - resolution and fallback tracking, per-model cost breakdowns, conversation-level transcripts, AI Actions instrumentation, and channel-by-channel views across web, Slack, Discord, and WhatsApp - Berrydesk is built for exactly this loop. Pick a model, point it at your docs and tools, deploy where your customers are, and start measuring on day one. Build your agent at berrydesk.com.
Launch a measurable AI support agent in minutes
- Pick from GPT-5.5, Claude Opus 4.7, Gemini 3.1, DeepSeek V4, Kimi K2.6, and more
- Built-in analytics for resolution, deflection, CSAT, and cost per conversation
Chirag Asarpota is the founder of Strawberry Labs, the team behind Berrydesk - the AI agent platform that helps businesses deploy intelligent customer support, sales and operations agents across web, WhatsApp, Slack, Instagram, Discord and more. Chirag writes about agentic AI, frontier model selection, retrieval and 1M-token context strategy, AI Actions, and the engineering it takes to ship production-grade conversational AI that customers actually trust.



