Claude Opus 4.7 in Production Support: What Anthropic's Flagship Does Best in 2026

Anthropic's Claude has spent the last three years quietly becoming the default model for teams that care about long documents, careful writing, and tool use that actually works. As of May 2026, the family is led by Claude Opus 4.7 for hard reasoning and complex coding, and Claude Sonnet 4.6 as the everyday workhorse - both shipping with a 1M-token context window at no surcharge.

This post is a practical look at what Claude does well right now, where it lags, and how it stacks up against the rest of the 2026 field - GPT-5.5, Gemini 3.1, and the open-weight wave coming out of DeepSeek, Moonshot, Z.ai, Alibaba, MiniMax, and Xiaomi. We'll close on what all of that means if you're picking a model to run a real customer-support agent.

What Claude is in 2026

The Claude line has converged on a clean split. Opus 4.7 is the flagship - slow, expensive per token, and genuinely strong at multi-step problems. It currently leads SWE-bench Pro at 64.3%, the practical benchmark for whether a model can actually finish complex software-engineering tasks rather than just look smart on isolated questions. Sonnet 4.6 is the default daily-driver: faster, cheaper, and tuned to be the model you'd happily put behind ninety percent of an agent's traffic. Both inherit the 1M-token context window that Anthropic rolled out across the line, which means an entire policy manual, a year of conversation history, and a product catalogue all fit in a single call.

There's also Haiku at the small end for latency-sensitive workloads, but most teams settle on Sonnet 4.6 for volume and Opus 4.7 for the cases that need it.

What it's good at

Long-context reasoning. A 1M-token window changes how you build with the model. You can drop an entire knowledge base, the customer's full ticket history, and your refund policy into a single prompt and let the model reason over the lot. Retrieval becomes a tuning lever for cost, not a hard requirement.
Tool use and agentic workflows. Opus 4.7's tool-calling is reliable enough to run real production actions - bookings, refunds, lookups, payment flows - without the brittleness that made earlier "agent" demos feel like party tricks.
Writing and editing. Claude still produces the most controlled, on-brand prose of the closed-frontier models. It's the one most teams use for help-centre rewrites and structured customer replies.
Code. SWE-bench Pro at 64.3% is meaningful: it means Opus 4.7 can finish realistic engineering tasks end-to-end. For a support agent that has to inspect a webhook payload or generate a working API snippet, that translates into real reliability.
Safety and steerability. Anthropic's training approach makes Claude noticeably easier to keep on-policy. For regulated industries - finance, healthcare, insurance - that's not a soft preference, it's a procurement requirement.
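
One way to treat retrieval as a cost lever, as the long-context point above suggests, is to pack the must-have material first and fill whatever budget remains with knowledge-base documents. The sketch below is only an illustration. The `Doc` type, the `pack_context` function, the priority convention, and the four-characters-per-token estimate are all assumptions, not Anthropic's tokenizer or API.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    name: str
    text: str
    priority: int  # lower number = more important (hypothetical convention)


def estimate_tokens(text: str) -> int:
    """Very rough estimate: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def pack_context(required: list[Doc], optional: list[Doc], budget: int) -> list[Doc]:
    """Always include required docs, then add optional docs by priority
    for as long as they still fit inside the token budget."""
    chosen = list(required)
    used = sum(estimate_tokens(d.text) for d in chosen)
    if used > budget:
        raise ValueError("required documents alone exceed the token budget")
    for doc in sorted(optional, key=lambda d: d.priority):
        cost = estimate_tokens(doc.text)
        if used + cost <= budget:
            chosen.append(doc)
            used += cost
    return chosen


# Hypothetical inputs: the active policy is required, KB articles are optional.
required = [Doc("refund_policy", "x" * 400, priority=0)]
optional = [Doc("kb_article_a", "y" * 800, priority=2),
            Doc("kb_article_b", "z" * 400, priority=1)]

small = pack_context(required, optional, budget=250)
large = pack_context(required, optional, budget=1_000_000)
```

With a tight budget only the policy and the higher-priority article fit. With a 1M-token budget everything goes in, which is the point at which retrieval becomes a cost decision rather than a necessity.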

Where it falls short

Live web access isn't core. Claude can be wired to web tools, but it isn't a browse-first product. If your use case is real-time research or breaking-news synthesis, Gemini 3.1 Ultra is the more natural pick.
Image and video generation. Claude reads images, charts, and diagrams competently, but it doesn't generate images or video. GPT-5.5 and Gemini 3.1 cover that ground better.
Price per token at the top. Opus 4.7 is premium-priced. For high-volume, low-complexity traffic - order status, password resets, shipping FAQs - you'll want a cheaper model in front of it.

Claude vs. the 2026 field

Claude vs. GPT-5.5

GPT-5.5 and GPT-5.5 Pro (with parallel reasoning) launched in April 2026 and remain the broadest-capability product, especially when you factor in image generation, voice, and the surrounding tool ecosystem. On SWE-bench Pro coding, Claude Opus 4.7 holds the lead. On general-purpose breadth and multimodal output, GPT-5.5 wins. For customer support specifically, the practical difference often comes down to which model your team finds easier to keep on-script, and Claude tends to win that contest.

Claude vs. Gemini 3.1

Gemini 3.1 Ultra ships with a 2M-token context window - twice Claude's - and is natively multimodal across text, image, audio, and video. Gemini 3.1 Pro leads GPQA Diamond at 94.3%, the strongest score among closed-frontier models on graduate-level reasoning. If you need to ingest call recordings, screen captures, or video walkthroughs, Gemini is the cleaner fit. If you need careful, controlled written responses with strong tool use, Claude is the better choice.

Claude vs. the open-weight frontier

This is where the picture has changed most since 2024. A serious open-weight tier now exists, and for a customer-support deployment its main appeal is cost:

DeepSeek V4 (April 2026) - V4 Pro is a 1.6T-param MoE with 49B active; V4 Flash is a 284B/13B-active variant. Both ship with a 1M context. V4 Flash is priced at $0.14 / $0.28 per million input/output tokens and is open source, which makes it the default cheap-tier model for routing routine tickets.
Moonshot Kimi K2.6 (April 2026) - 1T-param agentic-first MoE that can run 12-hour autonomous coding sessions and coordinate up to 300 sub-agents across 4,000 steps. 58.6 on SWE-Pro. Open weights.
Z.ai GLM-5.1 (April 2026) - 754B MoE under MIT license, scoring 58.4 on SWE-Pro and beating Claude Opus 4.6 (57.3) on that benchmark. Trained on Huawei Ascend chips, no Nvidia.
Alibaba Qwen 3.6 - the dense 27B variant under Apache 2.0 outperforms 397B-param MoE rivals on agentic coding benchmarks. Strong local-deploy story.
MiniMax M2 / M2.7 - 230B/10B-active open-weight MoE, priced at roughly 8% of Claude Sonnet at twice the speed. M2.7 hits 56.22% SWE-Pro.
Xiaomi MiMo-V2-Pro - >1T total / 42B active, 1M context, MIT-licensed weights.
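
To make the cost story concrete, here is a back-of-the-envelope calculation using the DeepSeek V4 Flash prices quoted above ($0.14 input, $0.28 output per million tokens). The ticket volume and tokens-per-ticket figures are invented for illustration, so swap in your own traffic numbers.

```python
def cost_usd(input_tokens: int, output_tokens: int,
             input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost in USD given token counts and per-million-token prices."""
    return (input_tokens / 1_000_000) * input_price_per_m \
         + (output_tokens / 1_000_000) * output_price_per_m


# Prices from the list above (USD per million tokens).
V4_FLASH_INPUT = 0.14
V4_FLASH_OUTPUT = 0.28

# Hypothetical monthly traffic, not measured data.
tickets_per_month = 100_000
input_tokens_per_ticket = 3_000   # prompt + policy snippet + history
output_tokens_per_ticket = 500    # the reply

monthly = cost_usd(
    tickets_per_month * input_tokens_per_ticket,
    tickets_per_month * output_tokens_per_ticket,
    V4_FLASH_INPUT,
    V4_FLASH_OUTPUT,
)
print(f"Estimated monthly cost: ${monthly:,.2f}")
```

Under those assumptions, 300M input tokens cost $42 and 50M output tokens cost $14, so roughly $56 a month for the routine tier.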

Claude Opus 4.7 still leads on absolute SWE-Pro and on the kind of nuanced writing that closed frontier models do best. But the open-weight tier has caught up enough that running everything through Opus is no longer the obvious move - even when budget isn't an issue, Sonnet 4.6 plus a routed open-weight model often beats Opus-everywhere on latency and cost without losing meaningful quality.

What this means for customer support

If you're choosing a single model for a support agent, Claude Sonnet 4.6 is one of the strongest defaults in the market. It's fast enough for chat, smart enough for nuanced policy questions, and Anthropic's safety training makes it forgiving when a customer phrases something in an unexpected way.

But "single model" is increasingly the wrong frame. The better pattern in 2026 is routing: send routine, high-volume traffic to a cheap open-weight model like DeepSeek V4 Flash or MiniMax M2.7, escalate to Sonnet 4.6 for anything that needs careful judgement, and reserve Opus 4.7 for edge cases - long policy reasoning, complex troubleshooting, anything where a wrong answer is expensive. That gives you the cost profile of an open model on the bulk of your traffic and the reliability of frontier closed models on the cases that matter.
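
The routing pattern described above can be sketched as a small rule-based function. Treat this as a starting point, not a reference implementation: the model identifier strings, the `Ticket` fields, the list of routine intents, and the 200,000-token threshold are all hypothetical, and a real deployment would probably use a classifier or confidence signal in place of hard-coded rules.

```python
from dataclasses import dataclass

# Hypothetical model identifiers; use whatever names your provider expects.
CHEAP = "deepseek-v4-flash"
DEFAULT = "claude-sonnet-4.6"
PREMIUM = "claude-opus-4.7"

# Routine, high-volume intents that a cheap model can handle.
ROUTINE_INTENTS = {"order_status", "password_reset", "shipping_faq"}


@dataclass
class Ticket:
    intent: str
    context_tokens: int
    high_stakes: bool = False  # e.g. a wrong answer would be expensive


def route(ticket: Ticket, long_context_threshold: int = 200_000) -> str:
    """Pick a model tier: premium for edge cases, cheap for routine, default otherwise."""
    if ticket.high_stakes or ticket.context_tokens > long_context_threshold:
        return PREMIUM
    if ticket.intent in ROUTINE_INTENTS:
        return CHEAP
    return DEFAULT


print(route(Ticket("order_status", 1_200)))                    # routine traffic
print(route(Ticket("billing_dispute", 5_000)))                 # needs judgement
print(route(Ticket("order_status", 1_200, high_stakes=True)))  # escalated
```

Checking the high-stakes condition first means a routine-looking intent can still escalate when the cost of a wrong answer is high.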

Common pitfalls when deploying Claude for support

Treating long context as a substitute for retrieval. A 1M window is a powerful tool, but stuffing every document into every call is wasteful and slow. Use long context for the conversation, the active policy, and the customer's history; keep the rest of your knowledge base in retrieval.
Picking Opus where Sonnet is enough. Opus 4.7 is the sharpest model in the family, but Sonnet handles most support volume at a fraction of the price. Default to Sonnet, escalate to Opus on signal.
Ignoring tool-call evaluation. Agentic tool use is reliable now, but "reliable" still means you need evals. Before you put bookings or refunds in the model's hands, run a few hundred edge cases through it and measure.
One model for every channel. Voice, web chat, and async email have different latency and quality budgets. The right model for each is rarely the same one.
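
For the tool-call evaluation point, a harness does not need to be elaborate to be useful. The sketch below assumes the agent can be wrapped as a function that takes a customer message and returns either a tool call (as a plain dict) or `None`. The `stub_agent`, the `issue_refund` tool name, and the test cases are made up for illustration; in practice you would replace the stub with a real model call and load a few hundred cases from a file.

```python
from typing import Callable, Optional

# A tool call is a plain dict such as {"tool": "issue_refund", "args": {}};
# None means "the agent should not call any tool".
ToolCall = Optional[dict]


def evaluate_tool_calls(agent: Callable[[str], ToolCall],
                        cases: list[tuple[str, ToolCall]]) -> dict:
    """Run each case through the agent and record exact-match failures."""
    failures = []
    for message, expected in cases:
        actual = agent(message)
        if actual != expected:
            failures.append({"message": message, "expected": expected, "actual": actual})
    total = len(cases)
    passed = total - len(failures)
    return {
        "total": total,
        "passed": passed,
        "accuracy": passed / total if total else 0.0,
        "failures": failures,
    }


def stub_agent(message: str) -> ToolCall:
    """Stand-in for a real model call: refunds anything that mentions 'refund'."""
    if "refund" in message.lower():
        return {"tool": "issue_refund", "args": {}}
    return None


cases = [
    ("I want a refund for my order", {"tool": "issue_refund", "args": {}}),
    ("What is your refund policy?", None),  # a question, so no action expected
    ("Where is my parcel?", None),
]

report = evaluate_tool_calls(stub_agent, cases)
print(report["passed"], "of", report["total"], "cases passed")
for failure in report["failures"]:
    print("FAILED:", failure["message"])
```

The deliberately naive stub fails the policy question, which is exactly the kind of edge case worth catching before refunds are live.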

Where Berrydesk fits

Reviews of individual models are useful, but most support teams don't actually want to pick one. They want their agent to use whichever model fits the moment.

Berrydesk is built around that idea. You launch a branded support agent in four steps: pick a model; train it on your docs, websites, Notion, Google Drive, or YouTube; brand the widget; and deploy to your site, Slack, Discord, or WhatsApp. The model layer is open. You can run on Claude Opus 4.7 or Sonnet 4.6 if Anthropic's safety profile and writing quality are what you care about. You can route routine traffic to DeepSeek V4 Flash, MiniMax M2.7, or Qwen 3.6 to drive cost down. You can keep Gemini 3.1 in reserve for multimodal cases or GPT-5.5 for breadth. AI Actions handle the bookings, refunds, and lookups; the model picks the words.

If you've been waiting for the model field to settle before deploying, it has, in the sense that you no longer have to bet on one. Build your agent for free and try Claude alongside the rest.
