
The "big three" closed-source assistants - OpenAI's ChatGPT, Anthropic's Claude, and Google's Gemini - have spent the last three years trading places at the top of the leaderboard. In May 2026, the matchup looks very different than it did at launch. GPT-5.5 and GPT-5.5 Pro have introduced parallel reasoning. Claude Opus 4.7 is winning the hardest software-engineering benchmarks. Gemini 3.1 Ultra ships with a 2M-token context window and native video understanding. And open-weight challengers like DeepSeek V4, Kimi K2.6, GLM-5.1, Qwen 3.6, and MiniMax M2.7 are pulling the price floor out from under all of them.
If you are choosing a model to power a customer-facing chatbot - whether it's a sales agent, a support bot, or an internal knowledge assistant - the differences between the frontier three matter. They handle reasoning differently. They write code differently. They have very different opinions about what counts as a "natural" tone. And they cost different amounts to run at scale.
This piece walks through how the current generation of GPT, Claude, and Gemini compares across the four jobs that come up most in production support work: logical reasoning, common-sense judgment, coding, and creative writing. We finish with how Berrydesk lets you mix and match - including the open-weight options - instead of marrying a single vendor.
Logical Reasoning: Where the Models Earn Their Keep
A surprising amount of customer-support work is logic in disguise. "If my order arrived damaged, but I bought it as a gift and the receipt was sent to my address, who needs to file the return?" That is a riddle written in business clothes. The model that handles classic logic puzzles cleanly tends to handle this kind of policy reasoning cleanly too.
We tried the same brain-teaser on all three: "How is it possible for a doctor's son's father not to be a doctor?" The trick is that the puzzle nudges you to assume the doctor must be the father - nothing in it says so, and the doctor could be the mother.
Claude Opus 4.7 walked through the reasoning step by step, surfaced the implicit gender bias the puzzle is designed to expose, and explained why the puzzle is hard before giving the answer. That extra layer - naming the trap rather than just sidestepping it - is genuinely useful when an agent is helping a confused customer untangle a refund eligibility question.
GPT-5.5 took a tighter path. It identified the answer, gave a concise justification, and stopped. With GPT-5.5 Pro we noticed something different: the parallel reasoning surfaces multiple candidate framings before committing, which makes it especially strong on policy questions where two interpretations both look reasonable.
Gemini 3.1 Pro matched both on accuracy and was the most compact of the three, stating the answer cleanly and stopping there. Gemini 3.1 Ultra, with its 2M-token context, does noticeably better on chains of reasoning that have to thread through long policy documents - a place where smaller windows still force a RAG-style chunking compromise.
After a dozen logic puzzles of escalating difficulty, the takeaway is this: all three are reliably correct on standard logic now. The differentiator is how they communicate the reasoning. Claude tends to teach. GPT-5.5 tends to deliver. Gemini tends to compress. For a support context, the right pick depends on whether your end users want the answer or want to understand the answer.
Common-Sense Reasoning: Catching the Trick Question
Logical reasoning measures how well a model navigates a well-posed problem. Common sense measures whether it notices the problem is malformed. Customers ask malformed questions constantly - they leave out details, assume context, or describe a scenario that contradicts itself.
The classic stress test is a riddle like: "If a spaceship from Mars breaks into two parts, with one part crashing in the Atlantic Ocean off Brazil and the other in the Pacific Ocean off Japan, where do you bury the survivors?" The right answer, of course, is that you don't bury survivors.
GPT-5.5 caught the trap immediately and answered briefly: survivors aren't buried. It is the most efficient of the three at this kind of question - no preamble, no overlong explanation.
Claude Opus 4.7 also caught the trap, and additionally flagged that the question appears engineered to push a reader toward debating coordinates and jurisdictions. That meta-awareness - naming the manipulation rather than just refusing it - has improved steadily across the Claude 4.x line, and Opus 4.7 is the cleanest version of it we've seen.
Gemini 3.1 Pro produced the most thorough answer of the three: it identified the trick, then went further to question the premise itself (humans surviving an interplanetary crash and reentry), and offered a more sensible reframe. That sort of "let me question the question" behavior is useful when a customer says something like "my subscription cancelled itself but I'm still being charged" - the agent needs to push back politely and pull more information.
In practice, the gap on common sense between the frontier three has nearly closed compared to a year ago. What's left is stylistic: GPT-5.5 is terse, Claude is teacherly, Gemini is investigative. None of them get embarrassed by basic gotchas anymore.
Coding: From Toy Apps to Real Repos
Coding ability matters even for chatbots that never write a line of code, because the same skill - holding a structured task in working memory and reasoning over many interrelated parts - is what makes an AI Action work. When a Berrydesk agent has to look up an order, check eligibility against a policy, call a refund API, and then notify a Slack channel, that is a chain of reasoning that benefits from the same training signal that produces good code.
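To make that concrete, here is a minimal sketch of what such a chain looks like as code. Every function and name here is a hypothetical stub - this is not Berrydesk's Actions API - but the shape is the point: each step's output gates the next, and a model trained to produce code like this is also better at executing the equivalent tool-call sequence.

```python
from dataclasses import dataclass

# Hypothetical stubs standing in for real integrations - illustrative only.
@dataclass
class Order:
    id: str
    total: float
    days_since_delivery: int

def lookup_order(order_id: str) -> Order | None:
    return Order(id=order_id, total=42.00, days_since_delivery=5)  # stub

def check_refund_eligibility(order: Order) -> tuple[bool, str]:
    # Policy check: assume a 30-day return window for illustration.
    if order.days_since_delivery > 30:
        return False, "outside the 30-day return window"
    return True, ""

def call_refund_api(order: Order) -> float:
    return order.total  # stub: pretend the payment provider refunded it

def notify_slack(channel: str, message: str) -> None:
    print(f"[{channel}] {message}")  # stub: a real agent would POST to Slack

def handle_refund_request(order_id: str) -> str:
    order = lookup_order(order_id)                      # 1. look up the order
    if order is None:
        return "I couldn't find that order - can you check the number?"
    eligible, reason = check_refund_eligibility(order)  # 2. check policy
    if not eligible:
        return f"That order isn't refundable: {reason}."
    amount = call_refund_api(order)                     # 3. issue the refund
    notify_slack("#support-refunds",                    # 4. notify the team
                 f"Refunded ${amount:.2f} on order {order_id}")
    return f"Done - ${amount:.2f} is on its way back to your card."

print(handle_refund_request("ORD-1042"))
```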
We ran two tests. First, a from-scratch task: "Build a small React to-do app with persistence, undo, and keyboard shortcuts." Second, a repository task: hand the model a half-dozen Python files from a working project and ask it to identify exactly which lines need to change to add a new feature without breaking existing tests.
On the from-scratch task, GPT-5.5 produced a clean, working app on the first try, including the keyboard shortcut behavior that the other two missed. Codex on the GPT-5 stack is now strong enough that "first try compiles and runs" is the expected outcome rather than a pleasant surprise.
Claude Opus 4.7 leads the public SWE-bench Pro leaderboard at 64.3%, and on visual and front-end work it is still our top pick. Its first attempt at the to-do app needed a small fix on the undo behavior, but the markup, the styling, and the accessibility scaffolding were the most polished of the three. Where Claude really pulls ahead is when the brief is "make this feel good" - the model has visibly better taste in spacing, color, and motion.
Gemini 3.1 Pro held its own, especially when the brief involved video, image, or audio assets - its native multimodality means it can be handed a screenshot of a broken UI and produce a fix without an OCR step. On pure code-from-scratch it is roughly a half-step behind GPT-5.5 and Claude Opus 4.7.
On the multi-file repository task, the picture changes again. With a 1M-token context window now standard on Claude Opus 4.6 and Sonnet 4.6 (no surcharge), and 2M on Gemini 3.1 Ultra, you can paste an entire mid-sized service into the prompt and ask the model to plan the change in-context. That removes most of the friction RAG used to cause for code work. Claude was the strongest at "find every place that needs to change," followed by GPT-5.5; Gemini handled the task but tended to suggest broader rewrites than necessary.
A note on the open-weight side, because it is now genuinely competitive on agentic coding: GLM-5.1 scores 58.4 on SWE-bench Pro, ahead of both GPT-5.4 (57.7) and Claude Opus 4.6 (57.3). Kimi K2.6 hits 58.6 and can run autonomous coding sessions of up to twelve hours. Qwen3.6-27B, a dense Apache-licensed model, beats some 397B-parameter MoE rivals on agentic coding and is small enough to deploy on-prem. For an enterprise that wants AI Actions running inside its own VPC, those numbers are a different conversation than they were even six months ago.
Coding takeaway
- GPT-5.5 - best for greenfield code where you want it to work the first time.
- Claude Opus 4.7 - best for visual and front-end work and for navigating large existing codebases. Top SWE-bench Pro score among closed models.
- Gemini 3.1 Ultra - best when the task involves images, video, or audio in the loop, or when you need a 2M-token window.
- Open-weight (GLM-5.1, Kimi K2.6, Qwen 3.6) - best when cost, latency, or data residency matters more than the last 5% of quality.
Creativity: Voice, Tone, and the "Doesn't Sound Like a Bot" Test
The creativity test most useful for support work isn't "write a poem" - it's "rewrite this canned macro so it doesn't sound like a canned macro." The model that handles this well is the one that will keep your brand voice consistent across thousands of conversations a day.
Our prompt was the standard creative stress test: "Write a rhyming rap song about Elon Musk's life and achievements." It's a useful probe because it forces the model to balance facts, structure, and voice all at once.
GPT-5.5 produced a tight, structurally clean set of verses on the first attempt. Rhyme scheme intact, biographical beats hit accurately, hooks well-placed. If you need creative work that lands inside a rigid container - a tweet, a meta description, a 50-character SMS - GPT-5.5 is the most reliable.
Claude Opus 4.7 wrote the most interesting verses. It had a real point of view on the subject, balanced the achievements against the controversies without caricature, and the phrasing felt like it came from a writer rather than a generator. It needed two passes to lock the rhyme scheme tightly; that's the trade-off - Claude tends to optimize for voice over format, then tighten.
Gemini 3.1 Pro produced a competent, accurate, slightly stiff rap. The facts were right and the rhymes worked, but the wordplay was thinner. Gemini's strength shows up more in long-form analytical writing than in tightly constrained creative formats.
Across a half-dozen creative tasks - including writing a return-policy email in three different brand voices and rewriting a technical FAQ for a Gen Z audience - the same pattern held:
- GPT-5.5 is the most reliable at hitting a brief on the first try.
- Claude Opus 4.7 produces the most natural, least "AI-sounding" prose, which is why it tends to win blind A/B tests on customer-facing copy.
- Gemini 3.1 is solid and dependable but rarely surprising.
What This Means for a Customer Support Agent
In a support context, you are rarely picking one of these models for "everything." A well-designed agent routes traffic by intent. A simple "where is my order" lookup does not need GPT-5.5 Pro's parallel reasoning; it needs a fast, cheap model that calls one API and answers in under a second. A complex billing dispute that has to weigh policy, history, and tone might genuinely benefit from Claude Opus 4.7. A multilingual escalation that involves attaching screenshots is squarely Gemini 3.1 Ultra territory.
The economics make this routing strategy obvious. DeepSeek V4 Flash runs at $0.14 per million input tokens and $0.28 per million output - fractions of a cent per typical resolution. MiniMax M2.7 is roughly 8% the price of Claude Sonnet at twice the speed, and its agentic loop is strong enough to handle structured tool calls. For high-volume, low-stakes traffic, those models do the job at a cost structure the frontier closed models cannot match. Reserve Claude Opus 4.7, GPT-5.5, and Gemini 3.1 Ultra for the conversations where the stakes justify the spend.
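In code, the routing layer can start embarrassingly simple. The sketch below uses keyword matching and hand-assigned tiers purely for illustration - the model names are the ones discussed above, but the intents, tier assignments, and classifier are assumptions, and a production router would classify intent with a small, cheap model and tune the table against measured traffic.

```python
# A deliberately simple intent router. The tier assignments and the keyword
# classifier are illustrative assumptions, not a recommended configuration.
ROUTES = {
    "order_status":    "deepseek-v4-flash",  # high volume, low stakes: cheap and fast
    "billing_dispute": "claude-opus-4.7",    # policy + history + tone: frontier
    "visual_bug":      "gemini-3.1-ultra",   # screenshots in the loop: multimodal
}
DEFAULT_MODEL = "minimax-m2.7"               # everything else: cheap agentic loop

def classify_intent(message: str) -> str:
    # Keyword matching stands in for a real classifier model here.
    text = message.lower()
    if "where is my order" in text or "tracking" in text:
        return "order_status"
    if "charge" in text or "refund" in text or "invoice" in text:
        return "billing_dispute"
    if "screenshot" in text or "looks broken" in text:
        return "visual_bug"
    return "other"

def route(message: str) -> str:
    return ROUTES.get(classify_intent(message), DEFAULT_MODEL)

print(route("Where is my order #1234?"))  # -> deepseek-v4-flash
print(route("I was charged twice"))       # -> claude-opus-4.7
```

The table, not the classifier, is the valuable part: when the leaderboard shifts, you rebalance by editing three strings instead of rewriting the agent.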
Pitfalls to watch for
A few traps come up over and over when teams pick a model for their support agent:
- Defaulting to the most capable model for every turn. Routing routine traffic to a reasoning-heavy frontier model can cost 50–100x what a flash variant would. Build the routing layer before you scale.
- Confusing context window with memory. A 1M-token window holds a lot, but it does not by itself give the agent persistent memory of past conversations. You still need a real session and history layer - see the sketch after this list.
- Treating long context as a substitute for retrieval. With Claude Opus 4.6 and Sonnet 4.6 at 1M tokens at no surcharge, you can dump a full knowledge base into the prompt - but for any base over a few thousand pages, retrieval still wins on cost and on focus. Long context is a tuning knob, not a replacement.
- Locking into a single vendor. The leaderboard has shuffled three times in the last twelve months. The team that built on a single model in mid-2025 has had to migrate twice already. Build behind an abstraction.
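On the second pitfall, here is a minimal sketch of what a "real session and history layer" means in practice. The in-memory dict is purely illustrative - production would persist to a database keyed the same way - but it shows the separation: memory lives outside the model, and the context window only sees what you choose to replay into it.

```python
from collections import defaultdict

class SessionStore:
    """Illustrative session/history layer. The in-memory dict stands in for
    a real database; the replay limit of 20 turns is an arbitrary assumption."""

    def __init__(self, max_replay_turns: int = 20):
        self.history: dict[str, list[dict]] = defaultdict(list)
        self.max_replay_turns = max_replay_turns

    def append(self, customer_id: str, role: str, content: str) -> None:
        self.history[customer_id].append({"role": role, "content": content})

    def context_for(self, customer_id: str) -> list[dict]:
        # Replay only recent turns into the prompt. Older turns could be
        # summarized rather than dropped - an option, not shown here.
        return self.history[customer_id][-self.max_replay_turns:]

store = SessionStore()
store.append("cust-42", "user", "My subscription cancelled itself but I'm still charged.")
store.append("cust-42", "assistant", "Let me pull up your billing history.")
print(store.context_for("cust-42"))
```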
Long Context vs RAG: A Quick Aside
A common question in 2026: with 1M and 2M-token context windows now standard at the frontier, do you still need RAG? The honest answer is "less than you used to, but yes."
Long context handles the case where the entire body of relevant information fits in the window. For a 200-page product manual and a 50-page returns policy, that is now true on Claude Opus 4.6, Sonnet 4.6, GPT-5.5, Gemini 3.1, and DeepSeek V4. You can paste the whole thing in.
RAG still wins when the corpus is genuinely large (a multi-product company with thousands of help articles), when the retrieval has to be auditable (regulated industries that need to log exactly which document grounded a response), or when token cost matters at volume. The pattern that's emerging is hybrid: retrieve aggressively to narrow the candidate set, then drop the top 5–10 documents whole into a long-context model that has the headroom to reason across them. Berrydesk's training pipeline supports both - feed it docs, sites, Notion, Drive, or YouTube and the right strategy is applied per agent.
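In code, that hybrid looks something like the sketch below. The keyword scorer, the function names, and the top-k of 8 are illustrative assumptions - a real deployment would use a vector index and whichever long-context model the router picks - but the shape is the point: retrieval narrows, long context reasons.

```python
# Hybrid pattern: retrieve aggressively to narrow the candidate set, then
# hand the top documents *whole* to a long-context model. All names here
# are illustrative stubs, not a real retrieval or model API.

def retrieve_top_k(query: str, corpus: list[dict], k: int = 8) -> list[dict]:
    # Keyword-overlap scoring stands in for a vector index.
    def score(doc: dict) -> int:
        return sum(1 for word in query.lower().split()
                   if word in doc["text"].lower())
    return sorted(corpus, key=score, reverse=True)[:k]

def call_model(prompt: str) -> str:
    # Stub for whichever long-context model the router selects.
    return f"(model response over {len(prompt)} chars of grounded context)"

def answer_with_long_context(query: str, corpus: list[dict]) -> str:
    docs = retrieve_top_k(query, corpus)
    # Drop each retrieved document in whole - no chunking - so the model
    # has the headroom to reason across them.
    prompt = "\n\n".join(f"## {d['title']}\n{d['text']}" for d in docs)
    prompt += f"\n\nQuestion: {query}\nAnswer using only the documents above."
    return call_model(prompt)
```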
Open-Weight vs Closed Frontier: The Trade-Off
Three years ago this section did not need to exist. Today it does.
The frontier closed models - GPT-5.5, Claude Opus 4.7, Gemini 3.1 Ultra - still lead on the hardest reasoning tasks and on multimodal work. They have the most polished safety training, the best tool-use orchestration out of the box, and the most mature SDKs.
The open-weight frontier - DeepSeek V4, Kimi K2.6, GLM-5.1, Qwen 3.6, MiniMax M2.7, Xiaomi MiMo-V2-Pro - has closed the gap on most coding and agentic benchmarks, undercuts the closed models on price by an order of magnitude, and offers something the closed models cannot: on-prem and air-gapped deployment under MIT or Apache licenses. For regulated industries - healthcare, finance, government - that's the entire ballgame.
The right answer for a serious support deployment is rarely "one or the other." It's: closed frontier where it earns its price, open-weight where it doesn't, and a routing layer in front so you can rebalance without rewriting the agent.
How Berrydesk Handles the Choice
Berrydesk doesn't make you marry a model. You can pick from GPT-5.5 and 5.5 Pro, Claude Opus 4.7 and Sonnet 4.6, Gemini 3.1 Pro and Ultra, DeepSeek V4 (Pro and Flash), Kimi K2.6, GLM-5.1, Qwen 3.6 (including the open dense and 35B-A3B variants for self-host), and MiniMax M2.7 - and you can change the choice on a per-agent basis without rebuilding anything.
The launch flow is the same four steps regardless of which model you pick. Train the agent on your docs, websites, Notion workspace, Google Drive, or YouTube channel. Brand the chat widget so it matches your site. Wire up AI Actions for bookings, refunds, order lookups, or payment capture. Deploy to your website, Slack, Discord, WhatsApp, or wherever your customers actually are.
The model you pick today is unlikely to be the model you'd pick in twelve months. The point of building on Berrydesk is that you don't have to choose once and live with it.
If you want to see how a routed multi-model agent compares in real traffic, you can spin one up at berrydesk.com - no credit card needed, and you can swap the underlying model whenever the leaderboard shifts again.
Pick the model. We handle the rest.
- Swap between GPT-5.5, Claude Opus 4.7, Gemini 3.1, DeepSeek V4, Kimi K2.6, GLM-5.1, Qwen, and MiniMax in a click.
- Train on your docs, sites, Notion, Drive, or YouTube - deploy to web, Slack, Discord, or WhatsApp.
Set up in minutes
Chirag Asarpota is the founder of Strawberry Labs, the team behind Berrydesk - the AI agent platform that helps businesses deploy intelligent customer support, sales and operations agents across web, WhatsApp, Slack, Instagram, Discord and more. Chirag writes about agentic AI, frontier model selection, retrieval and 1M-token context strategy, AI Actions, and the engineering it takes to ship production-grade conversational AI that customers actually trust.