Voice AI for Customer Support in 2026: What Real-Time...

When OpenAI shipped the first Realtime API in late 2024, low-latency speech-to-speech with a frontier model felt like a glimpse of the future. Eighteen months later, that future is just how voice AI works. Every major provider - closed and open - now offers real-time audio, sub-second turn-taking, and emotional prosody that most callers cannot tell apart from a human agent.

For customer support teams, this is the inflection point. Voice channels that used to mean an IVR tree and a 12-minute hold queue can now route to an AI agent that understands the question, looks up the order, processes the refund, and books the follow-up call - all in one continuous conversation. This post walks through what voice AI looks like in May 2026, what it means for support, and how to actually ship one without burning a quarter of engineering time.

What changed since the first Realtime API

The original idea was simple: instead of stitching together speech-to-text, an LLM, and text-to-speech as three separate API calls, run a single bidirectional audio stream against a model that natively handles speech. That eliminated the awkward "...processing..." pause and made interruptions feel natural.

Today, that architecture is table stakes:

OpenAI's GPT-5.5 stack powers the current Realtime API with parallel reasoning, meaning the model can think about a tool call (like checking inventory) while the conversation continues. Latency on first audio is well under 300ms in most regions.
Google's Gemini 3.1 Ultra is natively multimodal across text, image, audio, and video - a single model handles a customer who shares a photo of a damaged product mid-call without handing off to anything else. Its 2M-token context lets a long, branching support call carry full session memory.
Anthropic's Claude Opus 4.7 and Sonnet 4.6 ship with a 1M-token context window at no surcharge, plus the strongest tool-use reliability on the market - the combination most production support teams want when AI Actions are doing real work.
Open-weight contenders have caught up faster than most expected. DeepSeek V4 Flash at $0.14/$0.28 per million input/output tokens makes voice-channel routing economics work even at high volume. Moonshot's Kimi K2.6 ships with native video input and agentic loops that can run for hours. Z.ai's GLM-5.1 and Alibaba's Qwen 3.6 family bring MIT/Apache-licensed weights that can run fully on-prem for regulated industries.
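
To make those rates concrete, here is a back-of-the-envelope sketch using the DeepSeek V4 Flash prices quoted above. The per-call token counts and the call volume are assumptions for illustration, not measurements, and a provider may meter audio differently from text tokens.

```python
# Rough model cost per call at the per-million-token rates quoted in the text.
INPUT_USD_PER_M = 0.14   # input price, from the text
OUTPUT_USD_PER_M = 0.28  # output price, from the text


def cost_per_call(input_tokens: int, output_tokens: int) -> float:
    """Return the model cost in USD for a single call."""
    total = input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    return total / 1_000_000


# Assumed workload: 20,000 input tokens per call (context is re-read on
# every turn) and 2,000 output tokens, at 100,000 calls per month.
per_call = cost_per_call(20_000, 2_000)
monthly = per_call * 100_000
print(f"${per_call:.5f} per call, ${monthly:.2f} per 100,000 calls")
```

Under those assumptions the model cost works out to roughly a third of a cent per call. Swap in your own token counts from real transcripts before drawing conclusions.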

The practical takeaway: a voice support agent in 2026 is not a single-model decision. It is a routing decision, where simple intents go to a fast open-weight model and complex escalations get the frontier reasoning they actually need.

What a voice agent can actually do for support

The vocabulary has shifted from "voice assistant" to "voice agent" for a reason. The earlier generation could answer questions; the current generation can act on them.

Resolve, don't just respond

A 2024-era voice bot might tell a customer "I see your order shipped on the 12th." A 2026 voice agent on Berrydesk, configured with AI Actions, can pull the order from your commerce platform, see that it was delayed, apologize specifically for the delay, offer a discount code that fits your refund policy, apply it to the customer's account, and email a confirmation - all inside a 90-second call. The model handling that flow needs reliable tool-use, which is exactly where Claude Opus 4.7, GPT-5.5, Kimi K2.6, and Qwen 3.6 are now solid.

Hold context across long sessions

A B2B support call can run 20 minutes and reference six previous tickets. With 1M-token contexts standard across the major models - and 2M on Gemini 3.1 Ultra - the agent can hold the entire customer history, your policy documents, and the current conversation in-context. RAG is no longer mandatory for medium-sized knowledge bases; it becomes a tuning lever you reach for when you need pinpoint citation, not a default architecture.
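
One way to decide between "load everything" and "reach for RAG" is a simple token budget check. The sketch below is a minimal illustration, assuming a rough four-characters-per-token heuristic and an arbitrary amount of headroom for the live conversation; the function names are hypothetical, and a real tokenizer will give different counts.

```python
CONTEXT_LIMIT_TOKENS = 1_000_000  # window size cited in the text
RESERVED_TOKENS = 50_000          # assumed headroom for the live call and replies


def estimate_tokens(text: str) -> int:
    """Crude estimate: about four characters per token for English text."""
    return len(text) // 4 + 1


def fits_in_context(documents: list[str],
                    limit: int = CONTEXT_LIMIT_TOKENS,
                    reserve: int = RESERVED_TOKENS) -> bool:
    """True if all documents fit in the window with the reserve left over."""
    total = sum(estimate_tokens(doc) for doc in documents)
    return total <= limit - reserve
```

If `fits_in_context` returns False for a customer's history plus your policy documents, that is the signal to fall back to retrieval for that call rather than loading everything.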

Hand off cleanly

The hardest part of voice AI used to be the handoff. The agent would transfer to a human, the human would get nothing but a phone number, and the customer would re-explain everything. Modern voice agents on Berrydesk pass a structured summary, the relevant documents the agent looked at, the actions already taken, and the customer's emotional state to the live agent's screen before the call connects.
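
The four pieces of that handoff map naturally onto a small data structure. The sketch below is illustrative only: the class and field names are assumptions, not Berrydesk's actual schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ActionTaken:
    """One action the AI agent already performed during the call."""
    name: str
    status: str
    detail: str = ""


@dataclass
class HandoffPacket:
    """What the live agent sees before the call connects (hypothetical shape)."""
    summary: str
    documents_consulted: list[str] = field(default_factory=list)
    actions_taken: list[ActionTaken] = field(default_factory=list)
    emotional_state: str = "neutral"

    def to_json(self) -> str:
        # asdict() recurses into the nested ActionTaken entries.
        return json.dumps(asdict(self), indent=2)


packet = HandoffPacket(
    summary="Customer reports a delayed order and wants a refund.",
    documents_consulted=["refund-policy"],
    actions_taken=[ActionTaken("order_lookup", "done", "order found, delayed")],
    emotional_state="frustrated",
)
print(packet.to_json())
```

Keeping the packet as plain serializable data means the same payload can be rendered on a screen, logged, or attached to a ticket.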

Speak the customer's language

Multilingual support has been a perennial promise. With Gemini 3.1 Ultra and GPT-5.5 fluent across most major languages and dialects in real time, and open-weight models like Qwen 3.6 and DeepSeek V4 strong in Asian languages specifically, support teams can offer 24/7 service in 30+ languages without staffing up regional teams.

Use cases that are actually production-ready in 2026

Ecommerce and retail

A direct-to-consumer skincare brand running on Berrydesk routes voice traffic from its support line to an agent trained on the product catalog, fulfillment policies, and ingredient FAQs. Routine calls - "Where is my order?", "Is this safe during pregnancy?", "I want to change my subscription frequency" - never reach a human. The agent uses MiniMax M2 for the bulk of calls (roughly 8% of the cost of Claude Sonnet at 2x the speed) and escalates to Claude Opus 4.7 when the call involves a return over $200 or any mention of an allergic reaction.
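
The escalation rule in this example can be expressed as a small routing function. This is a sketch under stated assumptions: the model identifier strings and the keyword pattern are hypothetical, and a production system would more likely use an intent classifier than a regular expression.

```python
import re
from typing import Optional

BULK_MODEL = "minimax-m2"             # hypothetical identifier
ESCALATION_MODEL = "claude-opus-4.7"  # hypothetical identifier
RETURN_THRESHOLD_USD = 200.0          # threshold from the example above

# Assumed keyword list standing in for "any mention of an allergic reaction".
ALLERGY_PATTERN = re.compile(r"\b(allerg\w*|anaphyla\w*|hives)\b", re.IGNORECASE)


def pick_model(transcript: str, return_amount_usd: Optional[float] = None) -> str:
    """Route routine calls to the cheap model and risky ones to the frontier model."""
    if return_amount_usd is not None and return_amount_usd > RETURN_THRESHOLD_USD:
        return ESCALATION_MODEL
    if ALLERGY_PATTERN.search(transcript):
        return ESCALATION_MODEL
    return BULK_MODEL
```

Because the rule is evaluated on the running transcript, it can be re-checked after every customer turn, so a call that starts as routine can still escalate mid-conversation.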

SaaS support

A mid-market data tool routes voice calls from paying customers to a Berrydesk agent trained on its docs, GitHub issues, and the last 18 months of resolved tickets. The agent uses GPT-5.5 Pro with parallel reasoning to investigate the customer's account state while the conversation continues, then either resolves the bug, files a ticket with the engineering team, or - if it can tell the customer is frustrated - books a same-day call with a senior engineer.

Healthcare and regulated industries

A regional hospital network needed a voice triage line that could not send any data to a US cloud provider. With Berrydesk's open-weight routing, they deployed Qwen 3.6-27B and GLM-5.1 on their own infrastructure, both under permissive licenses. The agent screens incoming calls, books appointments, answers insurance questions, and never leaves the hospital's network.

Field services

A commercial HVAC company put a voice agent on its dispatch line. Technicians call in from rooftops with greasy hands, describe the unit they are looking at, and the agent - running Gemini 3.1 Ultra for its multimodal capability - accepts a quick video clip the technician streams from their phone, identifies the model, pulls the service manual, and walks them through the fix. The same agent also handles calls from homeowners who want to schedule service.

Trade-offs and pitfalls to plan for

Latency budgets are unforgiving

Text chat tolerates a one-second pause. Voice does not. Anything past 400ms of silence reads as awkward, and past 800ms reads as broken. If you route to an open-weight model on your own GPUs to save cost, measure first-audio latency under realistic load, and look at the tail (p95 and p99), not just the average. A model that costs 90% less but pauses for 1.5 seconds will feel worse than a more expensive frontier model that streams immediately.
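
The sketch below shows why an average can hide a latency problem. It computes nearest-rank percentiles over first-audio latency samples and grades the 95th percentile against the 400ms and 800ms thresholds from the text. The choice of p95 as the grading point and the sample data are assumptions for illustration.

```python
import math
from statistics import mean

AWKWARD_MS = 400  # threshold from the text
BROKEN_MS = 800   # threshold from the text


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_report(samples_ms: list[float]) -> dict:
    """Summarize first-audio latency and grade it by its 95th percentile."""
    p95 = percentile(samples_ms, 95)
    if p95 <= AWKWARD_MS:
        verdict = "ok"
    elif p95 <= BROKEN_MS:
        verdict = "awkward"
    else:
        verdict = "broken"
    return {
        "mean": mean(samples_ms),
        "p50": percentile(samples_ms, 50),
        "p95": p95,
        "p99": percentile(samples_ms, 99),
        "verdict": verdict,
    }


# Synthetic example: 90% of calls answer in 200ms, 10% stall for 1.5 seconds.
report = latency_report([200.0] * 90 + [1500.0] * 10)
print(report)
```

In this synthetic sample the mean sits below the 400ms line while one caller in ten waits 1.5 seconds, which is exactly the case an average-only dashboard would miss.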

Tool calls during a call

When an agent has to look up an order or process a refund mid-call, you have a choice: keep the customer on hold, or have the model fill the silence ("Let me pull that up for you..."). Models with parallel reasoning - GPT-5.5 Pro, Claude Opus 4.7 - handle this gracefully. Smaller models tend to either freeze or talk awkwardly past the tool result. Test specifically for this.
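
For models that do not fill the silence on their own, the orchestration layer can do it. The following is a minimal asyncio sketch: it starts the tool call, and only if the result has not arrived within a short grace period does it speak a filler line. The `speak` callback, the function name, and the 0.4-second default are assumptions, not part of any particular provider's API.

```python
import asyncio
from typing import Any, Awaitable, Callable, Coroutine


async def call_tool_with_filler(
    tool_coro: Coroutine[Any, Any, Any],
    speak: Callable[[str], Awaitable[None]],
    filler: str = "Let me pull that up for you...",
    filler_after_s: float = 0.4,
) -> Any:
    """Run a tool call; speak a filler line only if it is slow to return."""
    task = asyncio.create_task(tool_coro)
    # Wait briefly. If the tool is fast, the customer never hears the filler.
    done, _pending = await asyncio.wait({task}, timeout=filler_after_s)
    if not done:
        await speak(filler)
    # Awaiting the task returns its result or re-raises its exception.
    return await task
```

A test harness can pass a `speak` stub that records what was said, then feed it a fast tool and a slow tool to check that the filler fires only for the slow one.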

Hallucinations on policy

Voice removes the customer's ability to scroll back and check a citation. If the agent invents a refund policy, the customer will believe it. Two defenses matter most: keep the actual policy document in the agent's context (which is now cheap with 1M-token windows), and have the agent make any commitment through a tool call and read back the result ("I am applying a 15% discount, code SORRY15") rather than asserting it from memory.
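
The second defense can be sketched as a rule that the spoken commitment is built only from a tool result, never from the model's free text. Everything here is hypothetical: `apply_discount` stands in for a real commerce-platform call, and the 15% cap is an assumed policy value.

```python
from dataclasses import dataclass

MAX_DISCOUNT_PCT = 15  # assumed policy cap; in practice, read it from the policy source


@dataclass
class DiscountResult:
    applied: bool
    code: str = ""
    percent: int = 0


def apply_discount(account_id: str, percent: int) -> DiscountResult:
    """Hypothetical stand-in for the backend call that enforces policy."""
    if percent <= 0 or percent > MAX_DISCOUNT_PCT:
        return DiscountResult(applied=False)
    return DiscountResult(applied=True, code=f"SORRY{percent}", percent=percent)


def commitment_line(result: DiscountResult) -> str:
    """Build what the agent says strictly from the tool result."""
    if result.applied:
        return f"I am applying a {result.percent}% discount, code {result.code}."
    return "I can't apply that discount myself. Let me connect you with someone who can help."
```

With this shape, a model that "decides" to offer 50% off produces a refusal line rather than a promise, because the backend, not the model, is the source of truth.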

Disclosure and trust

Realistic synthetic voices are a regulatory and trust question, not a technical one. Some jurisdictions now require disclosure that the customer is speaking with AI. Bake the disclosure into the agent's opening, and respect any request to be transferred to a human without making the customer fight for it.

When to keep humans on the call

Voice AI handles the long tail of routine calls beautifully. It is still the wrong choice for emotionally charged conversations - bereavement claims, account closures, complaints that have already escalated. Configure your agent to detect those signals (it can) and route to a human immediately.

How to ship this on Berrydesk

The setup is the standard four steps, with voice as one of the deploy targets:

Pick a model. Start with Claude Sonnet 4.6 or GPT-5.5 if you want a frontier baseline. If cost or sovereignty matter, route the bulk of traffic to DeepSeek V4 Flash, MiniMax M2, or Qwen 3.6, and reserve the expensive models for escalations.
Train it on your knowledge. Point Berrydesk at your help docs, your website, your Notion workspace, your Google Drive, or your YouTube tutorials. The agent indexes them and keeps them fresh.
Add AI Actions. Wire up the actions that turn a conversation into a resolution: order lookup, refund, appointment booking, payment, escalation to a human. These are the difference between a chatbot and an agent.
Deploy. The voice channel sits next to your website widget, Slack, Discord, and WhatsApp. The same agent definition handles every channel - train once, answer everywhere.

Voice is no longer a separate product or a separate model. It is a deploy target on the same agent you have already built.

If you have been waiting for voice AI to be ready, it is. The interesting question is no longer whether the technology works - it is which conversations you trust to it first, and how cleanly the rest hand off to the humans who should be having them. Try Berrydesk for free and have a voice-ready support agent live by the end of the day.
