gpt-realtime-1.5: The "Get to Work" Version of OpenAI's Voice Agents
2026-02-27 | ProductHunt | Official Site | API Docs

OpenAI Realtime API Official Image — The "Agent online" interface, signaling the voice agent is ready.
30-Second Quick Judgment
What is it?: An upgraded version of OpenAI's real-time voice model that makes AI voice agents more reliable at following instructions, calling tools, and speaking multiple languages. Simply put, it makes your AI customer service calls much less "clunky."
Is it worth your attention?: If you're building voice AI products, absolutely. gpt-realtime-1.5 offers a substantial improvement in tool-calling reliability (up over 25%), which was the biggest complaint from developers. If you're just a casual user, this update won't affect you much—it's a pure API product for developers.
Three Key Questions
Does it matter to me?
Target Audience: Developers and enterprises building or planning to build voice AI products. Specifically:
- Teams building AI call center systems.
- Developers creating voice assistants (e.g., smart ordering, appointment scheduling).
- International products requiring multilingual voice interaction.
Am I the target user?:
- Building a voice AI product? → You are the core user.
- Building a SaaS and want to add voice? → Definitely worth a look.
- Just building standard CRUD apps? → Probably doesn't matter to you.
Common Use Cases:
- Automated customer service calls → Use Realtime API + Twilio SIP.
- In-browser voice interaction → Use WebRTC.
- AI Voice Assistant Apps → Use the Agents SDK.
- No need for voice? → You don't need this.
Is it useful for me?
| Dimension | Benefit | Cost |
|---|---|---|
| Time | Saves the engineering effort of building a custom STT+LLM+TTS pipeline (at least 2-4 weeks). | Takes 1-3 days to learn the Realtime API. |
| Money | Significantly lower maintenance costs than a custom pipeline. | Audio tokens at $32/$64 per 1M are not cheap. |
| Effort | End-to-end S2S architecture reduces debugging steps. | Vendor lock-in risk; you're tied to GPT. |
ROI Judgment: If your voice product handles a few thousand calls a month, the Realtime API is a bargain—it saves you the massive cost of building and tuning an STT/TTS pipeline. However, if you're hitting millions of calls, the costs will explode, and a modular solution like Deepgram + ElevenLabs might be more economical.
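To make the ROI reasoning concrete, here is a rough per-call cost sketch using the audio prices listed later in this article ($32/$64 per 1M tokens). The tokens-per-minute figures are illustrative assumptions, not official numbers — measure your own traffic before budgeting:

```typescript
// Rough per-call cost sketch for the Realtime API, using the prices
// quoted in this article. Token rates per minute of audio are
// ILLUSTRATIVE assumptions -- calibrate against real usage data.
const AUDIO_IN_PER_1M = 32;   // $ per 1M audio input tokens
const AUDIO_OUT_PER_1M = 64;  // $ per 1M audio output tokens

// Hypothetical token rates; replace with measured values.
const TOKENS_PER_MIN_IN = 600;
const TOKENS_PER_MIN_OUT = 900;

function estimateCallCost(minutes: number): number {
  const inCost = (minutes * TOKENS_PER_MIN_IN / 1_000_000) * AUDIO_IN_PER_1M;
  const outCost = (minutes * TOKENS_PER_MIN_OUT / 1_000_000) * AUDIO_OUT_PER_1M;
  return inCost + outCost;
}

// A 5-minute support call under these assumptions:
console.log(estimateCallCost(5).toFixed(4)); // ~$0.38
```

At a few thousand calls a month this stays in the hundreds of dollars; at millions of calls it scales linearly into six figures, which is where the modular alternatives start to look attractive.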
Will I like it?
The Highlights:
- Tool calling is finally reliable: Calling tools used to be a gamble; now, the ComplexFuncBench score has jumped from 49.7% to 66.5%.
- Massive boost in alphanumeric recognition: Transcription accuracy is up 10.23%. Order numbers and phone numbers are finally being captured correctly.
- Asynchronous function calling: The AI doesn't have to sit in awkward silence while waiting for a tool to return; it can keep the conversation going with a "Just a moment while I check that."
The "Wow" Moment:
In the official demo, the model took a 7-character mixed alphanumeric order number and repeated it back perfectly—something previous versions consistently failed at. — @kwindla
Real User Feedback:
- Positive: "tool call stability optimized by over 25%, dramatically improved voice expressiveness" — @Comet (Perplexity Browser)
- Positive: "gpt-realtime-1.5 is the best native audio model on Scale AudioMultiChallenge benchmark" — @pbbakkum (OpenAI Engineer)
- Critique (Historical): "Realtime API seems pretty nerfed compared to Advanced Voice Mode" — OpenAI Community Developer
For Indie Hackers
Tech Stack
- Model: gpt-realtime-1.5 (Native speech-to-speech, not an STT+LLM+TTS pipeline)
- Protocols: WebRTC (Browser) / WebSocket (Server) / SIP (Telephony)
- Audio Encoding: Opus (WebRTC handles echo cancellation, noise reduction, and gain control)
- SDK: OpenAI Agents SDK (TypeScript-first; Python also supported)
- Context Window: 32,768 tokens, with a max output of 4,096 tokens
- Instruction + Tool Limit: 16,384 tokens
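The 16,384-token instruction budget is easy to blow past once tool schemas pile up. A minimal pre-flight check, using the crude chars/4 heuristic (an assumption on my part — use a real tokenizer in production):

```typescript
// Pre-flight check that system instructions plus tool schemas fit the
// 16,384-token instruction budget. chars/4 is a crude heuristic, NOT a
// real tokenizer -- swap in an actual tokenizer for production use.
const INSTRUCTION_TOKEN_LIMIT = 16_384;

function roughTokenCount(text: string): number {
  return Math.ceil(text.length / 4);
}

function fitsInstructionBudget(instructions: string, toolSchemas: string[]): boolean {
  const total =
    roughTokenCount(instructions) +
    toolSchemas.reduce((sum, schema) => sum + roughTokenCount(schema), 0);
  return total <= INSTRUCTION_TOKEN_LIMIT;
}

console.log(fitsInstructionBudget("You are a concise ordering assistant.", ["{}"])); // true
```

Running a check like this at deploy time is cheaper than debugging a model that silently ignores the tail of an oversized prompt.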
Core Architecture
The unique thing about gpt-realtime is its end-to-end Speech-to-Speech architecture. Traditional setups use a three-step chain: Speech-to-Text → LLM processing → Text-to-Speech. gpt-realtime collapses this into one step—the model "hears" and "speaks" audio directly, preserving tone, emotion, and non-verbal cues.
The recommended production architecture is Sideband Mode: The browser sends audio directly to OpenAI via WebRTC (for low latency), while your backend server connects to the same session via WebSocket to handle business logic (tool calls, database queries, etc.). This keeps the audio path short and your business logic private.
```
Browser      ←──WebRTC────→  OpenAI Realtime API
                                      ↕
Your Backend ←──WebSocket──→  (Same session)
```
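In sideband mode, the backend's WebSocket leg is where session configuration happens. A minimal sketch of that configuration event, assuming the Realtime API's JSON event shape (`session.update`) and a hypothetical `check_order` business tool — verify field and voice names against the current docs before relying on them:

```typescript
// Minimal sketch of the server-side half of sideband mode: the backend
// configures the shared session over WebSocket. Event and field names
// follow the Realtime API's published JSON event shape; the tool itself
// is a hypothetical example.
const sessionUpdate = {
  type: "session.update",
  session: {
    instructions: "You are a concise ordering assistant.",
    voice: "marin", // example voice name; check current availability
    tools: [
      {
        type: "function",
        name: "check_order", // hypothetical business-logic tool
        description: "Look up an order by its alphanumeric order number.",
        parameters: {
          type: "object",
          properties: { order_number: { type: "string" } },
          required: ["order_number"],
        },
      },
    ],
  },
};

// In production, the backend would send this over its sideband socket:
// ws.send(JSON.stringify(sessionUpdate));
console.log(JSON.stringify(sessionUpdate).length > 0);
```

The payoff of this split: the browser never sees the tool definitions or your API key, and the audio path stays as short as possible.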
Open Source Status
- The Model: Proprietary, API only.
- SDKs and Examples: MIT License, available on GitHub:
- openai-realtime-agents — Build a multi-agent voice app in 20 mins.
- openai-agents-js — TypeScript framework.
- openai-voice-agent-sdk-sample — Quick start examples.
- Open Source Alternatives: Qwen3-Omni (Alibaba, end-to-end multimodal, 119 languages).
- Difficulty to build yourself: Extremely high. End-to-end S2S models require massive audio datasets and compute power. However, building on top of the API is very accessible.
Business Model
- Monetization: API billed per token.
- Pricing:
- Text: $4/1M input, $16/1M output
- Audio: $32/1M input, $64/1M output
- Cached Input: $0.40/1M (97% savings! Using cache is key to controlling costs)
- Comparison: 20% price drop compared to gpt-4o-realtime-preview.
Big Tech Risks
This is an OpenAI product, but looking at the competitive landscape:
- Google has Gemini audio capabilities, but hasn't launched a direct Realtime API equivalent yet.
- Anthropic is catching up with Claude Voice; the voice wars have begun.
- Alibaba's Qwen3-Omni is the open-source disruptor.
- The real risk isn't being replaced, but rather "commoditization"—if voice AI becomes a utility like water or electricity, profit margins for developers may shrink.
For Product Managers
Pain Point Analysis
Core Problem Solved: Voice AI agents failing at critical moments.
Specifically:
- Unreliable Tool Calling — When an AI agent needs to check an order or inventory, it previously often called the wrong tool or passed incorrect parameters.
- Poor Instruction Adherence — Telling the AI "don't reply in Chinese" only for it to reply in Chinese anyway.
- Multilingual Switching — The user speaks Spanish, but the AI insists on English.
How painful is it?: High-frequency and critical. Any production-grade voice agent faces these issues; they determine whether a product is actually shippable. gpt-realtime-1.5 targets these exact pain points.
User Persona
- Early Adopters: Perplexity (integrated into Comet browser), Genspark (stress-tested for bilingual translation).
- Typical Customers: Mid-to-large enterprises needing AI phone support.
- Developer Persona: Full-stack devs with WebRTC/WebSocket experience building voice products.
Feature Breakdown
| Feature | Type | Description |
|---|---|---|
| Instruction Adherence +7% | Core | Directly impacts agent usability. |
| Enhanced Tool Calling | Core | Reliability is the barrier to production deployment. |
| Transcription Accuracy +10.23% | Core | Essential for order numbers, verification codes, etc. |
| Multilingual Accuracy | Core | A must-have for international products. |
| Async Function Calling | Nice-to-have | Keeps the conversation natural during wait times. |
| Placeholder Responses | Nice-to-have | Automatically says things like "One moment..." |
| SIP Connection | Core (Telephony) | Connects directly to traditional phone systems. |
Competitor Differentiation
| Dimension | gpt-realtime-1.5 | ElevenLabs Agents | Deepgram Agent | Vapi |
|---|---|---|---|---|
| Architecture | End-to-end S2S | Modular STT+LLM+TTS | STT/TTS + Agent API | Orchestration Middleware |
| Core Advantage | Highest naturalness, emotional sensing | Best voice quality, cloning | Fast (<300ms), low cost | Flexible vendor mixing |
| LLM Lock-in | Yes (GPT only) | No (Multi-LLM) | Partial | No (Multi-LLM) |
| Best For | High-value conversations (VIP support) | Branded voice, audio content | High-throughput transcription | Best-of-breed needs |
| Estimated Cost | High | Medium | Low | $0.13-0.31+/min |
Key Takeaways
- Sideband Architecture: Audio on the fast path, logic on the secure path—this separation of concerns is a great design pattern.
- Snapshot Versioning: Lock in model versions (e.g., gpt-realtime-1.5-2026-02-23) to ensure consistent behavior.
- Progressive Degradation: Small features like placeholder responses and idle prompts solve the "awkward AI silence" problem.
For Tech Bloggers
Founder/Key Figures
This is a core OpenAI API line, but notable figures include:
- Justin Uberti (@juberti): Engineering lead for OpenAI Realtime API and a WebRTC legend (early core engineer for Google's WebRTC project). He shared a demo number you can call: 425-800-0042.
- Charlie Guo (@charlierguo): OpenAI DevRel, who recorded the official demo showing a full "AI food ordering" flow.
- Peter Bakkum (@pbbakkum): OpenAI Engineer who shared benchmark data, calling it the "best native audio model on Scale AudioMultiChallenge."
Controversy / Discussion Angles
- The "Voice War" Narrative: Anthropic has Claude Voice, Google has Gemini, Alibaba has Qwen3-Omni, and OpenAI is fighting back with the Realtime API. This signals a total war in the AI voice space.
- API vs. Consumer Quality Gap: Developers often complain that the Realtime API isn't as good as ChatGPT's Advanced Voice Mode. Is this intentional differentiation or a technical limitation?
- Vendor Lock-in: S2S end-to-end vs. modular pipelines. Many developers prefer the Deepgram + Claude + ElevenLabs combo to avoid being locked into the OpenAI ecosystem.
- Ethics of AI Call Centers: One developer noted that "any drive-thru or call center is about to be replaced by AI voice."
Hype Metrics
- ProductHunt: 274 votes.
- Twitter: OpenAIDevs official tweet: 2109 likes, 175 reposts, 374K views.
- Ecosystem Adoption: First-day integration by Perplexity's Comet browser.
- Tech Community: Deep technical analysis available on Latent Space: "The Missing Manual".
Content Suggestions
- The Big Picture: "The AI Voice Wars of 2026: A Battle of Three Giants' Technical Roadmaps."
- Trending: "The Future of Browser Voice Interaction" (featuring Perplexity Comet).
- Hands-on: "Build a Voice Agent in 20 Minutes with the Agents SDK."
For Early Adopters
Pricing Analysis
| Tier | Price | Includes | Is it enough? |
|---|---|---|---|
| Text Input | $4/1M tokens | System instructions, text input | Cheap |
| Text Output | $16/1M tokens | Text responses | Cheap |
| Audio Input | $32/1M tokens | User voice | Expensive; the main cost driver |
| Audio Output | $64/1M tokens | AI voice response | Most expensive |
| Cached Input | $0.40/1M tokens | Repeated system instructions | 97% savings; essential |
Pro Tip: Use cached inputs ($0.40 vs $32) by designing your system instructions to be cacheable. Short, concise prompts can also drastically lower costs.
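To see why caching dominates the cost equation, here is the arithmetic for a hypothetical workload (the prompt size and call volume are illustrative assumptions), using the $0.40 vs $32 rates quoted above:

```typescript
// Why cacheable instructions matter, using this article's quoted rates:
// repeated input tokens bill at $0.40/1M instead of $32/1M.
const UNCACHED_PER_1M = 32;
const CACHED_PER_1M = 0.4;

// Hypothetical workload: a 2,000-token system prompt re-sent on each of
// 10,000 calls per month = 20M repeated tokens.
const promptTokens = 2_000;
const callsPerMonth = 10_000;
const totalTokens = promptTokens * callsPerMonth;

const uncachedCost = (totalTokens / 1_000_000) * UNCACHED_PER_1M; // $640
const cachedCost = (totalTokens / 1_000_000) * CACHED_PER_1M;     // $8
console.log(uncachedCost, cachedCost);
```

Same prompt, same traffic: $640/month uncached versus $8/month cached. The practical rule is to keep the static parts of your instructions byte-identical across sessions so they stay cache-eligible.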
The 'Mini' Option: gpt-realtime-mini costs $10/$20 per 1M tokens (about 70% cheaper), ideal for scenarios where absolute precision isn't required.
Getting Started
- Setup Time: 20 minutes (using official SDK examples).
- Learning Curve: Moderate (requires understanding WebRTC or WebSockets).
- Steps:
- Get an OpenAI API Key.
- Clone the openai-realtime-agents repo.
- Install dependencies and set environment variables.
- Run `npm run dev` and open localhost:3000.
- Or just call the demo: 425-800-0042.
Pitfalls & Gripes
- Echo Loops: The AI hears its own voice and thinks the user is talking, leading to infinite interruptions. Solution: Use WebRTC's built-in echo cancellation and avoid Firefox.
- Keep Instructions Concise: The model gets confused if system prompts exceed ~750 characters.
- Transcription isn't truly real-time: Transcription deltas only return after the user stops speaking. If you need real-time captions, this is a hurdle.
- Gap with Advanced Voice Mode: The API's voice naturalness still lags behind the ChatGPT app version.
- Firefox Issues: Poor echo cancellation; Chrome, Safari, or Edge are recommended.
Security & Privacy
- Data Storage: Processed on OpenAI servers; EU data residency is an option (eu.api.openai.com).
- Privacy Policy: Follows OpenAI's data usage policy; API data is not used for training by default.
- Ephemeral Keys: Use temporary keys on the browser side to avoid exposing your master API Key.
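The ephemeral-key flow means your backend mints a short-lived client secret and hands only that to the browser. A sketch of building that request server-side — the endpoint path and payload fields reflect the Realtime API docs at time of writing, so verify them before use:

```typescript
// Sketch of minting an ephemeral client key server-side so the browser
// never sees the master API key. Endpoint path and payload fields are
// based on the Realtime API docs at time of writing -- verify before use.
function buildEphemeralKeyRequest(apiKey: string, model: string) {
  return {
    url: "https://api.openai.com/v1/realtime/sessions",
    method: "POST" as const,
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model }),
  };
}

// The backend would run fetch(req.url, req) and forward only the
// short-lived client_secret from the response to the browser.
const req = buildEphemeralKeyRequest("sk-test-placeholder", "gpt-realtime-1.5");
console.log(req.method, req.url);
```

The master key stays in a server-side environment variable; the browser authenticates its WebRTC connection with the expiring secret only.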
Alternatives
| Alternative | Advantage | Disadvantage |
|---|---|---|
| Deepgram + Claude + ElevenLabs | Flexible, no lock-in, best-of-breed | Complex integration, cumulative latency |
| Vapi | One-stop orchestration, multi-model | Extra $0.05/min fee, added latency |
| Qwen3-Omni (Open Source) | Free, self-hostable, 119 languages | Quality unverified, requires GPUs |
| gpt-realtime-mini | Same ecosystem, 70% cheaper | Noticeably weaker than the full version |
For Investors
Market Analysis
- Conversational AI Sector: $14.79B in 2025 → $17.97B in 2026 → $82.46B by 2034 (CAGR 21%).
- Voice AI Agents: $2.4B in 2024 → $47.5B by 2034 (CAGR 34.8%).
- Drivers:
- 80% of enterprises plan to integrate AI voice into customer service by 2026.
- US voice assistant users projected to reach 157.1M by 2026.
- Global enterprise AI spending hitting $391B.
Competitive Landscape
| Layer | Players | Positioning |
|---|---|---|
| Model Layer | OpenAI (gpt-realtime), Google (Gemini), Anthropic (Claude Voice) | End-to-end voice AI models |
| Voice Layer | ElevenLabs, Deepgram | Specialized in voice quality/speed |
| Orchestration Layer | Vapi, Retell AI, Bland AI, Dasha | Voice agent platforms |
| Infrastructure | Twilio, LiveKit, Agora, Daily.co | Communication infrastructure |
Timing Analysis
Why now?:
- SIP Support — Voice AI can finally plug directly into phone networks, opening the trillion-dollar traditional call center market.
- Production-Ready Tool Calling — Moving from 49.7% to 66.5% reliability makes deployment feasible.
- Twilio's Channel Leverage — Through Twilio integration, gpt-realtime reaches 349K+ existing customers instantly.
- Top-tier Integration — From Perplexity to Genspark, OpenAI is building a defensive moat through ecosystem nodes.
Team & Funding
- OpenAI: One of the strongest teams in AI.
- Justin Uberti: WebRTC pioneer, formerly led Google's WebRTC project.
- Realtime API Team: Deep expertise at the intersection of RTC and AI modeling.
- Funding: OpenAI has raised $13B+, valued at ~$150B. This is a core component of their ecosystem, not a standalone startup needing funding.
Conclusion
gpt-realtime-1.5 isn't a revolutionary leap, but it's the critical step that takes voice AI from "cool demo" to "production-ready tool." With Tool Calling up 25%, transcription up 10%, and instruction adherence up 7%, every number represents a previously frustrating bug that has been squashed.
| User Type | Recommendation |
|---|---|
| Developers | Must Watch — If you're building voice products, this is the strongest S2S API available with excellent SDK support. |
| Product Managers | Worth Following — Keep the competitor comparison handy; the S2S vs. modular choice is your biggest architectural decision. |
| Bloggers | Good Angle — The "Voice War" is a great hook, but this particular update is better served by a technical deep-dive than a broad hot take. |
| Early Adopters | Cautiously Optimistic — Easy to start (20 mins), but watch the audio costs; try the mini version first. |
| Investors | Voice AI Confirmed — With a $47.5B market by 2034, OpenAI leads the model layer, but orchestration and infra still hold massive opportunities. |
Resource Links
| Resource | Link |
|---|---|
| Official Site | openai.com |
| API Docs | gpt-realtime-1.5 Model |
| Voice Agents Guide | Voice Agents Guide |
| Realtime API Docs | Realtime API |
| GitHub (Multi-Agent) | openai-realtime-agents |
| GitHub (JS SDK) | openai-agents-js |
| GitHub (Python SDK) | openai-agents-python |
| Official Demo | hello-realtime.val.run |
| Phone Demo | 425-800-0042 |
| Latent Space Analysis | The Missing Manual |
| Deepgram VAQI Comparison | VAQI Benchmark |
| Twitter @OpenAIDevs | Announcement Tweet |
| Twilio Integration | Twilio + OpenAI |
2026-02-27 | Trend-Tracker v7.3