gpt-realtime-1.5 by OpenAI

API

Tighter instruction adherence in speech agents

"It's like upgrading from a nervous intern who mishears phone numbers to a seasoned receptionist who takes perfect notes and never misses a beat."

30-Second Verdict
What is it: An upgraded real-time voice model from OpenAI that makes AI agents more reliable at following instructions, using tools, and speaking multiple languages.
Worth attention: If you are building voice AI products, this is essential. gpt-realtime-1.5 offers a substantial boost in tool-calling reliability (+25%), solving a major production hurdle.
Hype: 8/10 | Utility: 9/10 | Votes: 128

Product Profile
Full Analysis Report

gpt-realtime-1.5: The "Get to Work" Version of OpenAI's Voice Agents

2026-02-27 | ProductHunt | Official Site | API Docs

OpenAI Realtime API official image — the "Agent online" interface, signaling the voice agent is ready.


30-Second Quick Judgment

What is it?: An upgraded version of OpenAI's real-time voice model that makes AI voice agents more reliable at following instructions, calling tools, and speaking multiple languages. Simply put, it makes your AI customer service calls much less "clunky."

Is it worth your attention?: If you're building voice AI products, absolutely. gpt-realtime-1.5 offers a substantial improvement in tool-calling reliability (up over 25%), which was the biggest complaint from developers. If you're just a casual user, this update won't affect you much—it's a pure API product for developers.


Three Key Questions

Does it matter to me?

Target Audience: Developers and enterprises building or planning to build voice AI products. Specifically:

  • Teams building AI call center systems.
  • Developers creating voice assistants (e.g., smart ordering, appointment scheduling).
  • International products requiring multilingual voice interaction.

Am I the target user?:

  • Building a voice AI product? → You are the core user.
  • Building a SaaS and want to add voice? → Definitely worth a look.
  • Just building standard CRUD apps? → Probably doesn't matter to you.

Common Use Cases:

  • Automated customer service calls → Use Realtime API + Twilio SIP.
  • In-browser voice interaction → Use WebRTC.
  • AI Voice Assistant Apps → Use the Agents SDK.
  • No need for voice? → You don't need this.

Is it useful for me?

| Dimension | Benefit | Cost |
|---|---|---|
| Time | Saves the engineering effort of building a custom STT+LLM+TTS pipeline (at least 2-4 weeks). | Takes 1-3 days to learn the Realtime API. |
| Money | Significantly lower maintenance costs than a custom pipeline. | Audio tokens at $32/$64 per 1M are not cheap. |
| Effort | End-to-end S2S architecture reduces debugging steps. | Vendor lock-in risk; you're tied to GPT. |

ROI Judgment: If your voice product handles a few thousand calls a month, the Realtime API is a bargain—it saves you the massive cost of building and tuning an STT/TTS pipeline. However, if you're hitting millions of calls, the costs will explode, and a modular solution like Deepgram + ElevenLabs might be more economical.
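To sanity-check that break-even claim, here's a rough per-call cost sketch using the prices quoted in this report; the ~600 audio tokens per minute of speech is an assumption for illustration, not a figure from this analysis.

```python
# Rough per-call cost sketch for the Realtime API, using the prices
# listed in this report. TOKENS_PER_AUDIO_MIN is an assumed rate
# (~600 audio tokens per minute), not a figure from this analysis.

AUDIO_IN_PER_M = 32.0        # dollars per 1M audio input tokens
AUDIO_OUT_PER_M = 64.0       # dollars per 1M audio output tokens
TOKENS_PER_AUDIO_MIN = 600   # assumed audio tokens per minute of speech

def call_cost(user_minutes: float, ai_minutes: float) -> float:
    """Estimated dollar cost of one call: user speech in, AI speech out."""
    cost_in = user_minutes * TOKENS_PER_AUDIO_MIN * AUDIO_IN_PER_M / 1_000_000
    cost_out = ai_minutes * TOKENS_PER_AUDIO_MIN * AUDIO_OUT_PER_M / 1_000_000
    return cost_in + cost_out

# A 4-minute support call, roughly half user and half agent speech,
# at a volume of 3,000 calls per month:
per_call = call_cost(user_minutes=2, ai_minutes=2)
print(f"~${per_call:.3f}/call, ~${per_call * 3_000:.0f}/month")
```

At these assumed rates, a few thousand calls a month lands in the hundreds of dollars; scale that to millions of calls and the cost explosion described above becomes obvious.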

Will I like it?

The Highlights:

  • Tool calling is finally reliable: Calling tools used to be a gamble; now, the ComplexFuncBench score has jumped from 49.7% to 66.5%.
  • Massive boost in alphanumeric recognition: Transcription accuracy is up 10.23%. Order numbers and phone numbers are finally being captured correctly.
  • Asynchronous function calling: The AI doesn't have to sit in awkward silence while waiting for a tool to return; it can keep the conversation going with a "Just a moment while I check that."

The "Wow" Moment:

In the official demo, the model took a 7-digit mixed alphanumeric order number and repeated it back perfectly—something previous versions consistently failed at. — @kwindla

Real User Feedback:

Positive: "tool call stability optimized by over 25%, dramatically improved voice expressiveness" — @Comet (Perplexity Browser)
Positive: "gpt-realtime-1.5 is the best native audio model on Scale AudioMultiChallenge benchmark" — @pbbakkum (OpenAI Engineer)
Critique (historical): "Realtime API seems pretty nerfed compared to Advanced Voice Mode" — OpenAI Community developer


For Indie Hackers

Tech Stack

  • Model: gpt-realtime-1.5 (Native speech-to-speech, not an STT+LLM+TTS pipeline)
  • Protocols: WebRTC (Browser) / WebSocket (Server) / SIP (Telephony)
  • Audio Encoding: Opus (WebRTC handles echo cancellation, noise reduction, and gain control)
  • SDK: OpenAI Agents SDK (TypeScript is the primary recommendation, Python also supported)
  • Context Window: 32,768 tokens, with a max output of 4,096 tokens
  • Instruction + Tool Limit: 16,384 tokens
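Those budgets are easy to exceed once tool schemas grow. A minimal pre-flight check might look like the sketch below; the 4-characters-per-token ratio is a crude assumption (a real tokenizer such as tiktoken would be more accurate).

```python
# Pre-flight check against the instruction + tool limit quoted above.
# CHARS_PER_TOKEN is a crude heuristic; use a real tokenizer for accuracy.

INSTRUCTION_TOOL_LIMIT = 16_384   # combined instructions + tool schemas
CHARS_PER_TOKEN = 4               # rough approximation for English text

def estimate_tokens(text: str) -> int:
    """Very rough token estimate from character count."""
    return max(1, len(text) // CHARS_PER_TOKEN)

def fits_limits(instructions: str, tool_schemas_json: str) -> bool:
    """True if instructions plus tool definitions fit under the 16,384-token cap."""
    total = estimate_tokens(instructions) + estimate_tokens(tool_schemas_json)
    return total <= INSTRUCTION_TOOL_LIMIT

print(fits_limits("You are a concise booking assistant.", "{}"))  # True
```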

Core Architecture

The unique thing about gpt-realtime is its end-to-end Speech-to-Speech architecture. Traditional setups use a three-step chain: Speech-to-Text → LLM processing → Text-to-Speech. gpt-realtime collapses this into one step—the model "hears" and "speaks" audio directly, preserving tone, emotion, and non-verbal cues.

The recommended production architecture is Sideband Mode: The browser sends audio directly to OpenAI via WebRTC (for low latency), while your backend server connects to the same session via WebSocket to handle business logic (tool calls, database queries, etc.). This keeps the audio path short and your business logic private.

Browser ←──WebRTC──→ OpenAI Realtime API
                           ↕
Your Backend ←──WebSocket──→ (Same session)

Open Source Status

  • The Model: Proprietary, API only.
  • SDKs and Examples: MIT License, available on GitHub (openai-realtime-agents, openai-agents-js, openai-agents-python).
  • Open Source Alternatives: Qwen3-Omni (Alibaba, end-to-end multimodal, 119 languages).
  • Difficulty to build yourself: Extremely high. End-to-end S2S models require massive audio datasets and compute power. However, building on top of the API is very accessible.

Business Model

  • Monetization: API billed per token.
  • Pricing:
    • Text: $4/1M input, $16/1M output
    • Audio: $32/1M input, $64/1M output
    • Cached Input: $0.40/1M (97% savings! Using cache is key to controlling costs)
  • Comparison: 20% price drop compared to gpt-4o-realtime-preview.

Big Tech Risks

This is an OpenAI product, but looking at the competitive landscape:

  • Google has Gemini audio capabilities, but hasn't launched a direct Realtime API equivalent yet.
  • Anthropic is catching up with Claude Voice; the voice wars have begun.
  • Alibaba's Qwen3-Omni is the open-source disruptor.
  • The real risk isn't being replaced, but rather "commoditization"—if voice AI becomes a utility like water or electricity, profit margins for developers may shrink.

For Product Managers

Pain Point Analysis

Core Problem Solved: Voice AI agents failing at critical moments.

Specifically:

  1. Unreliable Tool Calling — When an AI agent needs to check an order or inventory, it previously often called the wrong tool or passed incorrect parameters.
  2. Poor Instruction Adherence — Telling the AI "don't reply in Chinese" only for it to reply in Chinese anyway.
  3. Multilingual Switching — The user speaks Spanish, but the AI insists on English.

How painful is it?: High-frequency and critical. Any production-grade voice agent faces these issues; they determine whether a product is actually shippable. gpt-realtime-1.5 targets these exact pain points.

User Persona

  • Early Adopters: Perplexity (integrated into Comet browser), Genspark (stress-tested for bilingual translation).
  • Typical Customers: Mid-to-large enterprises needing AI phone support.
  • Developer Persona: Full-stack devs with WebRTC/WebSocket experience building voice products.

Feature Breakdown

| Feature | Type | Description |
|---|---|---|
| Instruction Adherence +7% | Core | Directly impacts agent usability. |
| Enhanced Tool Calling | Core | Reliability is the barrier to production deployment. |
| Transcription Accuracy +10.23% | Core | Essential for order numbers, verification codes, etc. |
| Multilingual Accuracy | Core | A must-have for international products. |
| Async Function Calling | Nice-to-have | Keeps the conversation natural during wait times. |
| Placeholder Responses | Nice-to-have | Automatically says things like "One moment..." |
| SIP Connection | Core (Telephony) | Connects directly to traditional phone systems. |

Competitor Differentiation

| Dimension | gpt-realtime-1.5 | ElevenLabs Agents | Deepgram Agent | Vapi |
|---|---|---|---|---|
| Architecture | End-to-end S2S | Modular STT+LLM+TTS | STT/TTS + Agent API | Orchestration middleware |
| Core Advantage | Highest naturalness, emotional sensing | Best voice quality, cloning | Fast (<300ms), low cost | Flexible vendor mixing |
| LLM Lock-in | Yes (GPT only) | No (multi-LLM) | Partial | No (multi-LLM) |
| Best For | High-value conversations (VIP support) | Branded voice, audio content | High-throughput transcription | Best-of-breed needs |
| Estimated Cost | High | Medium | Low | $0.13-0.31+/min |

Key Takeaways

  1. Sideband Architecture: Audio on the fast path, logic on the secure path—this separation of concerns is a great design pattern.
  2. Snapshot Versioning: Lock in model versions (e.g., gpt-realtime-1.5-2026-02-23) to ensure consistent behavior.
  3. Graceful Degradation: Small features like placeholder responses and idle prompts solve the "awkward AI silence" problem.

For Tech Bloggers

Founder/Key Figures

This is a core OpenAI API line, but notable figures include:

  • Justin Uberti (@juberti): Engineering lead for OpenAI Realtime API and a WebRTC legend (early core engineer for Google's WebRTC project). He shared a demo number you can call: 425-800-0042.
  • Charlie Guo (@charlierguo): OpenAI DevRel, who recorded the official demo showing a full "AI food ordering" flow.
  • Peter Bakkum (@pbbakkum): OpenAI Engineer who shared benchmark data, calling it the "best native audio model on Scale AudioMultiChallenge."

Controversy / Discussion Angles

  1. The "Voice War" Narrative: Anthropic has Claude Voice, Google has Gemini, Alibaba has Qwen3-Omni, and OpenAI is fighting back with the Realtime API. This signals a total war in the AI voice space.
  2. API vs. Consumer Quality Gap: Developers often complain that the Realtime API isn't as good as ChatGPT's Advanced Voice Mode. Is this intentional differentiation or a technical limitation?
  3. Vendor Lock-in: S2S end-to-end vs. modular pipelines. Many developers prefer the Deepgram + Claude + ElevenLabs combo to avoid being locked into the OpenAI ecosystem.
  4. Ethics of AI Call Centers: One developer noted that "any drive-thru or call center is about to be replaced by AI voice."

Hype Metrics

  • ProductHunt: 274 votes.
  • Twitter: OpenAIDevs official tweet: 2109 likes, 175 reposts, 374K views.
  • Ecosystem Adoption: First-day integration by Perplexity's Comet browser.
  • Tech Community: Deep technical analysis available on Latent Space: "The Missing Manual".

Content Suggestions

  • The Big Picture: "The AI Voice Wars of 2026: A Battle of Three Giants' Technical Roadmaps."
  • Trending: "The Future of Browser Voice Interaction" (featuring Perplexity Comet).
  • Hands-on: "Build a Voice Agent in 20 Minutes with the Agents SDK."

For Early Adopters

Pricing Analysis

| Tier | Price | Includes | Verdict |
|---|---|---|---|
| Text Input | $4/1M tokens | System instructions, text input | Cheap |
| Text Output | $16/1M tokens | Text responses | Cheap |
| Audio Input | $32/1M tokens | User voice | Expensive; the main cost driver |
| Audio Output | $64/1M tokens | AI voice response | The most expensive line item |
| Cached Input | $0.40/1M tokens | Repeated system instructions | 97% savings; essential |

Pro Tip: Use cached inputs ($0.40 vs $32) by designing your system instructions to be cacheable. Short, concise prompts can also drastically lower costs.

The 'Mini' Option: gpt-realtime-mini costs $10/$20 per 1M tokens (about 70% cheaper), ideal for scenarios where absolute precision isn't required.
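To make the caching math concrete, here's a small sketch comparing the uncached and cached input rates quoted above; the prompt size and call volume are illustrative assumptions.

```python
# Savings sketch for cached input tokens, using this report's quoted rates.
# Prompt size and call volume below are illustrative assumptions.

UNCACHED_PER_M = 32.00   # dollars per 1M uncached (audio) input tokens
CACHED_PER_M = 0.40      # dollars per 1M cached input tokens

def monthly_input_cost(prompt_tokens: int, calls: int, cached: bool) -> float:
    """Monthly cost of replaying the same system prompt on every call."""
    rate = CACHED_PER_M if cached else UNCACHED_PER_M
    return prompt_tokens * calls * rate / 1_000_000

# A 2,000-token system prompt replayed across 10,000 calls per month:
uncached = monthly_input_cost(2_000, 10_000, cached=False)  # 640.0
cached = monthly_input_cost(2_000, 10_000, cached=True)     # 8.0
print(f"${uncached:.2f} uncached vs ${cached:.2f} cached")
```

The gap is large enough that cache-friendly prompt design (a stable prefix, variable details at the end) is less an optimization than a requirement at volume.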

Getting Started

  • Setup Time: 20 minutes (using official SDK examples).
  • Learning Curve: Moderate (requires understanding WebRTC or WebSockets).
  • Steps:
    1. Get an OpenAI API Key.
    2. Clone the openai-realtime-agents repo.
    3. Install dependencies and set environment variables.
    4. Run npm run dev and open localhost:3000.
    5. Or just call the demo: 425-800-0042.

Pitfalls & Gripes

  1. Echo Loops: The AI hears its own voice and thinks the user is talking, leading to infinite interruptions. Solution: Use WebRTC's built-in echo cancellation and avoid Firefox.
  2. Keep Instructions Concise: The model gets confused if system prompts exceed ~750 characters.
  3. Transcription isn't truly real-time: Transcription deltas only return after the user stops speaking. If you need real-time captions, this is a hurdle.
  4. Gap with Advanced Voice Mode: The API's voice naturalness still lags behind the ChatGPT app version.
  5. Firefox Issues: Poor echo cancellation; Chrome, Safari, or Edge are recommended.
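Pitfall #2 is cheap to guard against in code. A lint along these lines can flag oversized instructions before they ship; the ~750-character threshold is the report's anecdotal figure, not a hard API limit.

```python
# Lint sketch for pitfall #2 above: flag system prompts past the
# ~750-character mark where the report says behavior starts to degrade.
# The threshold is the report's anecdotal figure, not a hard API limit.

SOFT_LIMIT = 750  # characters, per the report's observation

def check_instructions(instructions: str) -> str:
    """Return 'ok' or a warning string for an over-long system prompt."""
    n = len(instructions)
    if n <= SOFT_LIMIT:
        return f"ok ({n} chars)"
    return f"warning: {n} chars exceeds ~{SOFT_LIMIT}; consider trimming"

print(check_instructions(
    "Greet the caller, confirm the order number, "
    "then hand off to a human for refunds."
))
```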

Security & Privacy

  • Data Storage: Processed on OpenAI servers; EU data residency is an option (eu.api.openai.com).
  • Privacy Policy: Follows OpenAI's data usage policy; API data is not used for training by default.
  • Ephemeral Keys: Use temporary keys on the browser side to avoid exposing your master API Key.

Alternatives

| Alternative | Advantage | Disadvantage |
|---|---|---|
| Deepgram + Claude + ElevenLabs | Flexible, no lock-in, best-of-breed | Complex integration, cumulative latency |
| Vapi | One-stop orchestration, multi-model | Extra $0.05/min fee, added latency |
| Qwen3-Omni (open source) | Free, self-hostable, 119 languages | Quality unverified, requires GPUs |
| gpt-realtime-mini | Same ecosystem, 70% cheaper | Noticeably weaker than the full version |

For Investors

Market Analysis

  • Conversational AI Sector: $14.79B in 2025 → $17.97B in 2026 → $82.46B by 2034 (CAGR 21%).
  • Voice AI Agents: $2.4B in 2024 → $47.5B by 2034 (CAGR 34.8%).
  • Drivers:
    • 80% of enterprises plan to integrate AI voice into customer service by 2026.
    • US voice assistant users projected to reach 157.1M by 2026.
    • Global enterprise AI spending hitting $391B.

Competitive Landscape

| Layer | Players | Positioning |
|---|---|---|
| Model Layer | OpenAI (gpt-realtime), Google (Gemini), Anthropic (Claude Voice) | End-to-end voice AI models |
| Voice Layer | ElevenLabs, Deepgram | Specialized in voice quality/speed |
| Orchestration Layer | Vapi, Retell AI, Bland AI, Dasha | Voice agent platforms |
| Infrastructure | Twilio, LiveKit, Agora, Daily.co | Communication infrastructure |

Timing Analysis

Why now?:

  1. SIP Support — Voice AI can finally plug directly into phone networks, opening the trillion-dollar traditional call center market.
  2. Production-Ready Tool Calling — Moving from 49.7% to 66.5% reliability makes deployment feasible.
  3. Twilio's Channel Leverage — Through Twilio integration, gpt-realtime reaches 349K+ existing customers instantly.
  4. Top-tier Integration — From Perplexity to Genspark, OpenAI is building a defensive moat through ecosystem nodes.

Team & Funding

  • OpenAI: One of the strongest teams in AI.
  • Justin Uberti: WebRTC pioneer, formerly led Google's WebRTC project.
  • Realtime API Team: Deep expertise at the intersection of RTC and AI modeling.
  • Funding: OpenAI has raised $13B+, valued at ~$150B. This is a core component of their ecosystem, not a standalone startup needing funding.

Conclusion

gpt-realtime-1.5 isn't a revolutionary leap, but it's the critical step that takes voice AI from "cool demo" to "production-ready tool." With Tool Calling up 25%, transcription up 10%, and instruction adherence up 7%, every number represents a previously frustrating bug that has been squashed.

| User Type | Recommendation |
|---|---|
| Developers | Must Watch — If you're building voice products, this is the strongest S2S API available, with excellent SDK support. |
| Product Managers | Worth Following — Keep the competitor comparison handy; the S2S vs. modular choice is your biggest architectural decision. |
| Bloggers | Good Angle — The "Voice War" narrative is a strong hook, though this particular update rewards a technical deep dive. |
| Early Adopters | Cautiously Optimistic — Easy to start (20 minutes), but watch the audio costs; try the mini version first. |
| Investors | Voice AI Confirmed — With a $47.5B market projected by 2034, OpenAI leads the model layer, but orchestration and infrastructure still hold massive opportunities. |

Resource Links

| Resource | Link |
|---|---|
| Official Site | openai.com |
| API Docs | gpt-realtime-1.5 Model |
| Voice Agents Guide | Voice Agents Guide |
| Realtime API Docs | Realtime API |
| GitHub (Multi-Agent) | openai-realtime-agents |
| GitHub (JS SDK) | openai-agents-js |
| GitHub (Python SDK) | openai-agents-python |
| Official Demo | hello-realtime.val.run |
| Phone Demo | 425-800-0042 |
| Latent Space Analysis | The Missing Manual |
| Deepgram VAQI Comparison | VAQI Benchmark |
| Twitter @OpenAIDevs | Announcement Tweet |
| Twilio Integration | Twilio + OpenAI |

2026-02-27 | Trend-Tracker v7.3

One-line Verdict

gpt-realtime-1.5 is the bridge that takes voice AI from a cool demo to a production-ready tool. By fixing core pain points like Tool Calling and transcription accuracy, it significantly boosts commercial viability and is a must-watch infrastructure update for developers.

FAQ

Frequently Asked Questions about gpt-realtime-1.5 by OpenAI

What is gpt-realtime-1.5?
An upgraded real-time voice model from OpenAI that makes AI agents more reliable at following instructions, using tools, and speaking multiple languages.

What are its main features?
Enhanced tool calling (+25%), improved transcription accuracy (+10.23%), asynchronous function calling, direct SIP connection, and placeholder responses.

How much does it cost?
Audio input $32/1M tokens, output $64/1M; text input $4/1M, output $16/1M; cached input $0.40/1M.

Who is it for?
Developers building voice AI products, enterprise AI customer service teams, and international product owners needing multilingual voice interaction.

What are the alternatives?
ElevenLabs Agents (superior voice quality), Deepgram Agent (low latency/cost), Vapi (orchestration middleware), and Qwen3-Omni (open source).

Data source: ProductHunt | Feb 26, 2026