Mercury 2: Bringing Diffusion to Text Generation for 10x Faster Speeds
2026-02-26 | https://www.producthunt.com/products/mercury-2

Gemini Interpretation: This is the Inception chat interface, featuring a minimalist dark design with a "Diffusion Effect" toggle and a "Mercury 2" model selector. The style is similar to Perplexity, focusing on a clean and fast user experience.
30-Second Quick Take
What is it?: Inception Labs has built an LLM that takes the road less traveled. Instead of emitting tokens one by one, it works like an image diffusion model—generating a rough draft and refining multiple tokens simultaneously. The result: speeds exceeding 1,000 tokens/sec, making it 13x faster than Claude Haiku and 15x faster than GPT-5 Mini.
Is it worth your attention?: Yes, but it depends on your use case.
Why?:
- If you're building AI Agents, real-time voice assistants, or code completion tools where latency is a dealbreaker, this is likely your most cost-effective choice.
- If you need the "smartest" model for complex reasoning or long-form writing, Mercury 2 isn't the top pick—its intelligence is roughly on par with Claude Haiku, not the "Opus" or "Pro" tier.
- The architectural innovation is fascinating; developers should keep a close eye on the Diffusion LLM direction.
How does it compare?

Gemini Interpretation: This benchmark chart visually demonstrates the massive gap between Mercury 2 (1009 t/s), Claude Haiku 4.5 (89 t/s), and GPT-5 Mini (71 t/s), labeled as ">5x faster."
| vs | Mercury 2 | Claude 4.5 Haiku | GPT 5.2 Mini |
|---|---|---|---|
| Core Difference | Diffusion architecture, parallel generation | Autoregressive, one token at a time | Autoregressive, one token at a time |
| Speed | 1,196 t/s | 89 t/s | 71 t/s |
| End-to-end Latency | 1.7 seconds | 23.4 seconds | N/A |
| Output Price | $0.75/M | $5.00/M | N/A |
| Intelligence | Medium (AIME 91.1) | Comparable | Comparable |
| Advantage | Fast, Cheap | Mature ecosystem, stable | OpenAI Ecosystem |
Three Questions That Matter
Is this for me?
- Target Audience: Developers and companies building AI applications, especially those sensitive to latency and cost.
- Is this you?: If you are doing any of the following, Mercury 2 is directly relevant:
  - Building AI Agents that require rapid, iterative LLM calls.
  - Creating real-time voice assistants where users can't wait 20 seconds for a reply.
  - Developing code completion or editor plugins that need instant feedback.
  - Handling large-scale batch processing where inference cost is the core concern.
- Use Cases:
  - Agent Loops → Mercury 2's 1.7s latency vs. 14-23s for others determines whether the product is even usable.
  - Real-time Dialogue/Voice → Use this.
  - Code Completion/Refactoring → Use this (already integrated with ProxyAI, Kilo Code, etc.).
  - Deep Analysis/Long-form Writing → Not ideal; stick with Claude Opus or GPT-5.
Is it useful for me?
| Dimension | Benefit | Cost |
|---|---|---|
| Time | Agent loops are 10x faster; code completion is nearly instant | Learning a new API (though it is OpenAI-compatible) |
| Money | Output cost is 1/7th of Claude Haiku and 1/4th of Gemini Flash | Pay-as-you-go, $0.75 per million output tokens |
| Effort | No more choosing between "fast" or "smart"—if you need speed, this is it | Diffusion prompting techniques might differ slightly from traditional models |
ROI Judgment: If your scenario involves calling an LLM dozens of times per task (agents, search, code), the ROI for Mercury 2 is massive. You get 5-13x the speed at 4-7x lower cost. If you only call an API occasionally for a single chat, the difference is negligible.
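The arithmetic behind that judgment is easy to check. A minimal sketch, using the latency and price figures from the comparison table above; the 20-call workload and 2,000 output tokens per call are hypothetical assumptions:

```python
# Back-of-envelope agent-task economics. Latency (1.7s vs 23.4s) and
# output prices ($0.75/M vs $5.00/M) come from the comparison table;
# the workload size below is an illustrative assumption.
CALLS_PER_TASK = 20
OUT_TOKENS_PER_CALL = 2_000  # assumed average output per call

def task_cost_usd(price_per_m: float) -> float:
    """Total output-token cost for one agent task."""
    return CALLS_PER_TASK * OUT_TOKENS_PER_CALL * price_per_m / 1_000_000

def task_latency_s(sec_per_call: float) -> float:
    """Total wall-clock time if the calls run sequentially."""
    return CALLS_PER_TASK * sec_per_call

print(f"Mercury 2: {task_latency_s(1.7):.0f}s, ${task_cost_usd(0.75):.3f}")
print(f"Haiku 4.5: {task_latency_s(23.4):.0f}s, ${task_cost_usd(5.00):.3f}")
```

Because agent calls are sequential, the gap compounds: the same task drops from nearly eight minutes to about half a minute.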
Why will I love it?
The "Wow" Factor:
- The feel of speed: Moving from "waiting for an answer" to "instant response" is a qualitative shift in user experience.
- Cost disruption: Tasks that used to cost dollars to run via an Agent now cost cents.
What people are saying:
"The speed numbers are absurd. Around 1,000 tokens per second with end-to-end latency of 1.7 seconds. That's an order of magnitude faster." — @RuiDiaoX
Real User Feedback:
Positive: "Impressive inference speed from Inception Labs' diffusion LLMs. Diffusion LLMs are a fascinating alternative to conventional autoregressive LLMs. Well done!" — @AndrewYNg (Andrew Ng, 1224 likes)
Positive: "After trying Mercury, it's hard to go back. We are excited to roll out Mercury to support all of our voice agents." — Customer Feedback (Inception Labs Website)
Watching: "This is a very promising approach, and I hope they build larger, more capable models. If they achieve the performance of a Qwen3.5 34B, it would enable TurboTokens on home PCs." — @TeksEdge
For Indie Hackers & Developers
Tech Stack
- Core Architecture: Diffusion Large Language Model (dLLM), distinct from traditional autoregressive models.
- How it works: Starting from noise, it uses a Transformer network to denoise in multiple steps, modifying multiple tokens at once—similar to how Midjourney generates images, but for text.
- GPU: Runs on NVIDIA Blackwell GPUs.
- API Format: Provided via Inception API, 128K context window.
- Feature Support: Tool calling (function calling), JSON structured output.

Gemini Interpretation: On the left, an autoregressive LLM requires 75 iterations to generate code; on the right, the Inception Diffusion LLM completes the same task in just 14 iterations, a 5x+ efficiency boost.
Core Implementation
Simply put: traditional LLMs are like typists hitting one key at a time. Mercury 2 is like an editor who quickly writes a rough draft and then polishes all the necessary parts at once. Because each step processes multiple tokens in parallel, the effective work done per neural-network inference far exceeds that of an autoregressive model.
This isn't just optimization (like better GPUs or model compression); it's a fundamental change in the path. Diffusion has already proven itself in image and video generation (Midjourney, Sora); Inception is now bringing it to language.
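A toy simulation makes the iteration math concrete. This is only a sketch of the scheduling idea, not Inception's actual sampler: assume an autoregressive decoder commits one token per forward pass, while a diffusion-style decoder finalizes a block of tokens per denoising pass (the block size here is an illustrative assumption, not a published Mercury parameter).

```python
def autoregressive_passes(seq_len: int) -> int:
    # One forward pass per committed token.
    return seq_len

def diffusion_passes(seq_len: int, tokens_per_pass: int) -> int:
    # Start fully "noised"; each denoising pass finalizes up to
    # `tokens_per_pass` tokens in parallel until none remain.
    remaining, passes = seq_len, 0
    while remaining > 0:
        remaining -= min(tokens_per_pass, remaining)
        passes += 1
    return passes

print(autoregressive_passes(75))  # 75 passes for a 75-token snippet
print(diffusion_passes(75, 6))    # 13 passes at ~6 tokens per step
```

With a block size of 1, both schedules coincide; the speedup comes entirely from how many tokens each pass can safely commit.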
Open Source Status
- Model is not open source; available via API only.
- Third-party SDK: https://github.com/hamzaamjad/mercury-client
- Existing Integrations: ProxyAI, Buildglare, Kilo Code, browser-use.
- Paper: https://arxiv.org/abs/2506.17298
- Difficulty to replicate: Extremely high. Requires deep background in diffusion research and massive GPU resources; the founders are professors who helped invent diffusion models.
Business Model
- Monetization: API usage-based billing.
- Pricing: $0.25/M input tokens, $0.75/M output tokens.
- Blended Price: ~$0.38/M tokens (extremely cheap).
- User Base: Not disclosed, but already integrated into several dev tools.
Big Tech Risk
This is the big question. Google is rumored to be working on "Gemini Diffusion." If Google releases a diffusion model that is both fast and flagship-smart, Inception's space could shrink. However:
- Inception is the first to bring diffusion LLMs to commercial scale, giving them a first-mover advantage.
- The founding team are academic authorities in this specific field.
- Investors include Microsoft, NVIDIA, and Databricks—giants choosing to invest rather than build it themselves (yet).
- The risk remains: if they can't scale intelligence beyond the "Haiku" level, speed alone might not be enough long-term.
For Product Managers
Pain Point Analysis
- Problem Solved: Slow LLM inference and high costs, which severely limit the deployment of Agents, real-time voice, and code completion.
- Severity: High. Many companies want to build AI Agents, but the latency makes for a poor user experience. An Agent task might require 20+ LLM calls; waiting 10-20 seconds for each is unacceptable for users.
User Persona
- Core Users: AI application developers, Agent platforms, Voice AI companies.
- Secondary Users: Code editor/IDE companies, search engines.
- Scenarios: Agent loops, real-time voice dialogue, auto-completion, large-scale text processing.
Feature Breakdown
| Feature | Type | Description |
|---|---|---|
| Ultra-fast Inference (1000+ t/s) | Core | The primary advantage of the diffusion architecture |
| Low Cost ($0.75/M output) | Core | 4-7x cheaper than competitors |
| Reasoning Capability (AIME 91.1) | Core | Comparable to Haiku-level models |
| Function Calling | Core | Essential for Agent scenarios |
| JSON Structured Output | Core | Developer-friendly |
| 128K Context | Nice-to-have | Sufficient but not industry-leading |
Competitive Differentiation
| vs | Mercury 2 | Groq (Llama) | Cerebras | SambaNova |
|---|---|---|---|---|
| Core Difference | Architectural Innovation | Hardware Acceleration (LPU) | Hardware Acceleration | Hardware Acceleration |
| Speed Source | Parallel Generation | Specialized Chips | Wafer-scale Chips | Custom Processors |
| Intelligence | Medium | Model-dependent | Model-dependent | Model-dependent |
| Pricing | $0.75/M output | Varies | Varies | Varies |
| Uniqueness | Innovation at the model level | Runs any model | Runs any model | Runs any model |
Key Takeaways
- "Speed as a Feature": Instead of competing on pure intelligence, they hit the extreme on the speed dimension to find a differentiated entry point.
- Academic to Commercial Path: A clear trajectory from paper → open research → commercial API.
- Simple Pricing: Only two price points (input/output) with no complex tiers, lowering the barrier to decision-making.
For Tech Bloggers
Founder Story
- Founders: Stefano Ermon (Stanford Professor), Aditya Grover (UCLA Professor), Volodymyr Kuleshov (Cornell Professor).
- Background: The trio has collaborated for over 10 years and were early researchers of core AI technologies like Diffusion Models, Flash Attention, and DPO (Direct Preference Optimization). Ermon himself was involved in the invention of diffusion models—the tech behind Midjourney and Sora.
- The Mission: Diffusion decisively overtook earlier generative methods in image and video; they want to replicate that success in text.
- Timeline: Founded in 2024 → Stealth debut Feb 2025 → $50M funding Nov 2025 → Mercury 2 release Feb 2026.
Discussion Angles
- Angle 1 - "Fast enough, but smart enough?": Mercury 2 targets the Haiku tier, not Opus or GPT-5. It's perfect for speed-reliant tasks, but can it handle high-level complexity?
- Angle 2 - "Diffusion vs. Autoregressive: The Future?": This is a battle of technical philosophies. If diffusion LLMs can reach flagship intelligence while maintaining speed, the industry landscape will be rewritten.
- Angle 3 - "The Inventors Step In": The founders are the literal inventors of the tech. A story of researchers commercializing their own breakthrough has natural appeal.
- Angle 4 - "The Andrew Ng + Andrej Karpathy Signal": When two of AI's most influential figures bet on the same horse, it's a strong market signal.
Buzz Data
- PH Ranking: 14 votes (Low, as the product is B2B/Developer focused, not for general PH consumers).
- Twitter/X: Founder's tweet got 3,753 likes; Andrew Ng's retweet got 1,224 likes.
- HN Discussion: Dedicated thread (item?id=47144464).
- Media Coverage: Extensive reporting by Bloomberg, TechCrunch, Yahoo Finance, eWeek, InfoWorld, and The Decoder.
Content Suggestions
- Tech Deep Dive: How diffusion models are applied to text generation.
- Founder Profile: The journey from Stanford labs to a $50M startup.
- Market Analysis: The "Inference Speed Race" of 2026.
For Early Adopters
Pricing Analysis
| Tier | Price | Features | Is it enough? |
|---|---|---|---|
| API | $0.25/M input, $0.75/M output | Full model capability, 128K context, function calling, JSON output | Perfectly sufficient for speed-sensitive apps |
Currently API-only with no explicit free tier mentioned. However, the price is incredibly low—generating 1 million output tokens (several books' worth of text) costs only $0.75.
Getting Started Guide
- Setup Time: 10-15 minutes.
- Learning Curve: Low (if you've used OpenAI's API).
- Steps:
  - Apply for an API key at https://www.inceptionlabs.ai/.
  - Install the SDK: `pip install mercury-client`.
  - Call it like the OpenAI API; it supports function calling and JSON mode.
- Note: Diffusion prompts may need slight adjustments compared to GPT prompts.
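Since the API is advertised as OpenAI-compatible, a plain HTTP call is enough to try it. A minimal sketch—the base URL and model id below are assumptions, so confirm both against Inception's official docs:

```python
import json
import os
import urllib.request

BASE_URL = "https://api.inceptionlabs.ai/v1"  # assumed endpoint -- verify in the docs
MODEL = "mercury-2"                           # assumed model id -- verify in the docs

def build_chat_request(prompt: str, max_tokens: int = 256) -> dict:
    """OpenAI-style chat-completions payload."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str) -> str:
    """POST the payload and return the first choice's text."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['INCEPTION_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__" and os.environ.get("INCEPTION_API_KEY"):
    print(chat("Say hello in five words."))
```

Any OpenAI-compatible SDK should work the same way by pointing its base URL at the Inception endpoint.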
Pitfalls & Complaints
- Verbosity: Mercury 2 tends to generate very long outputs. In evaluations, it generated 69M tokens where other models averaged 15M. You may need to explicitly prompt for brevity.
- Niche Robustness: It might be less stable than mature autoregressive models on extremely niche or highly specialized reasoning tasks.
- Fine-tuning Path: If you need to fine-tune, the process for diffusion models differs from traditional methods, and support is currently unclear.
- Ecosystem Lock-in: Only one API provider (Inception Labs), unlike the rich third-party toolchains for OpenAI or Anthropic.
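The verbosity complaint above is usually tamed at request time. A sketch, assuming the model honors the standard OpenAI chat fields (a reasonable but unverified assumption for Mercury 2):

```python
def concise_request(prompt: str, cap: int = 300) -> dict:
    # A brevity system instruction plus a hard max_tokens cap; both are
    # standard OpenAI-schema fields, assumed (not verified) to apply here.
    return {
        "model": "mercury-2",  # assumed model id
        "messages": [
            {"role": "system",
             "content": "Answer concisely. Do not restate the question or pad."},
            {"role": "user", "content": prompt},
        ],
        "max_tokens": cap,
    }

req = concise_request("Summarize this design doc.")
print(req["max_tokens"])  # 300
```

The hard cap guarantees a ceiling on spend even when the brevity instruction is ignored.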
Security & Privacy
- Data Storage: Cloud-based API.
- Privacy Policy: Refer to Inception Labs' specific terms.
- Security Audits: No public information available yet.
Alternatives
| Alternative | Advantage | Disadvantage |
|---|---|---|
| Claude 4.5 Haiku | Mature ecosystem, brand trust | 13x slower, 7x more expensive |
| GPT 5.2 Mini | OpenAI ecosystem, rich tools | 15x slower |
| Groq + Llama | Choice of models | Hardware acceleration, not architectural innovation |
| Gemini 3 Flash | Google ecosystem, multimodal | 4x more expensive, slower |
For Investors
Market Analysis
- Sector Size: AI inference market projected at $106.1B in 2025 → $255B by 2030 (19.2% CAGR).
- Cost Trend: Inference costs are dropping roughly 10x annually.
- Drivers: The explosion of AI Agents requires massive low-latency inference; by 2026, inference cost will be the primary competitive factor.
Competitive Landscape
| Tier | Players | Positioning |
|---|---|---|
| Leaders | OpenAI, Anthropic, Google | All-rounders, leading in intelligence |
| Speed Tier | Groq, Cerebras, SambaNova | Hardware acceleration for existing models |
| Architectural Innovation | Inception Labs (Mercury 2) | Diffusion LLM, speed at the model level |
Timing Analysis
- Why Now?: Agents are the hottest application direction for 2026, but latency and cost are the main bottlenecks. Mercury 2 hits this pain point perfectly.
- Tech Maturity: Diffusion LLMs have academic backing; Mercury 2 is the first commercial-grade implementation.
- Market Readiness: Developers are used to API models, making switching costs low. The challenge is educating the market on the "Diffusion LLM" concept.
Team Background
- Founders: Stefano Ermon (Stanford), Aditya Grover (UCLA), Volodymyr Kuleshov (Cornell)—three top-tier professors.
- Core Contributions: Early, influential work on diffusion models, Flash Attention, DPO, and more.
- Track Record: Extremely high academic citations; among the most influential researchers in AI.
Funding Status
- Total Raised: $56 Million.
- Lead Investor: Menlo Ventures.
- Participants: Mayfield, Innovation Endeavors, NVentures (NVIDIA), M12 (Microsoft), Snowflake Ventures, Databricks Ventures.
- Angel Investors: Andrew Ng, Andrej Karpathy.
- Total Investors: 13.
Conclusion
Mercury 2 is a genuine architectural innovation, not just a patch on old tech. Its speed advantage is overwhelming, but its intelligence remains at the Haiku level. The key to its future is whether it can scale up in capability.
| User Type | Recommendation |
|---|---|
| Developers | ✅ If you're building Agents or real-time apps, this is a must-try. Familiar API, low switching cost. |
| Product Managers | ✅ Watch this space. Lower speed and cost could unlock product forms that were previously impossible. |
| Bloggers | ✅ Great material. Founder story + technical rivalry + big-name backing. |
| Early Adopters | ✅ Worth a try; the API is very cheap. Just don't expect it to replace Claude Opus for complex tasks. |
| Investors | ✅ Top-tier team, great sector, perfect timing. The risk lies in Big Tech giants adopting similar architectures. |
Resource Links
| Resource | Link |
|---|---|
| Official Website | https://www.inceptionlabs.ai/ |
| Blog | https://www.inceptionlabs.ai/blog/introducing-mercury-2 |
| Artificial Analysis | https://artificialanalysis.ai/models/mercury-2 |
| Paper | https://arxiv.org/abs/2506.17298 |
| Python SDK | https://github.com/hamzaamjad/mercury-client |
| HN Discussion | https://news.ycombinator.com/item?id=47144464 |
| ProductHunt | https://www.producthunt.com/products/mercury-2 |
| Founder's Twitter | https://twitter.com/StefanoErmon |
2026-02-26 | Trend-Tracker v7.3