Mercury 2: Bringing Diffusion to Text Generation for 10x Faster Speeds
2026-02-26 | https://www.producthunt.com/products/mercury-2

Gemini Interpretation: This is the Inception chat interface, featuring a minimalist dark design with a "Diffusion Effect" toggle and a "Mercury 2" model selector. The style is similar to Perplexity, focusing on a clean and fast user experience.
30-Second Quick Take
What is it?: Inception Labs has built an LLM that takes the road less traveled. Instead of emitting tokens one by one, it works like an image diffusion model—generating a rough draft and refining multiple tokens simultaneously. The result: speeds exceeding 1,000 tokens/sec, making it 13x faster than Claude Haiku and 15x faster than GPT-5 Mini.
Is it worth your attention?: Yes, but it depends on your use case.
Why?:
- If you're building AI Agents, real-time voice assistants, or code completion tools where latency is a dealbreaker, this is likely your most cost-effective choice.
- If you need the "smartest" model for complex reasoning or long-form writing, Mercury 2 isn't the top pick—its intelligence is roughly on par with Claude Haiku, not the "Opus" or "Pro" tier.
- The architectural innovation is fascinating; developers should keep a close eye on the Diffusion LLM direction.
How does it compare?

Gemini Interpretation: This benchmark chart visually demonstrates the massive gap between Mercury 2 (1009 t/s), Claude Haiku 4.5 (89 t/s), and GPT-5 Mini (71 t/s), labeled as ">5x faster."
| vs | Mercury 2 | Claude 4.5 Haiku | GPT 5.2 Mini |
|---|---|---|---|
| Core Difference | Diffusion architecture, parallel generation | Autoregressive, one token at a time | Autoregressive, one token at a time |
| Speed | 1,196 t/s | 89 t/s | 71 t/s |
| End-to-end Latency | 1.7 seconds | 23.4 seconds | N/A |
| Output Price | $0.75/M | $5.00/M | N/A |
| Intelligence | Medium (AIME 91.1) | Comparable | Comparable |
| Advantage | Fast, Cheap | Mature ecosystem, stable | OpenAI Ecosystem |
Three Questions That Matter
Is this for me?
- Target Audience: Developers and companies building AI applications, especially those sensitive to latency and cost.
- Is this you?: If you are doing any of the following, Mercury 2 is directly relevant:
  - Building AI Agents that require rapid, iterative LLM calls.
  - Creating real-time voice assistants where users can't wait 20 seconds for a reply.
  - Developing code completion or editor plugins that need instant feedback.
  - Handling large-scale batch processing where inference cost is the core concern.
- Use Cases:
  - Agent Loops → Mercury 2's 1.7s latency vs. 14-23s for others determines whether the product is even usable.
  - Real-time Dialogue/Voice → Use this.
  - Code Completion/Refactoring → Use this (already integrated with ProxyAI, Kilo Code, etc.).
  - Deep Analysis/Long-form Writing → Not ideal; stick with Claude Opus or GPT-5.
Is it useful for me?
| Dimension | Benefit | Cost |
|---|---|---|
| Time | Agent loops are 10x faster; code completion is nearly instant | Learning a new API (though it is OpenAI-compatible) |
| Money | Output cost is 1/7th of Claude Haiku and 1/4th of Gemini Flash | Pay-as-you-go, $0.75 per million output tokens |
| Effort | No more choosing between "fast" or "smart"—if you need speed, this is it | Diffusion prompting techniques might differ slightly from traditional models |
ROI Judgment: If your scenario involves calling an LLM dozens of times per task (agents, search, code), the ROI for Mercury 2 is massive. You get 5-13x the speed at 4-7x lower cost. If you only call an API occasionally for a single chat, the difference is negligible.
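The arithmetic behind that judgment is easy to check. A minimal sketch, using the latency and price figures from the comparison table above; the 20-call workload and 2,000 output tokens per call are hypothetical assumptions:

```python
# Back-of-envelope agent-task economics. Latency (1.7s vs 23.4s) and
# output prices ($0.75/M vs $5.00/M) come from the comparison table;
# the workload size below is an illustrative assumption.
CALLS_PER_TASK = 20
OUT_TOKENS_PER_CALL = 2_000  # assumed average output per call

def task_cost_usd(price_per_m: float) -> float:
    """Total output-token cost for one agent task."""
    return CALLS_PER_TASK * OUT_TOKENS_PER_CALL * price_per_m / 1_000_000

def task_latency_s(sec_per_call: float) -> float:
    """Total wall-clock time if the calls run sequentially."""
    return CALLS_PER_TASK * sec_per_call

print(f"Mercury 2: {task_latency_s(1.7):.0f}s, ${task_cost_usd(0.75):.3f}")
print(f"Haiku 4.5: {task_latency_s(23.4):.0f}s, ${task_cost_usd(5.00):.3f}")
```

Because agent calls are sequential, the gap compounds: the same task drops from nearly eight minutes to about half a minute.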
Why will I love it?
The "Wow" Factor:
- The feel of speed: Moving from "waiting for an answer" to "instant response" is a qualitative shift in user experience.
- Cost disruption: Tasks that used to cost dollars to run via an Agent now cost cents.
What people are saying:
"The speed numbers are absurd. Around 1,000 tokens per second with end-to-end latency of 1.7 seconds. That's an order of magnitude faster." — @RuiDiaoX
Real User Feedback:
Positive: "Impressive inference speed from Inception Labs' diffusion LLMs. Diffusion LLMs are a fascinating alternative to conventional autoregressive LLMs. Well done!" — @AndrewYNg (Andrew Ng, 1224 likes)
Positive: "After trying Mercury, it's hard to go back. We are excited to roll out Mercury to support all of our voice agents." — Customer Feedback (Inception Labs Website)
Watching: "This is a very promising approach, and I hope they build larger, more capable models. If they achieve the performance of a Qwen3.5 34B, it would enable TurboTokens on home PCs." — @TeksEdge
For Indie Hackers & Developers
Tech Stack
- Core Architecture: Diffusion Large Language Model (dLLM), distinct from traditional autoregressive models.
- How it works: Starting from noise, it uses a Transformer network to denoise in multiple steps, modifying multiple tokens at once—similar to how Midjourney generates images, but for text.
- GPU: Runs on NVIDIA Blackwell GPUs.
- API Format: Provided via Inception API, 128K context window.
- Feature Support: Tool calling (function calling), JSON structured output.

Gemini Interpretation: On the left, an autoregressive LLM requires 75 iterations to generate code; on the right, the Inception Diffusion LLM completes the same task in just 14 iterations, a 5x+ efficiency boost.
Core Implementation
Simply put: traditional LLMs are like typists hitting one key at a time. Mercury 2 is like an editor who quickly writes a rough draft and then polishes all the necessary parts at once. Because each step processes multiple tokens in parallel, the effective work done per neural-network inference far exceeds that of an autoregressive model.
This isn't just optimization (like better GPUs or model compression); it's a fundamental change in the path. Diffusion has already proven itself in image and video generation (Midjourney, Sora); Inception is now bringing it to language.
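A toy simulation makes the iteration math concrete. This is only a sketch of the scheduling idea, not Inception's actual sampler: assume an autoregressive decoder commits one token per forward pass, while a diffusion-style decoder finalizes a block of tokens per denoising pass (the block size here is an illustrative assumption, not a published Mercury parameter).

```python
def autoregressive_passes(seq_len: int) -> int:
    # One forward pass per committed token.
    return seq_len

def diffusion_passes(seq_len: int, tokens_per_pass: int) -> int:
    # Start fully "noised"; each denoising pass finalizes up to
    # `tokens_per_pass` tokens in parallel until none remain.
    remaining, passes = seq_len, 0
    while remaining > 0:
        remaining -= min(tokens_per_pass, remaining)
        passes += 1
    return passes

print(autoregressive_passes(75))  # 75 passes for a 75-token snippet
print(diffusion_passes(75, 6))    # 13 passes at ~6 tokens per step
```

With a block size of 1, both schedules coincide; the speedup comes entirely from how many tokens each pass can safely commit.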
Open Source Status
- Model is not open source; available via API only.
- Third-party SDK: https://github.com/hamzaamjad/mercury-client
- Existing Integrations: ProxyAI, Buildglare, Kilo Code, browser-use.
- Paper: https://arxiv.org/abs/2506.17298
- Difficulty to replicate: Extremely high. Requires deep background in diffusion research and massive GPU resources; the founders are professors who helped invent diffusion models.
Business Model
- Monetization: API usage-based billing.
- Pricing: $0.25/M input tokens, $0.75/M output tokens.
- Blended Price: ~$0.38/M tokens (extremely cheap).
- User Base: Not disclosed, but already integrated into several dev tools.
Big Tech Risk
This is the big question. Google is rumored to be working on "Gemini Diffusion." If Google releases a diffusion model that is both fast and flagship-smart, Inception's space could shrink. However:
- Inception is the first to bring diffusion LLMs to commercial scale, giving them a first-mover advantage.
- The founding team are academic authorities in this specific field.
- Investors include Microsoft, NVIDIA, and Databricks—giants choosing to invest rather than build it themselves (yet).
- The risk remains: if they can't scale intelligence beyond the "Haiku" level, speed alone might not be enough long-term.
For Product Managers
Pain Point Analysis
- Problem Solved: Slow LLM inference and high costs, which severely limit the deployment of Agents, real-time voice, and code completion.
- Severity: High. Many companies want to build AI Agents, but the latency makes for a poor user experience. An Agent task might require 20+ LLM calls; waiting 10-20 seconds for each is unacceptable for users.
User Persona
- Core Users: AI application developers, Agent platforms, Voice AI companies.
- Secondary Users: Code editor/IDE companies, search engines.
- Scenarios: Agent loops, real-time voice dialogue, auto-completion, large-scale text processing.
Feature Breakdown
| Feature | Type | Description |
|---|---|---|
| Ultra-fast Inference (1000+ t/s) | Core | The primary advantage of the diffusion architecture |
| Low Cost ($0.75/M output) | Core | 4-7x cheaper than competitors |
| Reasoning Capability (AIME 91.1) | Core | Comparable to Haiku-level models |
| Function Calling | Core | Essential for Agent scenarios |
| JSON Structured Output | Core | Developer-friendly |
| 128K Context | Nice-to-have | Sufficient but not industry-leading |
Competitive Differentiation
| vs | Mercury 2 | Groq (Llama) | Cerebras | SambaNova |
|---|---|---|---|---|
| Core Difference | Architectural Innovation | Hardware Acceleration (LPU) | Hardware Acceleration | Hardware Acceleration |
| Speed Source | Parallel Generation | Specialized Chips | Wafer-scale Chips | Custom Processors |
| Intelligence | Medium | Model-dependent | Model-dependent | Model-dependent |
| Pricing | $0.75/M output | Varies | Varies | Varies |
| Uniqueness | Innovation at the model level | Runs any model | Runs any model | Runs any model |
Key Takeaways
- "Speed as a Feature": Instead of competing on pure intelligence, they hit the extreme on the speed dimension to find a differentiated entry point.
- Academic to Commercial Path: A clear trajectory from paper → open research → commercial API.
- Simple Pricing: Only two price points (input/output) with no complex tiers, lowering the barrier to decision-making.
For Tech Bloggers
Founder Story
- Founders: Stefano Ermon (Stanford Professor), Aditya Grover (UCLA Professor), Volodymyr Kuleshov (Cornell Professor).
- Background: The trio has collaborated for over 10 years and were early researchers of core AI technologies like Diffusion Models, Flash Attention, and DPO (Direct Preference Optimization). Ermon himself was involved in the invention of diffusion models—the tech behind Midjourney and Sora.
- The Mission: Diffusion decisively overtook earlier generative methods in image and video; they want to replicate that success in text.
- Timeline: Founded in 2024 → Stealth debut Feb 2025 → $50M funding Nov 2025 → Mercury 2 release Feb 2026.
Discussion Angles
- Angle 1 - "Fast enough, but smart enough?": Mercury 2 targets the Haiku tier, not Opus or GPT-5. It's perfect for speed-reliant tasks, but can it handle high-level complexity?
- Angle 2 - "Diffusion vs. Autoregressive: The Future?": This is a battle of technical philosophies. If diffusion LLMs can reach flagship intelligence while maintaining speed, the industry landscape will be rewritten.
- Angle 3 - "The Inventors Step In": The founders are the literal inventors of the tech. A story of researchers commercializing their own breakthrough has natural appeal.
- Angle 4 - "The Andrew Ng + Andrej Karpathy Signal": When two of AI's most influential figures bet on the same horse, it's a strong market signal.
Buzz Data
- PH Ranking: 14 votes (Low, as the product is B2B/Developer focused, not for general PH consumers).
- Twitter/X: Founder's tweet got 3,753 likes; Andrew Ng's retweet got 1,224 likes.
- HN Discussion: Dedicated thread (item?id=47144464).
- Media Coverage: Extensive reporting by Bloomberg, TechCrunch, Yahoo Finance, eWeek, InfoWorld, and The Decoder.
Content Suggestions
- Tech Deep Dive: How diffusion models are applied to text generation.
- Founder Profile: The journey from Stanford labs to a $50M startup.
- Market Analysis: The "Inference Speed Race" of 2026.
For Early Adopters
Pricing Analysis
| Tier | Price | Features | Is it enough? |
|---|---|---|---|
| API | $0.25/M input, $0.75/M output | Full model capability, 128K context, function calling, JSON output | Perfectly sufficient for speed-sensitive apps |
Currently API-only with no explicit free tier mentioned. However, the price is incredibly low—generating 1 million output tokens (several books' worth of text) costs only $0.75.
Getting Started Guide
- Setup Time: 10-15 minutes.
- Learning Curve: Low (if you've used OpenAI's API).
- Steps:
  - Apply for an API key at https://www.inceptionlabs.ai/.
  - Install the SDK: `pip install mercury-client`.
  - Call it like the OpenAI API; it supports function calling and JSON mode.
- Note: Diffusion prompts may need slight adjustments compared to GPT prompts.
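Since the API is advertised as OpenAI-compatible, a plain HTTP call is enough to try it. A minimal sketch—the base URL and model id below are assumptions, so confirm both against Inception's official docs:

```python
import json
import os
import urllib.request

BASE_URL = "https://api.inceptionlabs.ai/v1"  # assumed endpoint -- verify in the docs
MODEL = "mercury-2"                           # assumed model id -- verify in the docs

def build_chat_request(prompt: str, max_tokens: int = 256) -> dict:
    """OpenAI-style chat-completions payload."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str) -> str:
    """POST the payload and return the first choice's text."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['INCEPTION_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__" and os.environ.get("INCEPTION_API_KEY"):
    print(chat("Say hello in five words."))
```

Any OpenAI-compatible SDK should work the same way by pointing its base URL at the Inception endpoint.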
Pitfalls & Complaints
- Verbosity: Mercury 2 tends to generate very long outputs. In evaluations, it generated 69M tokens where other models averaged 15M. You may need to explicitly prompt for brevity.
- Niche Robustness: It might be less stable than mature autoregressive models on extremely niche or highly specialized reasoning tasks.
- Fine-tuning Path: If you need to fine-tune, the process for diffusion models differs from traditional methods, and support is currently unclear.
- Ecosystem Lock-in: Only one API provider (Inception Labs), unlike the rich third-party toolchains for OpenAI or Anthropic.
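The verbosity complaint above is usually tamed at request time. A sketch, assuming the model honors the standard OpenAI chat fields (a reasonable but unverified assumption for Mercury 2):

```python
def concise_request(prompt: str, cap: int = 300) -> dict:
    # A brevity system instruction plus a hard max_tokens cap; both are
    # standard OpenAI-schema fields, assumed (not verified) to apply here.
    return {
        "model": "mercury-2",  # assumed model id
        "messages": [
            {"role": "system",
             "content": "Answer concisely. Do not restate the question or pad."},
            {"role": "user", "content": prompt},
        ],
        "max_tokens": cap,
    }

req = concise_request("Summarize this design doc.")
print(req["max_tokens"])  # 300
```

The hard cap guarantees a ceiling on spend even when the brevity instruction is ignored.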
Security & Privacy
- Data Storage: Cloud-based API.
- Privacy Policy: Refer to Inception Labs' specific terms.
- Security Audits: No public information available yet.
Alternatives
| Alternative | Advantage | Disadvantage |
|---|---|---|
| Claude 4.5 Haiku | Mature ecosystem, brand trust | 13x slower, 7x more expensive |
| GPT 5.2 Mini | OpenAI ecosystem, rich tools | 15x slower |
| Groq + Llama | Choice of models | Hardware acceleration, not architectural innovation |
| Gemini 3 Flash | Google ecosystem, multimodal | 4x more expensive, slower |
For Investors
Market Analysis
- Sector Size: AI inference market projected at $106.1B in 2025 → $255B by 2030 (19.2% CAGR).
- Cost Trend: Inference costs are dropping roughly 10x annually.
- Drivers: The explosion of AI Agents requires massive low-latency inference; by 2026, inference cost will be the primary competitive factor.
Competitive Landscape
| Tier | Players | Positioning |
|---|---|---|
| Leaders | OpenAI, Anthropic, Google | All-rounders, leading in intelligence |
| Speed Tier | Groq, Cerebras, SambaNova | Hardware acceleration for existing models |
| Architectural Innovation | Inception Labs (Mercury 2) | Diffusion LLM, speed at the model level |
Timing Analysis
- Why Now?: Agents are the hottest application direction for 2026, but latency and cost are the main bottlenecks. Mercury 2 hits this pain point perfectly.
- Tech Maturity: Diffusion LLMs have academic backing; Mercury 2 is the first commercial-grade implementation.
- Market Readiness: Developers are used to API models, making switching costs low. The challenge is educating the market on the "Diffusion LLM" concept.
Team Background
- Founders: Stefano Ermon (Stanford), Aditya Grover (UCLA), Volodymyr Kuleshov (Cornell)—three top-tier professors.
- Core Contributions: Early, influential work on diffusion models, Flash Attention, DPO, and more.
- Track Record: Extremely high academic citations; among the most influential researchers in AI.
Funding Status
- Total Raised: $56 Million.
- Lead Investor: Menlo Ventures.
- Participants: Mayfield, Innovation Endeavors, NVentures (NVIDIA), M12 (Microsoft), Snowflake Ventures, Databricks Ventures.
- Angel Investors: Andrew Ng, Andrej Karpathy.
- Total Investors: 13.
Conclusion
Mercury 2 is a genuine architectural innovation, not just a patch on old tech. Its speed advantage is overwhelming, but its intelligence remains at the Haiku level. The key to its future is whether it can scale up in capability.
| User Type | Recommendation |
|---|---|
| Developers | ✅ If you're building Agents or real-time apps, this is a must-try. Familiar API, low switching cost. |
| Product Managers | ✅ Watch this space. Lower speed and cost could unlock product forms that were previously impossible. |
| Bloggers | ✅ Great material. Founder story + technical rivalry + big-name backing. |
| Early Adopters | ✅ Worth a try; the API is very cheap. Just don't expect it to replace Claude Opus for complex tasks. |
| Investors | ✅ Top-tier team, great sector, perfect timing. The risk lies in Big Tech giants adopting similar architectures. |
Resource Links
| Resource | Link |
|---|---|
| Official Website | https://www.inceptionlabs.ai/ |
| Blog | https://www.inceptionlabs.ai/blog/introducing-mercury-2 |
| Artificial Analysis | https://artificialanalysis.ai/models/mercury-2 |
| Paper | https://arxiv.org/abs/2506.17298 |
| Python SDK | https://github.com/hamzaamjad/mercury-client |
| HN Discussion | https://news.ycombinator.com/item?id=47144464 |
| ProductHunt | https://www.producthunt.com/products/mercury-2 |
| Founder's Twitter | https://twitter.com/StefanoErmon |
2026-02-26 | Trend-Tracker v7.3