
Fish Audio S2

Text-to-Speech Software

Real Expressive AI Voices

💡 Fish Audio is the most expressive and emotionally rich text-to-speech model available today. It generates lifelike voices that capture emotion, rhythm, and nuance with remarkable realism. With Fish Audio Voice Clone, you can recreate a natural-sounding voice from just 10 seconds of audio—preserving accents, tones, and unique speaking habits. Proudly built by the open-source team behind So-VITS-SVC and Bert-VITS2, it truly gives a soul to every digital voice.

"Fish Audio S2 is like a world-class voice actor who lives in your pocket, ready to laugh, whisper, or cry on command."

Hype: 8/10 · Utility: 9/10 · Votes: 569

Full Analysis Report

Fish Audio S2: The 'Emotional Revolution' in Open-Source TTS, ElevenLabs' Strongest Challenger

2026-03-10 | ProductHunt · Official Site · GitHub


30-Second Quick Judgment

What does it do?: An open-source AI voice synthesis model. If you type [whisper], [laugh], or [sigh] in your text, it actually whispers, laughs, or sighs. It supports 80+ languages and can clone your voice with just a 10-second reference clip.

Is it worth your attention?: Absolutely. This is currently the top-performing TTS model in benchmarks, beating out OpenAI's gpt-4o-mini-tts, ByteDance's Seed-TTS, and MiniMax Speech. More importantly—its API is only 1/4 the price of ElevenLabs. If you're into podcasts, audiobooks, game dubbing, or AI customer service, this will directly disrupt your cost structure.


Three Questions: Why it matters to you

Is it relevant to me?

Target users:

  • Podcast/audiobook content creators
  • YouTubers needing multilingual dubbing
  • Developers building AI voice assistants or customer service
  • Game dev teams needing NPC dialogue
  • Educational content companies

Am I the target?: If you've ever needed to turn text into speech—whether for translation, podcasting, or game characters—you are the target user.

When would I use it?:

  • Producing daily multilingual podcast content → Use the S2 API for batch generation.
  • Creating an emotional AI customer service agent → Use inline tags to control tone.
  • Needing multi-character dialogue in a game → Generate multiple roles at once without separate recordings.
  • Tired of paying ElevenLabs $99/month → S2 Pro is only $75/month and gives you 7x more time.
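The multilingual-podcast workflow above can be sketched as a small batch loop. Everything here is illustrative: `build_request`, the payload keys, and the voice ID are assumptions made for the sketch, not the official fish-audio-sdk API.

```python
# Hypothetical batch-generation sketch. build_request and its payload keys
# are illustrative assumptions, NOT the official fish-audio-sdk interface.
EPISODES = {
    "en": "Welcome back! [excited] Today's episode is a big one.",
    "ja": "おかえりなさい。[whisper] 今日は特別な回です。",
    "de": "Willkommen zurück! [laugh] Heute wird es spannend.",
}

def build_request(lang: str, text: str, voice_id: str = "my-cloned-voice") -> dict:
    """Assemble one TTS payload. S2 needs no phoneme preprocessing, so the
    same cloned voice ID can be reused across every language edition."""
    return {"text": text, "reference_id": voice_id, "format": "mp3", "lang": lang}

# One request per language edition, ready to send to the API in a loop.
batch = [build_request(lang, text) for lang, text in EPISODES.items()]
print(len(batch))  # → 3
```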

Is it actually useful?

| Dimension | Benefit | Cost |
|---|---|---|
| Time | Generate multi-character dialogue at once, saving time on separate recording/synthesis | ~30 mins to learn Inline Tags syntax |
| Money | API is 75%+ cheaper than ElevenLabs; self-hosting is nearly free | Starts at $11/mo for Plus; self-hosting needs a 24GB GPU |
| Effort | Generate voice with just 3 lines of code via SDK | Self-hosting requires tinkering with CUDA/SGLang |

ROI Judgment: If you currently use ElevenLabs and spend over $50/month, switching now will save you a fortune. If you only use TTS occasionally, the free 7 mins/month is plenty for a trial.

Why you'll love it

The "Aha!" moments:

  • Emotion control is magic: Type [whisper] and it actually lowers its voice; type [sigh] and it actually sighs. It’s not just a simple pitch shift.
  • Multi-role generation: No need to upload reference audio for every character separately; handle an entire conversation in one go.
  • 80+ languages with zero config: No phoneme labeling needed; just drop in Chinese, Japanese, or Arabic and it speaks.

A "Wow" moment:

"I typed [laughing nervously], and the AI actually laughed. That's when I realized AI voice has truly grown up." — @anujcodes_21

Real User Reviews:

"The most expressive open-weight TTS model; voice cloning works flawlessly in Arabic, German, and English." — @fahdmirza

"AI voice cloning just got dangerous." — @hasantoxr

"Sound quality and naturalness are solid for TTS—perfect for real-time chat, multi-character stories, and long-form reading." — @aigclink


For Indie Developers

Tech Stack

  • Model Architecture: Dual-AR (Dual Auto-Regressive), based on a Qwen3 backbone
    • Slow AR: 4B parameters, predicting semantic codebooks along the time axis
    • Fast AR: 400M parameters, generating 9 residual codebooks per time step
  • Audio Encoding: RVQ-based codec, 10 codebooks, ~21 Hz frame rate
  • Post-training: GRPO (Group Relative Policy Optimization) reinforcement learning alignment
  • Inference Engine: SGLang (featuring continuous batching, paged KV cache, and CUDA graph replay)
  • Training Data: 10M+ hours of audio, 80+ languages
  • SDK: Python fish-audio-sdk / Node.js fish-audio-sdk
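A quick back-of-envelope from the numbers above: at a ~21 Hz frame rate with 10 codebooks per step (1 semantic codebook from the slow AR plus 9 residual codebooks from the fast AR), one second of audio costs roughly 210 tokens.

```python
# Back-of-envelope token throughput, using the codec figures listed above.
FRAME_RATE_HZ = 21       # ~21 Hz frame rate of the RVQ codec
SEMANTIC_CODEBOOKS = 1   # predicted by the 4B "slow" AR per time step
RESIDUAL_CODEBOOKS = 9   # generated by the 400M "fast" AR per time step

tokens_per_second = FRAME_RATE_HZ * (SEMANTIC_CODEBOOKS + RESIDUAL_CODEBOOKS)
print(tokens_per_second)       # → 210 audio tokens per second
print(tokens_per_second * 60)  # → 12600 tokens per minute of speech
```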

Core Feature Implementation

Simply put, S2's core innovation is turning "emotion control" into natural language instructions rather than fixed SSML tags. You can insert descriptions like [whisper], [excited], or [pitch up] anywhere in the text, and the model changes style at that exact spot. This is a hundred times more flexible than Google/Azure's <prosody rate="slow">.
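To make the contrast concrete, here is a rough sketch of the two styles side by side: the first string is standard Google/Azure-style SSML markup, the second follows S2's bracketed-description convention. The `strip_tags` helper is illustrative, not part of any SDK.

```python
import re

# SSML needs structured markup around each span; S2-style inline tags are
# just free-form bracketed descriptions dropped into the text itself.
ssml_style = '<speak><prosody rate="slow" volume="soft">I have a secret.</prosody></speak>'
inline_style = "[whisper] I have a secret. [pitch up] But not for long!"

def strip_tags(text: str) -> str:
    """Remove bracketed style tags to recover the words actually spoken.
    Illustrative helper, not part of the fish-audio-sdk."""
    return re.sub(r"\[[^\]]*\]\s*", "", text).strip()

print(strip_tags(inline_style))  # → "I have a secret. But not for long!"
```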

Multi-character generation is also clever—using the <|speaker:i|> token to mark different speakers, you can handle an entire dialogue in a single inference pass without running them separately.
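A minimal sketch of how such a dialogue might be flattened into a single prompt using the <|speaker:i|> tokens mentioned above. The exact serialization the model expects is an assumption here, so check the technical report and API docs before relying on it.

```python
# Sketch: flatten a two-character scene into one prompt with <|speaker:i|>
# tokens. The exact prompt format the model expects is an ASSUMPTION —
# verify against the official docs before use.
dialogue = [
    (0, "Did you hear that? [whisper] Something moved."),
    (1, "[laugh] It's just the wind. Relax."),
    (0, "[sigh] I hope you're right."),
]

# One inference pass handles the whole conversation — no per-character runs.
prompt = "".join(f"<|speaker:{speaker_id}|>{line}" for speaker_id, line in dialogue)
print(prompt)
```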

Open Source Status

  • Code: Apache 2.0 (True open source)
  • Model Weights: Fish Audio Research License (Free for research, commercial use requires authorization)
  • Here's the catch: Twitter community notes point out this isn't "true" open source; it's more accurately "source-available."
  • GitHub: fishaudio/fish-speech
  • HuggingFace: fishaudio/s2-pro
  • Technical Report: arxiv 2603.08823
  • Similar Projects: Coqui TTS (Discontinued), StyleTTS2, XTTS
  • Difficulty to replicate: Extremely high. A 4.4B parameter model + 10M hours of training data is nearly impossible to reproduce without a large-scale GPU cluster.

Business Model

  • Monetization: Usage-based API subscription
  • Pricing: Free $0 (7 mins/mo) → Plus $11/mo (200 mins) → Pro $75/mo (27 hours)
  • MAU: 420,000+ (Mid-2025)
  • ARR: $5M+ (April 2025)
  • Active Developers: 20,000+

Giant Risk

Medium-high. OpenAI has gpt-4o-mini-tts, Google has Cloud TTS, and Microsoft has Azure Speech. However, S2 beats these giants in benchmarks and has a more aggressive pricing strategy. The real risk isn't being shut down, but ElevenLabs following suit with price cuts. That said, Fish Audio's open-source ecosystem (built from So-VITS-SVC, GPT-SoVITS, etc.) is a strong moat.


For Product Managers

Pain Point Analysis

  • Problem Solved: Bridges the gap between cheap but robotic models (Google/Azure) and expressive but expensive ones (ElevenLabs).
  • Severity: High-frequency demand. Any scenario requiring "AI to speak" needs TTS, and users are increasingly intolerant of robotic tones.

User Persona

  • Primary Users: Developers (integrating into their own products), content creators (podcasts/audiobooks).
  • Secondary Users: Enterprise clients (customer service/IVR), game developers.
  • Use Cases: Batch content production, real-time conversational AI, multilingual localized dubbing.

Feature Breakdown

| Feature | Type | Description |
|---|---|---|
| Inline Tags Emotion Control | Core | Natural language instructions to control tone, with 15,000+ descriptions |
| Zero-Shot Voice Cloning | Core | Clone a voice with 10-30 seconds of reference audio |
| Multi-Speaker Support | Core | Generate multi-person dialogue in a single pass |
| 80+ Language Support | Core | No phoneme preprocessing required |
| <150ms Latency | Nice-to-have | Essential for real-time conversation scenarios |
| Self-hosting | Nice-to-have | For data-sensitive enterprises |

Competitor Comparison

| Dimension | Fish Audio S2 | ElevenLabs | Google Cloud TTS | OpenAI TTS |
|---|---|---|---|---|
| Core Differentiator | Natural language emotion control | Mature voice marketplace ecosystem | Enterprise-grade stability | GPT ecosystem integration |
| Price | Pro $75/mo (27h) | Pro $99/mo (500k credits) | Per character | Per token |
| Open Source | Code open / Weights restricted | Fully closed | Fully closed | Fully closed |
| Pros | Cheap + Expressive + Open | Mature ecosystem + Variety | Stable & Reliable | Seamless GPT integration |
| Cons | License controversy, GPU heavy | Expensive | Weak expressiveness | Average expressiveness |

Key Takeaways

  1. Inline Tags Design: Embedding control instructions into the text itself rather than using a separate markup language significantly lowers the barrier to entry.
  2. "Open Source First" Strategy: Building a community through So-VITS-SVC and GPT-SoVITS before monetizing via API has created a massive user base.
  3. Benchmark-Driven Narrative: Instead of just saying "we're good," they use data to prove they are better than OpenAI and Google.

For Tech Bloggers

Founder Story

  • Shijia Liao (Leng Yue), Gen Z.
  • Former NVIDIA researcher with 7+ years in the AI audio field.
  • A legend in the open-source world—author/core contributor of viral projects like So-VITS-SVC, GPT-SoVITS, and Bert-VITS2.
  • After leaving NVIDIA, he started Fish Audio using an RTX 4090 GPU at home.
  • A 4-person Gen Z founding team that grew ARR from $400K to $5M in just 3 months.
  • Accepted into the HF0 incubator (a YC-level AI-specific incubator, $1M SAFE for 5%).

Story Angle: A Gen Z open-source master leaves NVIDIA to build a TTS model on a home 4090 that beats OpenAI and Google. It’s a classic underdog story.

Controversies/Discussion Angles

  • The "Pseudo-Open Source" Debate: Code is Apache, but weights have a non-commercial license. Twitter community notes have flagged this as "misleading."
  • AI Voice Ethics: The ability to clone anyone's voice in 10 seconds raises serious deepfake concerns.
  • China Team vs. US Registration: Registered in Delaware, but the team background and open-source roots are in China.
  • Open Source vs. Commercialization: How to find the balance between "letting everyone use it" and "making enough to sustain the team."

Hype Data

  • ProductHunt: 569 votes
  • Twitter: Retweeted by multiple influencers with 10k+ followers (Fahd Mirza, Hasan Toor, etc.).
  • LMSYS Official tweeted congratulations (one of the most authoritative orgs in LLM evaluation).
  • Arxiv Technical Report: Gaining significant academic attention.
  • Reddit r/LocalLLaMA: High-engagement discussions.

Content Suggestions

  • Angle: "From a 4090 to Beating OpenAI: The Startup Story of a Gen Z Open-Source TTS."
  • Trending Topics: AI voice cloning security discussions, the Open Source vs. Closed Source AI roadmap debate.
  • Video Direction: A side-by-side test of ElevenLabs vs. Fish Audio S2 to let the audience hear the difference.

For Early Adopters

Pricing Analysis

| Tier | Price | Features | Is it enough? |
|---|---|---|---|
| Free | $0 | 7 mins/mo, 8000 credits, personal use only | Good for testing, not for production |
| Plus | $11/mo | 200 mins, API access, commercial license | Enough for individual creators |
| Pro | $75/mo | 27 hours, priority, 30,000 chars/request | Enough for medium-output teams |
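Working out the effective per-minute price of each paid tier from the numbers above:

```python
# Effective per-minute cost of each paid tier, from the pricing table above.
plus_per_min = 11 / 200        # $11 for 200 minutes
pro_per_min = 75 / (27 * 60)   # $75 for 27 hours = 1620 minutes

print(round(plus_per_min, 3))  # → 0.055 (~$0.055 per minute)
print(round(pro_per_min, 3))   # → 0.046 (~$0.046 per minute)
```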

Vs. ElevenLabs: ElevenLabs Pro at $99/mo gives only 500k credits (~100 mins of high-quality voice). Fish Audio Pro at $75/mo gives 27 hours. The value for money is overwhelming.

Getting Started Guide

  • Setup Time: 5 mins (API) / 30 mins (Self-hosting)
  • Learning Curve: Low (API) / Medium-High (Self-hosting)
  • Steps:
    1. Register at fish.audio.
    2. Create an app to get your API Key.
    3. pip install fish-audio-sdk
    4. Three lines of code to generate voice (illustrative sketch — the exact client and method names may differ; check docs.fish.audio for the current interface):

```python
# Illustrative sketch only — verify class/method names against docs.fish.audio.
from fishaudio import FishAudio

client = FishAudio(api_key="your_key")
audio = client.tts.convert(text="Hello [whisper] this is a secret [/whisper]")
```

Pitfalls and Complaints

  1. High GPU barrier: Self-hosting needs at least 12GB VRAM, 24GB recommended. An RTX 3060 takes 15s for 1 min of audio.
  2. "Open Source" with a catch: Commercial use of model weights requires a separate license—don't assume it's free for your product just because you can download it.
  3. English naturalness: It won benchmarks, but to the human ear, some English voices aren't quite as natural as ElevenLabs yet.
  4. Tag conflicts: Don't overdo it. Putting [whisper] and [excited] together might confuse the model.
  5. Strict free tier: 7 mins/month is fine for a demo, but not for real work.

Security and Privacy

  • Data Storage: API calls go through the cloud (fish.audio servers); self-hosting is entirely local.
  • Privacy Policy: Registered in Delaware, complying with US laws.
  • Voice Cloning Risk: 10 seconds is all it takes. While the platform has terms of use, the technology itself cannot be fully prevented from abuse.

Alternatives

| Alternative | Pros | Cons |
|---|---|---|
| ElevenLabs | Mature ecosystem, stable quality, large market | 3-4x more expensive |
| OpenAI TTS | Great GPT ecosystem integration | Less expressive than S2 |
| StyleTTS2 | Fully free and open source | Performance lags behind S2 |
| Bark | Free, supports non-speech sound effects | Quality lags behind S2 |
| XTTS | Strong community (by Coqui) | Project is discontinued |

For Investors

Market Analysis

  • Market Size: Global TTS market ~$4B in 2025, projected to reach $7.6-8.3B by 2030.
  • Growth Rate: CAGR from 12-16% (conservative) to 23% (optimistic).
  • Drivers: AI Agents needing to speak, the explosion of podcasts/audiobooks, accessibility needs, and automotive voice interaction.

Competitive Landscape

| Tier | Players | Positioning |
|---|---|---|
| Top Tier | Microsoft, Google, ElevenLabs | Cloud TTS Services |
| Mid Tier | OpenAI, MiniMax, ByteDance (Seed-TTS) | AI-Native TTS |
| New Entrants | Fish Audio | Open Source Community + API Service |

Fish Audio's position is unique—using open source to capture the market and API for monetization, similar to Hugging Face's path with NLP models.

Timing Analysis

  • Why Now?: The AI Agent wave has created a massive need for "AI that can talk"; falling inference costs allow 4.4B parameter models to be commercially viable.
  • Tech Maturity: Benchmarks show it has surpassed closed-source solutions, though production stability needs time to be proven.
  • Market Readiness: ElevenLabs has already educated the market; users know what AI voice can do but find it too expensive—Fish Audio is perfectly positioned to capture this demand.

Team Background

  • Founder: Shijia Liao (Leng Yue), former NVIDIA researcher with 7+ years in AI audio.
  • Core Team: 4-person Gen Z founding team.
  • Track Record: Projects like So-VITS-SVC and GPT-SoVITS have massive influence in the AI synthesis community.
  • Chief Scientist: Former NVIDIA + University of Maryland researcher.

Funding Status

  • Known Funding: HF0 incubator ($1M uncapped SAFE for 5%) + at least one pre-HF0 round.
  • Investors: Specific institutions not disclosed.
  • ARR: $5M+ (April 2025).
  • MAU: 420,000+.

Conclusion

Fish Audio S2 is one of the most significant AI voice releases of March 2026. With open-source code, top-tier benchmarks, and aggressive pricing, it is directly challenging ElevenLabs' dominance.

| User Type | Recommendation |
|---|---|
| Developers | ✅ Highly Recommended. Leading tech, cheap API, great SDK. Watch the weight license. |
| Product Managers | ✅ Recommended. Inline Tags are a brilliant design choice; the pricing strategy is smart. |
| Bloggers | ✅ Great for content. Gen Z founder vs. Giants + Open Source debate + AI Ethics. |
| Early Adopters | ✅ Recommended. Start with the free tier; the $11/mo Plus tier is enough for most creators. |
| Investors | ✅ Worth watching. $4B market, $5M ARR, 420K MAU—the growth flywheel is spinning. |

Resource Links

| Resource | Link |
|---|---|
| Official Site | fish.audio |
| S2 Product Page | fish.audio/s2 |
| GitHub | fishaudio/fish-speech |
| HuggingFace | fishaudio/s2-pro |
| Technical Report | arxiv 2603.08823 |
| API Docs | docs.fish.audio |
| Python SDK | fishaudio/fish-audio-python |
| ProductHunt | producthunt.com/products/fish-audio-s2 |
| Twitter | @FishAudio |
| Founder LinkedIn | Shijia Liao |
| Blog | fish.audio/blog |

2026-03-16 | Trend-Tracker v7.3

One-line Verdict

Fish Audio S2 is currently the strongest open-source TTS challenger. With top-tier expressiveness and extreme cost-efficiency, it is the go-to alternative for developers and creators.

FAQ

Frequently Asked Questions about Fish Audio S2


Q: What are the main features of Fish Audio S2?
A: Inline Tags for natural-language emotion control, zero-shot 10-second voice cloning, multi-character dialogue generation in a single pass, and 80+ language support.

Q: How much does it cost?
A: Free (7 mins/mo), Plus ($11/mo, 200 mins), Pro ($75/mo, 27 hours).

Q: Who is it for?
A: Podcast/audiobook creators, YouTubers, AI customer service developers, game dev teams, and educational content companies.

Q: What are the alternatives?
A: ElevenLabs, OpenAI TTS, Google Cloud TTS, MiniMax Speech, and Seed-TTS.

Data source: ProductHunt · Mar 16, 2026