What are the main features of Fish Audio S2?

The main features of Fish Audio S2 are inline tags for natural-language emotion control, zero-shot voice cloning from a 10-second clip, multi-character dialogue generation in a single pass, and support for 80+ languages.

How much does Fish Audio S2 cost?

Free (7 mins/mo), Plus ($11/mo, 200 mins), Pro ($75/mo, 27 hours).

Who is Fish Audio S2 for?

Podcast/audiobook creators, YouTubers, AI customer service developers, game dev teams, and educational content companies.

What are the alternatives to Fish Audio S2?

Alternatives to Fish Audio S2 include ElevenLabs, OpenAI TTS, Google Cloud TTS, MiniMax Speech, and Seed-TTS.

Fish Audio S2: The 'Emotional Revolution' in Open-Source TTS, ElevenLabs' Strongest Challenger

2026-03-10 | ProductHunt · Official Site · GitHub

30-Second Quick Judgment

What does it do?: An open-source AI voice synthesis model. If you type [whisper], [laugh], or [sigh] in your text, it actually whispers, laughs, or sighs. It supports 80+ languages and can clone your voice with just a 10-second reference clip.

Is it worth your attention?: Absolutely. This is currently the top-performing TTS model in benchmarks, beating OpenAI's gpt-4o-mini-tts, ByteDance's Seed-TTS, and MiniMax Speech. More importantly, its API costs only about a quarter of what ElevenLabs charges. If you work on podcasts, audiobooks, game dubbing, or AI customer service, this will directly disrupt your cost structure.

Three Questions: Why it matters to you

Is it relevant to me?

Target users:

Podcast/audiobook content creators
YouTubers needing multilingual dubbing
Developers building AI voice assistants or customer service
Game dev teams needing NPC dialogue
Educational content companies

Am I the target?: If you've ever needed to turn text into speech—whether for translation, podcasting, or game characters—you are the target user.

When would I use it?:

Producing daily multilingual podcast content → Use the S2 API for batch generation.
Creating an emotional AI customer service agent → Use inline tags to control tone.
Needing multi-character dialogue in a game → Generate multiple roles at once without separate recordings.
Tired of paying ElevenLabs $99/month → S2 Pro is $75/month and includes 27 hours of audio.

Is it actually useful?

Dimension	Benefit	Cost
Time	Generate multi-character dialogue at once, saving time on separate recording/synthesis	~30 mins to learn Inline Tags syntax
Money	API is 75%+ cheaper than ElevenLabs; self-hosting is nearly free	Starts at $11/mo for Plus; self-hosting needs at least 12GB of VRAM (24GB recommended)
Effort	Generate voice with just 3 lines of code via SDK	Self-hosting requires tinkering with CUDA/SGLang

ROI Judgment: If you currently use ElevenLabs and spend over $50/month, switching should cut your bill substantially. If you only use TTS occasionally, the free 7 mins/month is plenty for a trial.

Why you'll love it

The "Aha!" moments:

Emotion control is magic: Type [whisper] and it actually lowers its voice; type [sigh] and it actually sighs. It’s not just a simple pitch shift.
Multi-role generation: No need to upload reference audio for every character separately; handle an entire conversation in one go.
80+ languages with zero config: No phoneme labeling needed; just drop in Chinese, Japanese, or Arabic and it speaks.

A "Wow" moment:

"I typed [laughing nervously], and the AI actually laughed. That's when I realized AI voice has truly grown up." — @anujcodes_21

Real User Reviews:

"The most expressive open-weight TTS model; voice cloning works flawlessly in Arabic, German, and English." — @fahdmirza

"AI voice cloning just got dangerous." — @hasantoxr

"Sound quality and naturalness are solid for TTS—perfect for real-time chat, multi-character stories, and long-form reading." — @aigclink

For Indie Developers

Tech Stack

Model Architecture: Dual-AR (Dual Auto-Regressive), based on a Qwen3 backbone
- Slow AR: 4B parameters, predicting semantic codebooks along the time axis
- Fast AR: 400M parameters, generating 9 residual codebooks per time step
Audio Encoding: RVQ-based codec, 10 codebooks, ~21 Hz frame rate
Post-training: GRPO (Group Relative Policy Optimization) reinforcement learning alignment
Inference Engine: SGLang (featuring continuous batching, paged KV cache, and CUDA graph replay)
Training Data: 10M+ hours of audio, 80+ languages
SDK: Python fish-audio-sdk / Node.js fish-audio-sdk
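
To make the codec figures above concrete, here is a rough sketch of how many codebook tokens the two decoders would emit for a clip. It assumes the approximate 21 Hz frame rate quoted above and reads "10 codebooks" as 1 semantic codebook plus the 9 residual ones; that split, the function name, and the rounding are illustrative assumptions, not taken from Fish Audio's code.

```python
# Back-of-the-envelope token budget for the Dual-AR layout described above.
FRAME_RATE_HZ = 21        # approximate frame rate quoted in the article
RESIDUAL_CODEBOOKS = 9    # generated by the Fast AR at every time step


def estimate_tokens(seconds: float) -> dict:
    """Return estimated token counts for `seconds` of audio."""
    frames = round(seconds * FRAME_RATE_HZ)
    return {
        "frames": frames,
        "slow_ar_tokens": frames,  # assumed: one semantic token per frame
        "fast_ar_tokens": frames * RESIDUAL_CODEBOOKS,
        "total_tokens": frames * (1 + RESIDUAL_CODEBOOKS),
    }


print(estimate_tokens(60))
```

Under these assumptions the large 4B model only has to step once per frame, while the small 400M model fills in the remaining nine codebooks, which is the point of splitting the work between a slow and a fast decoder.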

Core Feature Implementation

Simply put, S2's core innovation is turning "emotion control" into natural language instructions rather than fixed SSML tags. You can insert descriptions like [whisper], [excited], or [pitch up] anywhere in the text, and the model changes style at that exact spot. Because the tags are free-form descriptions rather than a fixed set of attributes, this is far more flexible than SSML markup such as Google/Azure's <prosody rate="slow">.
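
Since the tags are just bracketed text inside the script, client code only has to handle plain strings. The sketch below is a hypothetical helper pair (not part of any Fish Audio SDK) that lists the tags in a script and produces a tag-free version, for example for subtitles. It assumes tags are written as [description] with no nested brackets.

```python
import re

# Assumed tag shape: square brackets around a free-form description.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def extract_tags(text: str) -> list[tuple[int, str]]:
    """Return (character offset, tag description) for each [tag] in text."""
    return [(m.start(), m.group(1)) for m in TAG_PATTERN.finditer(text)]


def strip_tags(text: str) -> str:
    """Remove tags and collapse leftover whitespace."""
    return re.sub(r"\s+", " ", TAG_PATTERN.sub("", text)).strip()


script = "I can't believe it. [sigh] Fine, [whisper] meet me at midnight."
print(extract_tags(script))
print(strip_tags(script))
```

The same script string can be sent to the TTS endpoint unchanged, while the stripped version serves as the human-readable transcript.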

Multi-character generation is also clever: each speaker is marked with a <|speaker:i|> token, so the model can handle an entire dialogue in a single inference pass instead of one run per character.
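
As an illustration of what such a prompt might look like, the sketch below assembles a two-speaker script using the <|speaker:i|> token. Only the token itself comes from the description above; the zero-based numbering, the one-turn-per-line layout, and the build_dialogue helper are assumptions made for this example, so check the official docs for the exact format.

```python
def build_dialogue(turns: list[tuple[int, str]]) -> str:
    """Join (speaker index, line) pairs into a single prompt string."""
    parts = []
    for speaker, line in turns:
        if speaker < 0:
            raise ValueError("speaker index must be non-negative")
        # Prefix each turn with its speaker token (assumed layout).
        parts.append(f"<|speaker:{speaker}|>{line.strip()}")
    return "\n".join(parts)


prompt = build_dialogue([
    (0, "Did you hear that? [whisper] Someone is outside."),
    (1, "[laughing nervously] It's just the wind."),
])
print(prompt)
```

Inline tags and speaker tokens live in the same string, so one request can carry both the casting and the emotional direction.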

Open Source Status

Code: Apache 2.0 (True open source)
Model Weights: Fish Audio Research License (Free for research, commercial use requires authorization)
Here's the catch: Twitter community notes point out this isn't "true" open source; it's more accurately "source-available."
GitHub: fishaudio/fish-speech
HuggingFace: fishaudio/s2-pro
Technical Report: arxiv 2603.08823
Similar Projects: Coqui TTS (Discontinued), StyleTTS2, XTTS
Difficulty to replicate: Extremely high. A 4.4B parameter model + 10M hours of training data is nearly impossible to reproduce without a large-scale GPU cluster.

Business Model

Monetization: Usage-based API subscription
Pricing: Free $0 (7 mins/mo) → Plus $11/mo (200 mins) → Pro $75/mo (27 hours)
MAU: 420,000+ (Mid-2025)
ARR: $5M+ (April 2025)
Active Developers: 20,000+

Giant Risk

Medium-high. OpenAI has gpt-4o-mini-tts, Google has Cloud TTS, and Microsoft has Azure Speech. However, S2 beats these giants in benchmarks and has a more aggressive pricing strategy. The real risk isn't being shut down, but ElevenLabs following suit with price cuts. That said, Fish Audio's open-source ecosystem (built from So-VITS-SVC, GPT-SoVITS, etc.) is a strong moat.

For Product Managers

Pain Point Analysis

Problem Solved: Bridges the gap between cheap but robotic models (Google/Azure) and expressive but expensive ones (ElevenLabs).
Severity: High-frequency demand. Any scenario requiring "AI to speak" needs TTS, and users are increasingly intolerant of robotic tones.

User Persona

Primary Users: Developers (integrating into their own products), content creators (podcasts/audiobooks).
Secondary Users: Enterprise clients (customer service/IVR), game developers.
Use Cases: Batch content production, real-time conversational AI, multilingual localized dubbing.

Feature Breakdown

Feature	Type	Description
Inline Tags Emotion Control	Core	Natural language instructions to control tone, with 15,000+ descriptions
Zero-Shot Voice Cloning	Core	Clone a voice with 10-30 seconds of reference audio
Multi-Speaker Support	Core	Generate multi-person dialogue in a single pass
80+ Language Support	Core	No phoneme preprocessing required
<150ms Latency	Nice-to-have	Essential for real-time conversation scenarios
Self-hosting	Nice-to-have	For data-sensitive enterprises

Competitor Comparison

Dimension	Fish Audio S2	ElevenLabs	Google Cloud TTS	OpenAI TTS
Core Differentiator	Natural language emotion control	Mature voice marketplace ecosystem	Enterprise-grade stability	GPT ecosystem integration
Price	Pro $75/mo (27h)	Pro $99/mo (500k credits)	Per character	Per token
Open Source	Code open / Weights restricted	Fully closed	Fully closed	Fully closed
Pros	Cheap + Expressive + Open	Mature ecosystem + Variety	Stable & Reliable	Seamless GPT integration
Cons	License controversy, GPU heavy	Expensive	Weak expressiveness	Average expressiveness

Key Takeaways

Inline Tags Design: Embedding control instructions into the text itself rather than using a separate markup language significantly lowers the barrier to entry.
"Open Source First" Strategy: Building a community through So-VITS-SVC and GPT-SoVITS before monetizing via API has created a massive user base.
Benchmark-Driven Narrative: Instead of just saying "we're good," they use data to prove they are better than OpenAI and Google.

For Tech Bloggers

Founder Story

Shijia Liao (Leng Yue), Gen Z.
Former NVIDIA researcher with 7+ years in the AI audio field.
A legend in the open-source world—author/core contributor of viral projects like So-VITS-SVC, GPT-SoVITS, and Bert-VITS2.
After leaving NVIDIA, he started Fish Audio using an RTX 4090 GPU at home.
A 4-person Gen Z founding team that grew ARR from $400K to $5M in just 3 months.
Accepted into the HF0 incubator (a YC-level AI-specific incubator, $1M SAFE for 5%).

Story Angle: A Gen Z open-source master leaves NVIDIA to build a TTS model on a home 4090 that beats OpenAI and Google. It’s a classic underdog story.

Controversies/Discussion Angles

The "Pseudo-Open Source" Debate: Code is Apache, but weights have a non-commercial license. Twitter community notes have flagged this as "misleading."
AI Voice Ethics: The ability to clone anyone's voice in 10 seconds raises serious deepfake concerns.
China Team vs. US Registration: Registered in Delaware, but the team background and open-source roots are in China.
Open Source vs. Commercialization: How to find the balance between "letting everyone use it" and "making enough to sustain the team."

Hype Data

ProductHunt: 569 votes
Twitter: Retweeted by multiple influencers with 10k+ followers (Fahd Mirza, Hasan Toor, etc.).
LMSYS Official tweeted congratulations (one of the most authoritative orgs in LLM evaluation).
Arxiv Technical Report: Gaining significant academic attention.
Reddit r/LocalLLaMA: High-engagement discussions.

Content Suggestions

Angle: "From a 4090 to Beating OpenAI: The Startup Story of a Gen Z Open-Source TTS."
Trending Topics: AI voice cloning security discussions, the Open Source vs. Closed Source AI roadmap debate.
Video Direction: A side-by-side test of ElevenLabs vs. Fish Audio S2 to let the audience hear the difference.

For Early Adopters

Pricing Analysis

Tier	Price	Features	Is it enough?
Free	$0	7 mins/mo, 8000 credits, personal use only	Good for testing, not for production
Plus	$11/mo	200 mins, API access, commercial license	Enough for individual creators
Pro	$75/mo	27 hours, priority, 30,000 chars/request	Enough for medium-output teams

Vs. ElevenLabs: ElevenLabs Pro at $99/mo gives only 500k credits (~100 mins of high-quality voice). Fish Audio Pro at $75/mo gives 27 hours, roughly 16 times the audio for about three-quarters of the price.
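
The comparison is easier to judge as a per-minute cost. The sketch below uses only the plan figures quoted in this article; the ElevenLabs minutes are the article's rough estimate for 500k credits, not an official number, and the plan labels are just dictionary keys chosen for this example.

```python
# (monthly price in USD, included minutes) as quoted in this article.
PLANS = {
    "Fish Audio Plus": (11, 200),
    "Fish Audio Pro": (75, 27 * 60),
    "ElevenLabs Pro": (99, 100),  # ~100 mins is the article's estimate
}


def cost_per_minute(price_usd: float, minutes: float) -> float:
    """Effective price per included minute of audio."""
    return price_usd / minutes


for name, (price, minutes) in PLANS.items():
    print(f"{name}: ${cost_per_minute(price, minutes):.3f} per minute")
```

On these figures both Fish Audio tiers land at around five cents per minute, against roughly a dollar per minute for the ElevenLabs estimate. Actual usage depends on each vendor's credit accounting, so treat this as an order-of-magnitude check.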

Getting Started Guide

Setup Time: 5 mins (API) / 30 mins (Self-hosting)
Learning Curve: Low (API) / Medium-High (Self-hosting)
Steps:
1. Register at fish.audio.
2. Create an app to get your API Key.
3. pip install fish-audio-sdk
4. Three lines of code to generate voice:

from fishaudio import FishAudio

client = FishAudio(api_key="your_key")  # the API key from step 2
audio = client.tts.convert(text="Hello [whisper] this is a secret [/whisper]")  # inline tags go straight into the text

Pitfalls and Complaints

High GPU barrier: Self-hosting needs at least 12GB VRAM, 24GB recommended. An RTX 3060 takes 15s for 1 min of audio.
"Open Source" with a catch: Commercial use of model weights requires a separate license—don't assume it's free for your product just because you can download it.
English naturalness: It won benchmarks, but to the human ear, some English voices aren't quite as natural as ElevenLabs yet.
Tag conflicts: Don't overdo it. Putting [whisper] and [excited] together might confuse the model.
Strict free tier: 7 mins/month is fine for a demo, but not for real work.
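
One way to guard against the tag-conflict pitfall is a small pre-flight check on your scripts. The helper below is a hypothetical lint, not an SDK feature: it flags places where two or more tags sit back to back with nothing but whitespace between them, on the assumption that stacked tags are the risky case.

```python
import re

# Two or more [tags] separated only by whitespace.
STACKED_TAGS = re.compile(r"(?:\[[^\[\]]+\]\s*){2,}")


def find_stacked_tags(text: str) -> list[str]:
    """Return each run of adjacent tags found in text."""
    return [m.group(0).strip() for m in STACKED_TAGS.finditer(text)]


risky = "[whisper] [excited] We won the contract!"
fine = "[whisper] We won. [excited] Tell everyone!"
print(find_stacked_tags(risky))
print(find_stacked_tags(fine))
```

A check like this only catches adjacency; whether two particular tags actually clash is something you still have to judge by ear.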

Security and Privacy

Data Storage: API calls go through the cloud (fish.audio servers); self-hosting is entirely local.
Privacy Policy: Registered in Delaware, complying with US laws.
Voice Cloning Risk: 10 seconds is all it takes. While the platform has terms of use, abuse of the technology itself cannot be fully prevented.

Alternatives

Alternative	Pros	Cons
ElevenLabs	Mature ecosystem, stable quality, large voice marketplace	3-4x more expensive
OpenAI TTS	Great GPT ecosystem integration	Less expressive than S2
StyleTTS2	Fully free and open source	Performance lags behind S2
Bark	Free, supports non-speech sound effects	Quality lags behind S2
XTTS	Strong community (by Coqui)	Project is discontinued

For Investors

Market Analysis

Market Size: Global TTS market ~$4B in 2025, projected to reach $7.6-8.3B by 2030.
Growth Rate: CAGR estimates range from 12-16% (conservative) to 23% (optimistic).
Drivers: AI Agents needing to speak, the explosion of podcasts/audiobooks, accessibility needs, and automotive voice interaction.
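
As a sanity check on the figures above, the growth rate implied by the market-size numbers can be computed directly. The sketch assumes a five-year span (2025 to 2030) and uses the standard compound annual growth rate formula; the function name is just for this example.

```python
def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate between two values."""
    return (end / start) ** (1 / years) - 1


# Market size in billions of USD, as quoted above.
low = cagr(4.0, 7.6, 5)
high = cagr(4.0, 8.3, 5)
print(f"Implied CAGR: {low:.1%} to {high:.1%}")
```

The implied range falls inside the conservative 12-16% band, so the 2030 projection quoted here corresponds to the conservative estimates rather than the 23% optimistic one.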

Competitive Landscape

Tier	Players	Positioning
Top Tier	Microsoft, Google, ElevenLabs	Cloud TTS Services
Mid Tier	OpenAI, MiniMax, ByteDance (Seed-TTS)	AI-Native TTS
New Entrants	Fish Audio	Open Source Community + API Service

Fish Audio's position is unique—using open source to capture the market and API for monetization, similar to Hugging Face's path with NLP models.

Timing Analysis

Why Now?: The AI Agent wave has created a massive need for "AI that can talk", and falling inference costs make 4.4B-parameter models commercially viable.
Tech Maturity: Benchmarks show it has surpassed closed-source solutions, though production stability needs time to be proven.
Market Readiness: ElevenLabs has already educated the market; users know what AI voice can do but find it too expensive—Fish Audio is perfectly positioned to capture this demand.

Team Background

Founder: Shijia Liao (Leng Yue), former NVIDIA researcher with 7+ years in AI audio.
Core Team: 4-person Gen Z founding team.
Track Record: Projects like So-VITS-SVC and GPT-SoVITS have massive influence in the AI synthesis community.
Chief Scientist: Former NVIDIA + University of Maryland researcher.

Funding Status

Known Funding: HF0 incubator ($1M uncapped SAFE for 5%) + at least one pre-HF0 round.
Investors: Specific institutions not disclosed.
ARR: $5M+ (April 2025).
MAU: 420,000+.

Conclusion

Fish Audio S2 is one of the most significant AI voice releases of March 2026. With open-source code, top-tier benchmarks, and aggressive pricing, it is directly challenging ElevenLabs' dominance.

User Type	Recommendation
Developers	✅ Highly Recommended. Leading tech, cheap API, great SDK. Watch the weight license.
Product Managers	✅ Recommended. Inline Tags are a brilliant design choice; the pricing strategy is smart.
Bloggers	✅ Great for content. Gen Z founder vs. Giants + Open Source debate + AI Ethics.
Early Adopters	✅ Recommended. Start with the free tier; the $11/mo Plus tier is enough for most creators.
Investors	✅ Worth watching. $4B market, $5M ARR, 420K MAU—the growth flywheel is spinning.

Resource Links

Resource	Link
Official Site	fish.audio
S2 Product Page	fish.audio/s2
GitHub	fishaudio/fish-speech
HuggingFace	fishaudio/s2-pro
Technical Report	arxiv 2603.08823
API Docs	docs.fish.audio
Python SDK	fishaudio/fish-audio-python
ProductHunt	producthunt.com/products/fish-audio-s2
Twitter	@FishAudio
Founder LinkedIn	Shijia Liao
Blog	fish.audio/blog

2026-03-16 | Trend-Tracker v7.3
