
Qwen3.5 Small

LLMs

0.8B-9B native multimodal w/ more intelligence, less compute

💡 Qwen3.5 Small is the latest breakthrough in the large language model series developed by the Qwen team at Alibaba Cloud, focusing on high-density intelligence for edge devices.

"Qwen3.5 Small is like a pocket-sized Bruce Lee: compact, lightning-fast, and capable of knocking out heavyweights ten times its size with pure technical precision."

30-Second Verdict
What is it: Alibaba's 4 new edge models (0.8B-9B), where the 9B version outperforms 120B giants in multiple benchmarks.
Worth attention: A must-watch. It represents the industry shift from 'parameter hoarding' to 'intelligence density.' Open source under Apache 2.0.
Hype: 8/10 | Utility: 9/10 | Votes: 301


Qwen3.5 Small: 9B Parameters Crushing 120B—The "iPhone Moment" for Edge AI is Here

2026-03-04 | ProductHunt (301 votes) | GitHub | HuggingFace


30-Second Quick Judgment

What is it?: A series of 4 "small" models (0.8B/2B/4B/9B) released by Alibaba's Tongyi Qwen team. They run on phones and laptops, natively support text+image+video, and the 9B model outperforms OpenAI's GPT-OSS-120B on several benchmarks.

Is it worth your attention?: Absolutely. This isn't just "another small model"—it represents a fundamental shift in the industry: moving from "stacking parameters" to "increasing density." Even Elon Musk noted its "impressive intelligence density." It's open-source under Apache 2.0 and costs nothing to start using.


Three Questions That Matter

Is it for me?

Who is the target user?:

  • Developers who want to run AI locally (no API fees, no data in the cloud).
  • Teams building edge/embedded AI products (mobile apps, IoT, automotive).
  • Indie hackers needing multilingual + multimodal capabilities.
  • Privacy-sensitive enterprise and individual users.

Is that me?: If you fit any of these scenarios, yes:

  • You want a "private ChatGPT" on your Mac/PC.
  • You're building an AI product but are getting crushed by API costs.
  • You need automated workflows for documents, images, or videos.
  • You want to add local AI to an app without relying on the cloud.

When would I use it?:

  • Local Code Assistant → Use the 9B model with OpenCode CLI for lightweight programming.
  • Document Parsing → The 9B model scored 87.7 on OmniDocBench, crushing everything in its class.
  • Mobile Video Understanding → The 0.8B/2B models can analyze 60-second videos offline on an iPhone.
  • Privacy-Sensitive Tasks → Data never leaves your machine.

Is it useful?

| Dimension | Benefit | Cost |
|---|---|---|
| Time | Eliminates API latency; local inference at 80+ tok/s | 30-60 mins for initial setup |
| Money | Completely free; saves $240-600/year in API subscriptions | Requires a 16GB VRAM GPU or 32GB RAM Mac |
| Effort | One-click run with `ollama run qwen3.5:9b` | Thinking mode and tool calling require some troubleshooting |

ROI Judgment: If you have a 16GB GPU or an M-series Mac, this is essentially "free" productivity—local, powerful, and no cost. However, if you expect it to replace Claude Opus 4.6 or GPT-5 for complex reasoning, you'll be disappointed. The highest ROI comes from using it as a "local execution layer" paired with a cloud-based "planning layer."

Is it fun?

The "Wow" Factor:

  • 9B vs 120B: The numbers alone are exciting. Beating a model 13x its size on benchmarks proves that architectural innovation beats raw parameter count.
  • Runs on a Phone: The 0.8B model runs on an iPhone. Imagine a truly offline AI assistant.
  • One Model for Everything: Text, images, and video all use the same weights—no need to stitch different models together.

The "Aha!" Moments:

"First model that runs fast locally and it could actually be useful for some straightforward tasks." — @Joseph_Richard7

"I've started Copaw locally using Ollama with the Qwen 3.5-9B model in a CPU-only setup. It works surprisingly well on 32GB of RAM." — @olekslev69

Real Talk/Complaints:

"Qwen 3.5 9B running on a 16GB Mac mini. Took about 32 seconds to respond to me saying 'hi'. lol. unusable." — @DNormandin1234

"Just gave Qwen 3.5 9B a try, and it spent like 7 paragraphs of thinking trying to understand a simple sentence..." — @thetechnocrat0


For Indie Hackers

Tech Stack

  • Architecture: Hybrid Attention = Gated Delta Networks (Linear Attention) + Full Attention, in a 3:1 ratio.
  • MoE: Sparse Mixture of Experts; the 35B-A3B version only activates 8.6% of parameters.
  • Multimodal: Early Fusion training, DeepStack Vision Transformer, Conv3d for video processing.
  • Training: Scaled Reinforcement Learning (RL), not just traditional SFT.
  • Inference Frameworks: vLLM / SGLang / llama.cpp / Ollama / MLX.

Core Implementation

Qwen3.5's breakthrough lies in replacing 75% of attention layers with Gated DeltaNet. Traditional Transformer attention is O(n^2) complexity; DeltaNet drops it to O(n). Each linear attention layer compresses the input sequence into a fixed-size state, using a gated decay mechanism from Mamba2 and hidden state updates from the Delta Rule. One full attention layer is kept every 4 layers to maintain "associative memory."

Result: Decoding speed is 8.6x faster than Qwen3-Max at 32K context, and 19x faster at 256K.
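The recurrence described above can be sketched in a few lines of NumPy. This is a simplified, single-head toy under stated assumptions, not Qwen's actual implementation: `alpha` stands in for the Mamba2-style gated decay and `beta` for the delta-rule write strength, and keys are L2-normalized so each erase step is a contraction. The point it illustrates is the O(n) cost: the state stays a fixed d×d matrix no matter how long the sequence gets.

```python
import numpy as np

def gated_delta_step(S, q, k, v, alpha, beta):
    """One recurrent step of a (toy) gated delta rule.

    S     : (d, d) fixed-size state -- replaces a growing KV cache
    q,k,v : (d,)   query/key/value for the current token
    alpha : gated decay in (0, 1), Mamba2-style
    beta  : delta-rule write strength in (0, 1)
    """
    # Decay old state, erase the value previously bound to k, write the new one:
    # S_t = alpha * S_{t-1} (I - beta k k^T) + beta v k^T
    S = alpha * (S - beta * np.outer(S @ k, k)) + beta * np.outer(v, k)
    return S, S @ q  # per-token output

# Process a sequence: total cost is O(n * d^2), independent of context length.
rng = np.random.default_rng(0)
d, n = 8, 32
S = np.zeros((d, d))
for t in range(n):
    q, k, v = rng.normal(size=(3, d))
    k = k / np.linalg.norm(k)  # normalized keys keep the erase stable
    S, o = gated_delta_step(S, q, k, v, alpha=0.95, beta=0.5)

print(S.shape, o.shape)  # state stays (8, 8) however long the sequence runs
```

Compare this with softmax attention, where every new token attends over all previous keys and values, so both memory and per-token compute grow with context length.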

Open Source Status

  • License: Apache 2.0 (Commercial use, modification, and distribution allowed).
  • Weights: Available on HuggingFace + ModelScope (Instruct and Base versions).
  • Ecosystem: Over 180,000 derivative models globally—more than double its closest competitor.
  • Difficulty to Build Yourself: High. The hybrid DeltaNet + MoE architecture requires deep systems engineering and massive data. However, it's perfect for fine-tuning.

Business Model

  • Free Model: Apache 2.0, use it however you want.
  • Alibaba's Monetization: Charging for Alibaba Cloud API calls + Cloud Infrastructure. Cloud revenue grew 34% YoY in Q2, with AI product revenue seeing 8 consecutive quarters of double-digit growth.
  • Strategy: The classic "Open source the ecosystem → Ecosystem feeds the Cloud" play, similar to Meta's Llama strategy.

Giant Risk

Qwen is a product of a giant. For indie hackers building on Qwen:

  • The Good: Apache 2.0 means you won't be "cut off." Even if Alibaba stops development, the community can take over.
  • The Bad: Google (Gemma), Meta (Llama), and OpenAI (GPT-OSS) are all in this race. The window for model differentiation is very narrow.
  • Advice: Don't bet on a single model; build your architecture to allow for easy model switching.
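That advice can be made concrete with a thin adapter layer. A minimal sketch (the class and function names here are hypothetical, not from any library; the stub adapters return labeled strings where a real app would call an actual client such as a local Ollama server or a hosted API):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ChatBackend:
    """One entry per model/provider; swap models by editing config, not call sites."""
    name: str
    complete: Callable[[str], str]

# Stub adapters -- in a real app each wraps a different client behind the
# same (prompt -> text) signature.
def _local_qwen(prompt: str) -> str:
    return f"[local] {prompt}"      # stand-in for a local Ollama call

def _cloud_fallback(prompt: str) -> str:
    return f"[cloud] {prompt}"      # stand-in for a hosted API call

BACKENDS = {
    "qwen3.5:9b": ChatBackend("qwen3.5:9b", _local_qwen),
    "cloud": ChatBackend("cloud", _cloud_fallback),
}

def ask(prompt: str, backend: str = "qwen3.5:9b") -> str:
    """Application code only ever calls ask(); switching models is one config change."""
    return BACKENDS[backend].complete(prompt)
```

With this shape, dropping Qwen for Gemma or Llama later means registering one new adapter, not rewriting every call site.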

For Product Managers

Pain Point Analysis

  • Problem Solved: Enterprises and devs need powerful AI on the edge/locally, but big models are too heavy and small models are too "dumb."
  • How big is the pain?: High-frequency demand. By 2026, over 2 billion smartphones will run local SLMs. 75% of enterprise AI deployments use local models for sensitive data. Edge AI is the fastest-growing segment (27.25% CAGR).

User Personas

| Persona | Scenario | Which one to pick? |
|---|---|---|
| Mobile App Dev | Embedding offline AI in iOS/Android | 0.8B / 2B |
| Full-stack Indie Hacker | Local AI Assistant / Code Copilot | 9B |
| Enterprise IT | Internal doc parsing, compliance audits | 4B / 9B |
| AI Researcher | Rapid prototyping, fine-tuning experiments | 0.8B / 2B |

Feature Breakdown

| Feature | Type | Description |
|---|---|---|
| Native Multimodal | Core | Not stitched together; trained via early fusion |
| 262K Context Window | Core | Available even in the 2B model; rare for small models |
| 201 Language Support | Core | 248K vocabulary covers almost everything |
| Multi-Token Prediction | Core | Speeds up inference |
| Pixel-level UI Interaction | Bonus | Can navigate desktop/mobile UIs |
| Thinking Mode (CoT) | Bonus | Off by default, can be enabled manually |

Competitive Differentiation

| vs | Qwen3.5-9B | GPT-OSS-120B | Gemma 3 27B | Llama 4 |
|---|---|---|---|---|
| Parameters | 9B | 120B | 27B | Various |
| GPQA Diamond | 81.7 | 71.5 | 42.4 | - |
| MMMU-Pro | 70.1 | 59.7 | - | - |
| Local Run | Laptop | Needs cluster | Single GPU | Single GPU |
| Multimodal | Native fusion | Text-heavy | Vision-capable | Vision-capable |
| License | Apache 2.0 | Restricted | Restricted | Restricted |

Key Takeaways

  1. "Less is More" Positioning: Instead of claiming to be the "biggest," they claim to be "smarter and more efficient," hitting a real market need.
  2. Aggressive Release Cadence: 9 models in 16 days creates massive exposure and keeps the conversation going.
  3. Layered Model Matrix: Coverage from 0.8B to 397B, with each size mapped to a specific deployment scenario.
  4. Open Source as Marketing: Apache 2.0 allows global devs to try it for free, which eventually drives Alibaba Cloud revenue.

For Tech Bloggers

Founder/Team Story

  • Key Figure: Junyang Lin, Qwen Technical Lead.
  • Background: Joined Alibaba in 2019, joined the Qwen team in April 2023.
  • The Drama: Just one day after the Qwen3.5 Small launch (March 3rd), Junyang Lin announced he was "stepping down" on X. Colleagues called it "the end of an era." This is the perfect "hook" for a story.
  • Team Size: 100+ developers. According to Bloomberg, they occupy two floors of an Alibaba building. They've released 357 models in less than two years.

Controversies/Discussion Angles

  1. Benchmark Padding?: Anthropic CEO Dario Amodei publicly questioned if Chinese models are "optimized for benchmarks but weaker in actual use."
  2. Complex Task "Collapse": Community tests found that on Master-level coding tasks, the ELO dropped from 1550 to 1194.
  3. The Departure: Why did the Tech Lead leave the day after a major launch? Was it a "mission accomplished" exit or internal friction?
  4. 9B vs 120B: Is it truly stronger, or were the benchmarks cherry-picked?

Hype Data

  • ProductHunt: 301 votes.
  • Elon Musk Like: "impressive intelligence density."
  • HuggingFace: 300M+ cumulative downloads, 180,000+ derivative models.
  • Media Coverage: Featured in VentureBeat, TechCrunch, CNBC, MarkTechPost, etc.

Content Suggestions

  • The "Drama" Angle: "The Tech Lead Quit the Day After Launch: The Internal AI War Behind Qwen3.5."
  • The "Deep Tech" Angle: "The Secret Weapon Behind 9B Beating 120B: What is Delta Network?" (Technical explainer on Gated DeltaNet).
  • The "Trend" Angle: Musk's endorsement + US-China AI rivalry + Edge AI as the next big thing.

For Early Adopters

Pricing Analysis

| Tier | Price | Features | Is it enough? |
|---|---|---|---|
| Open Source (Local) | Free | All features | Yes, if you have the hardware |
| Alibaba Cloud API | Pay-as-you-go | Cloud-based | Convenient but has latency |
| Third-party (Together AI) | ~$0.05-0.30/M tokens | Hosted inference | Best for those without a GPU |
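A quick way to sanity-check the table: compare hosted token prices against the flat local cost. The monthly volume below is an illustrative assumption, not measured usage; only the $0.05-0.30/M-token range comes from the table above.

```python
def hosted_monthly_cost(tokens_millions: float, price_per_million: float) -> float:
    """Monthly bill for hosted inference at a given $/M-token rate."""
    return tokens_millions * price_per_million

# Illustrative volume: 100M tokens/month at the quoted $0.05-0.30/M range.
low = hosted_monthly_cost(100, 0.05)
high = hosted_monthly_cost(100, 0.30)
print(f"hosted: ${low:.2f}-${high:.2f}/mo; local: $0/mo once the hardware exists")
```

At light-to-moderate volume the hosted tiers stay cheap; the local option wins once you already own the GPU or Mac, or once privacy rules out the cloud entirely.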

Getting Started Guide

  • Time to setup: 10 minutes (using Ollama).
  • Learning Curve: Low.
  • Steps:
    1. Install Ollama: curl -fsSL https://ollama.com/install.sh | sh
    2. Pull the model: ollama run qwen3.5:9b (downloads ~6.6GB).
    3. Start chatting—it's that simple.
    4. (Optional) Enable thinking mode: Use llama-server with --chat-template-kwargs '{"enable_thinking":true}'.
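Once step 2 completes, the model is also reachable programmatically. A minimal sketch using Ollama's local REST API (this assumes the Ollama server is running on its default port 11434 and that the `qwen3.5:9b` tag from step 2 has been pulled; the request shape follows Ollama's `/api/chat` endpoint):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint

def build_chat_body(prompt: str, model: str = "qwen3.5:9b") -> bytes:
    """Non-streaming single-turn payload in Ollama's /api/chat format."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # return one complete response instead of chunks
    }).encode()

def ask_local(prompt: str, model: str = "qwen3.5:9b") -> str:
    """Send one chat request to the local Ollama server and return the reply text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_chat_body(prompt, model),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]
```

Usage would then be `ask_local("Summarize this README in one sentence.")`; swapping in the 2B or 0.8B tag is just a different `model` argument.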

Pitfalls and Complaints

  1. Ollama tool calling is broken: Ollama sends Hermes-style JSON tool calls, but the model was trained on Qwen3-Coder XML, so the format mapping fails. The mismatch is tracked in an upstream issue.
  2. Thinking mode overthinks: It might spend 7 paragraphs "understanding" a simple question. Better to keep it off for daily tasks.
  3. Mac Mini CPU is slow: A 16GB Mac Mini on pure CPU takes 32 seconds for the first token. You need a GPU or Apple Silicon's Metal acceleration.
  4. MLX Framework KV Cache Crash: Apple Silicon users should watch out for mlx-lm bugs.

Safety and Privacy

  • Data Storage: Completely local, nothing goes to the cloud.
  • License: Apache 2.0, one of the most permissive licenses.
  • Censorship: As a Chinese model, there may be safety filters for certain topics.
  • Identity Confusion: Some reports of the model claiming to be "made by Google" before self-correcting in the Chain of Thought.

Alternatives

| Alternative | Advantage | Disadvantage |
|---|---|---|
| Gemma 3 27B | Google ecosystem, 140+ languages | Much weaker reasoning (GPQA Diamond is ~39 points lower) |
| Llama 4 Scout | Meta ecosystem, huge community | Multimodal isn't as native as Qwen's |
| Phi-4 (Microsoft) | Small and sharp, strong reasoning | Smaller ecosystem, license restrictions |
| Mistral 24B | European roots, stable general ability | No native multimodality |

For Investors

Market Analysis

  • SLM Market Size: $7.76B in 2023 → $20.7B by 2030 (15.1% CAGR).
  • Edge AI Growth: 27.25% CAGR, the fastest-growing AI deployment method.
  • Total Market: Global LLM market ~$100B in 2026, projected $179.9B by 2035.
  • Drivers: Tightening privacy laws + improved edge compute + API cost pressure + offline needs.

Competitive Landscape

| Tier | Players | Positioning |
|---|---|---|
| Top (Closed) | OpenAI, Anthropic, Google | Frontier large models |
| Top (Open) | Alibaba Qwen, Meta Llama | Open-source ecosystem leaders |
| Mid-tier | Mistral, Zhipu GLM | Differentiated positioning |
| Edge Specialists | Google Gemma, Microsoft Phi | Small-model optimization |
| New Entrant | Qwen3.5 Small | Filling the Qwen edge gap |

Timing Analysis

  • Why now?: 2026 is the SLM inflection point, with over 2 billion phones running local SLMs. New architectures like Gated DeltaNet make "small models beating big models" a reality.
  • Tech Maturity: Architectural innovations (DeltaNet + MoE) are proven, not just lab experiments.
  • Market Readiness: Ollama's monthly active users have surpassed 10 million; local AI infrastructure is mature.

Team Background

  • Parent Company: Alibaba Group (NYSE: BABA).
  • AI Investment: $53.2B over 3 years; single-quarter CapEx of 38.6 billion RMB.
  • Team Size: 100+ people, 357 models released in two years.
  • Track Record: World's largest open-source model family, 300M+ downloads.
  • Risk Signal: Technical Lead Junyang Lin resigned on March 3rd.

Financials

  • Alibaba Cloud Q2 Revenue: $5.59B (+34% YoY).
  • Annualized Run Rate: >$22B.
  • AI Product Revenue: Double-digit growth for 8 consecutive quarters.
  • Not a standalone startup: Qwen is a strategic weapon for Alibaba Cloud, not a separate fundraising entity.

Conclusion

The Bottom Line: Qwen3.5 Small is the most significant edge AI release of March 2026. It proves that "9B parameters beating 120B" isn't hype—it's a victory for architectural innovation. For indie hackers, this is the latest version of a "free lunch."

| User Type | Recommendation |
|---|---|
| Developers | Must try. `ollama run qwen3.5:9b`, 10 mins to start, Apache 2.0. Just don't expect it to replace Opus 4.6 for complex logic. |
| Product Managers | Worth watching. The "Small + Multimodal + Edge" combo sets a new SLM benchmark. |
| Bloggers | Great material. Musk's like + the Tech Lead's exit + 9B vs 120B is at least three articles. |
| Early Adopters | Give it a go. Completely free, runs on 6.6GB, but tool calling and thinking mode still have bugs. |
| Investors | Keep tracking. SLM + Edge AI is a certain trend, but Qwen isn't an investable startup; watch Alibaba's overall AI strategy. |

Resource Links

| Resource | Link |
|---|---|
| GitHub | https://github.com/QwenLM/Qwen3.5 |
| HuggingFace (9B) | https://huggingface.co/Qwen/Qwen3.5-9B |
| Ollama | https://ollama.com/library/qwen3.5:9b |
| Tech Blog | https://qwenlm.github.io/blog/qwen3.5/ |
| ProductHunt | https://www.producthunt.com/products/qwen3 |
| VentureBeat Report | https://venturebeat.com/technology/alibabas-small-open-source-qwen3-5-9b-beats-openais-gpt-oss-120b-and-can-run |
| TechCrunch (Lin Resignation) | https://techcrunch.com/2026/03/03/alibabas-qwen-tech-lead-steps-down-after-major-ai-push/ |

2026-03-04 | Trend-Tracker v7.3

One-line Verdict

Qwen3.5 Small is a milestone for Edge AI in 2026, achieving a performance leap through architectural innovation. It is a powerful, free local productivity tool that sets a new benchmark for SLMs.

FAQ

Frequently Asked Questions about Qwen3.5 Small

What is Qwen3.5 Small?
Alibaba's 4 new edge models (0.8B-9B), where the 9B version outperforms 120B giants in multiple benchmarks.

What are its main features?
Native multimodal support, a 262K long-context window, support for 201 languages, and Multi-Token Prediction for faster inference.

How much does it cost?
The open-source local version is completely free; the Alibaba Cloud API is pay-as-you-go; third-party hosting runs roughly $0.05-0.30/M tokens.

Who is it for?
Developers running local AI, edge/embedded AI teams, indie hackers, and privacy-conscious enterprise users.

What are the alternatives?
GPT-OSS-120B, Gemma 3 27B, Llama 4, Phi-4, and Mistral 24B.

Data source: ProductHunt, Mar 4, 2026