
Qwen3.5 Small

LLMs

0.8B-9B native multimodal w/ more intelligence, less compute

💡 Qwen3.5 Small is the latest breakthrough in the large language model series developed by the Qwen team at Alibaba Cloud, focusing on high-density intelligence for edge devices.

"Qwen3.5 Small is like a pocket-sized Bruce Lee: compact, lightning-fast, and capable of knocking out heavyweights ten times its size with pure technical precision."

30-Second Verdict
What is it: Alibaba's 4 new edge models (0.8B-9B), where the 9B version outperforms 120B giants in multiple benchmarks.
Worth attention: A must-watch. It represents the industry shift from 'parameter hoarding' to 'intelligence density.' Open source under Apache 2.0.
Hype: 8/10 | Utility: 9/10 | Votes: 301


Qwen3.5 Small: 9B Parameters Crushing 120B—The "iPhone Moment" for Edge AI is Here

2026-03-04 | ProductHunt (301 votes) | GitHub | HuggingFace


30-Second Quick Judgment

What is it?: A series of 4 "small" models (0.8B/2B/4B/9B) released by Alibaba's Tongyi Qwen team. They run on phones and laptops, natively support text+image+video, and the 9B model outperforms OpenAI's GPT-OSS-120B on several benchmarks.

Is it worth your attention?: Absolutely. This isn't just "another small model"—it represents a fundamental shift in the industry: moving from "stacking parameters" to "increasing density." Even Elon Musk noted its "impressive intelligence density." It's open-source under Apache 2.0 and costs nothing to start using.


Three Questions That Matter

Is it for me?

Who is the target user?:

  • Developers who want to run AI locally (no API fees, no data in the cloud).
  • Teams building edge/embedded AI products (mobile apps, IoT, automotive).
  • Indie hackers needing multilingual + multimodal capabilities.
  • Privacy-sensitive enterprise and individual users.

Is that me?: If you fit any of these scenarios, yes:

  • You want a "private ChatGPT" on your Mac/PC.
  • You're building an AI product but are getting crushed by API costs.
  • You need automated workflows for documents, images, or videos.
  • You want to add local AI to an app without relying on the cloud.

When would I use it?:

  • Local Code Assistant → Use the 9B model with OpenCode CLI for lightweight programming.
  • Document Parsing → The 9B model scored 87.7 on OmniDocBench, crushing everything in its class.
  • Mobile Video Understanding → The 0.8B/2B models can analyze 60-second videos offline on an iPhone.
  • Privacy-Sensitive Tasks → Data never leaves your machine.

Is it useful?

| Dimension | Benefit | Cost |
|---|---|---|
| Time | Eliminates API latency; local inference at 80+ tok/s | 30-60 mins for initial setup |
| Money | Completely free; saves $240-600/year in API subscriptions | Requires a 16GB VRAM GPU or 32GB RAM Mac |
| Effort | One-click run with `ollama run qwen3.5:9b` | Thinking mode and tool calling require some troubleshooting |

ROI Judgment: If you have a 16GB GPU or an M-series Mac, this is essentially "free" productivity—local, powerful, and no cost. However, if you expect it to replace Claude Opus 4.6 or GPT-5 for complex reasoning, you'll be disappointed. The highest ROI comes from using it as a "local execution layer" paired with a cloud-based "planning layer."

Is it fun?

The "Wow" Factor:

  • 9B vs 120B: The numbers alone are exciting. Beating a model 13x its size on benchmarks proves that architectural innovation beats raw parameter count.
  • Runs on a Phone: The 0.8B model runs on an iPhone. Imagine a truly offline AI assistant.
  • One Model for Everything: Text, images, and video all use the same weights—no need to stitch different models together.

The "Aha!" Moments:

"First model that runs fast locally and it could actually be useful for some straightforward tasks." — @Joseph_Richard7

"I've started Copaw locally using Ollama with the Qwen 3.5-9B model in a CPU-only setup. It works surprisingly well on 32GB of RAM." — @olekslev69

Real Talk/Complaints:

"Qwen 3.5 9B running on a 16GB Mac mini. Took about 32 seconds to respond to me saying 'hi'. lol. unusable." — @DNormandin1234

"Just gave Qwen 3.5 9B a try, and it spent like 7 paragraphs of thinking trying to understand a simple sentence..." — @thetechnocrat0


For Indie Hackers

Tech Stack

  • Architecture: Hybrid Attention = Gated Delta Networks (Linear Attention) + Full Attention, in a 3:1 ratio.
  • MoE: Sparse Mixture of Experts; the 35B-A3B version only activates 8.6% of parameters.
  • Multimodal: Early Fusion training, DeepStack Vision Transformer, Conv3d for video processing.
  • Training: Scaled Reinforcement Learning (RL), not just traditional SFT.
  • Inference Frameworks: vLLM / SGLang / llama.cpp / Ollama / MLX.

Core Implementation

Qwen3.5's breakthrough lies in replacing 75% of attention layers with Gated DeltaNet. Traditional Transformer attention is O(n^2) complexity; DeltaNet drops it to O(n). Each linear attention layer compresses the input sequence into a fixed-size state, using a gated decay mechanism from Mamba2 and hidden state updates from the Delta Rule. One full attention layer is kept every 4 layers to maintain "associative memory."

Result: Decoding speed is 8.6x faster than Qwen3-Max at 32K context, and 19x faster at 256K.
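The recurrence described above can be sketched in a few lines of NumPy. This is a simplified, single-head toy under stated assumptions, not Qwen's actual implementation: `alpha` stands in for the Mamba2-style gated decay and `beta` for the delta-rule write strength, and keys are L2-normalized so each erase step is a contraction. The point it illustrates is the O(n) cost: the state stays a fixed d×d matrix no matter how long the sequence gets.

```python
import numpy as np

def gated_delta_step(S, q, k, v, alpha, beta):
    """One recurrent step of a (toy) gated delta rule.

    S     : (d, d) fixed-size state -- replaces a growing KV cache
    q,k,v : (d,)   query/key/value for the current token
    alpha : gated decay in (0, 1), Mamba2-style
    beta  : delta-rule write strength in (0, 1)
    """
    # Decay old state, erase the value previously bound to k, write the new one:
    # S_t = alpha * S_{t-1} (I - beta k k^T) + beta v k^T
    S = alpha * (S - beta * np.outer(S @ k, k)) + beta * np.outer(v, k)
    return S, S @ q  # per-token output

# Process a sequence: total cost is O(n * d^2), independent of context length.
rng = np.random.default_rng(0)
d, n = 8, 32
S = np.zeros((d, d))
for t in range(n):
    q, k, v = rng.normal(size=(3, d))
    k = k / np.linalg.norm(k)  # normalized keys keep the erase stable
    S, o = gated_delta_step(S, q, k, v, alpha=0.95, beta=0.5)

print(S.shape, o.shape)  # state stays (8, 8) however long the sequence runs
```

Compare this with softmax attention, where every new token attends over all previous keys and values, so both memory and per-token compute grow with context length.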

Open Source Status

  • License: Apache 2.0 (Commercial use, modification, and distribution allowed).
  • Weights: Available on HuggingFace + ModelScope (Instruct and Base versions).
  • Ecosystem: Over 180,000 derivative models globally—more than double its closest competitor.
  • Difficulty to Build Yourself: High. The hybrid DeltaNet + MoE architecture requires deep systems engineering and massive data. However, it's perfect for fine-tuning.

Business Model

  • Free Model: Apache 2.0, use it however you want.
  • Alibaba's Monetization: Charging for Alibaba Cloud API calls + Cloud Infrastructure. Cloud revenue grew 34% YoY in Q2, with AI product revenue seeing 8 consecutive quarters of double-digit growth.
  • Strategy: The classic "Open source the ecosystem → Ecosystem feeds the Cloud" play, similar to Meta's Llama strategy.

Giant Risk

Qwen is a product of a giant. For indie hackers building on Qwen:

  • The Good: Apache 2.0 means you won't be "cut off." Even if Alibaba stops development, the community can take over.
  • The Bad: Google (Gemma), Meta (Llama), and OpenAI (GPT-OSS) are all in this race. The window for model differentiation is very narrow.
  • Advice: Don't bet on a single model; build your architecture to allow for easy model switching.
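That advice can be made concrete with a thin adapter layer. A minimal sketch (the class and function names here are hypothetical, not from any library; the stub adapters return labeled strings where a real app would call an actual client such as a local Ollama server or a hosted API):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ChatBackend:
    """One entry per model/provider; swap models by editing config, not call sites."""
    name: str
    complete: Callable[[str], str]

# Stub adapters -- in a real app each wraps a different client behind the
# same (prompt -> text) signature.
def _local_qwen(prompt: str) -> str:
    return f"[local] {prompt}"      # stand-in for a local Ollama call

def _cloud_fallback(prompt: str) -> str:
    return f"[cloud] {prompt}"      # stand-in for a hosted API call

BACKENDS = {
    "qwen3.5:9b": ChatBackend("qwen3.5:9b", _local_qwen),
    "cloud": ChatBackend("cloud", _cloud_fallback),
}

def ask(prompt: str, backend: str = "qwen3.5:9b") -> str:
    """Application code only ever calls ask(); switching models is one config change."""
    return BACKENDS[backend].complete(prompt)
```

With this shape, dropping Qwen for Gemma or Llama later means registering one new adapter, not rewriting every call site.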

For Product Managers

Pain Point Analysis

  • Problem Solved: Enterprises and devs need powerful AI on the edge/locally, but big models are too heavy and small models are too "dumb."
  • How big is the pain?: High-frequency demand. By 2026, over 2 billion smartphones will run local SLMs. 75% of enterprise AI deployments use local models for sensitive data. Edge AI is the fastest-growing segment (27.25% CAGR).

User Personas

| Persona | Scenario | Which one to pick? |
|---|---|---|
| Mobile App Dev | Embedding offline AI in iOS/Android | 0.8B / 2B |
| Full-stack Indie Hacker | Local AI Assistant / Code Copilot | 9B |
| Enterprise IT | Internal doc parsing, compliance audits | 4B / 9B |
| AI Researcher | Rapid prototyping, fine-tuning experiments | 0.8B / 2B |

Feature Breakdown

| Feature | Type | Description |
|---|---|---|
| Native Multimodal | Core | Not stitched together; trained via early fusion |
| 262K Context Window | Core | Available even in the 2B model; rare for small models |
| 201 Language Support | Core | 248K vocabulary covers almost everything |
| Multi-Token Prediction | Core | Speeds up inference |
| Pixel-level UI Interaction | Bonus | Can navigate desktop/mobile UIs |
| Thinking Mode (CoT) | Bonus | Off by default, can be enabled manually |

Competitive Differentiation

| vs | Qwen3.5-9B | GPT-OSS-120B | Gemma 3 27B | Llama 4 |
|---|---|---|---|---|
| Parameters | 9B | 120B | 27B | Various |
| GPQA Diamond | 81.7 | 71.5 | 42.4 | - |
| MMMU-Pro | 70.1 | 59.7 | - | - |
| Local Run | Laptop | Needs cluster | Single GPU | Single GPU |
| Multimodal | Native fusion | Text-heavy | Vision-capable | Vision-capable |
| License | Apache 2.0 | Restricted | Restricted | Restricted |

Key Takeaways

  1. "Less is More" Positioning: Instead of claiming to be the "biggest," they claim to be "smarter and more efficient," hitting a real market need.
  2. Aggressive Release Cadence: 9 models in 16 days creates massive exposure and keeps the conversation going.
  3. Layered Model Matrix: Coverage from 0.8B to 397B, with each size mapped to a specific deployment scenario.
  4. Open Source as Marketing: Apache 2.0 allows global devs to try it for free, which eventually drives Alibaba Cloud revenue.

For Tech Bloggers

Founder/Team Story

  • Key Figure: Junyang Lin, Qwen Technical Lead.
  • Background: Joined Alibaba in 2019, joined the Qwen team in April 2023.
  • The Drama: Just one day after the Qwen3.5 Small launch (March 3rd), Junyang Lin announced he was "stepping down" on X. Colleagues called it "the end of an era." This is the perfect "hook" for a story.
  • Team Size: 100+ developers. According to Bloomberg, they occupy two floors of an Alibaba building. They've released 357 models in less than two years.

Controversies/Discussion Angles

  1. Benchmark Padding?: Anthropic CEO Dario Amodei publicly questioned if Chinese models are "optimized for benchmarks but weaker in actual use."
  2. Complex Task "Collapse": Community tests found that on Master-level coding tasks, the ELO dropped from 1550 to 1194.
  3. The Departure: Why did the Tech Lead leave the day after a major launch? Was it a "mission accomplished" exit or internal friction?
  4. 9B vs 120B: Is it truly stronger, or were the benchmarks cherry-picked?

Hype Data

  • ProductHunt: 301 votes.
  • Elon Musk Like: "impressive intelligence density."
  • HuggingFace: 300M+ cumulative downloads, 180,000+ derivative models.
  • Media Coverage: Featured in VentureBeat, TechCrunch, CNBC, MarkTechPost, etc.

Content Suggestions

  • The "Drama" Angle: "The Tech Lead Quit the Day After Launch: The Internal AI War Behind Qwen3.5."
  • The "Deep Tech" Angle: "The Secret Weapon Behind 9B Beating 120B: What is Delta Network?" (Technical explainer on Gated DeltaNet).
  • The "Trend" Angle: Musk's endorsement + US-China AI rivalry + Edge AI as the next big thing.

For Early Adopters

Pricing Analysis

| Tier | Price | Features | Is it enough? |
|---|---|---|---|
| Open Source (Local) | Free | All features | Yes, if you have the hardware |
| Alibaba Cloud API | Pay-as-you-go | Cloud-based | Convenient but has latency |
| Third-party (Together AI) | ~$0.05-0.30/M tokens | Hosted inference | Best for those without a GPU |
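A quick way to sanity-check the table: compare hosted token prices against the flat local cost. The monthly volume below is an illustrative assumption, not measured usage; only the $0.05-0.30/M-token range comes from the table above.

```python
def hosted_monthly_cost(tokens_millions: float, price_per_million: float) -> float:
    """Monthly bill for hosted inference at a given $/M-token rate."""
    return tokens_millions * price_per_million

# Illustrative volume: 100M tokens/month at the quoted $0.05-0.30/M range.
low = hosted_monthly_cost(100, 0.05)
high = hosted_monthly_cost(100, 0.30)
print(f"hosted: ${low:.2f}-${high:.2f}/mo; local: $0/mo once the hardware exists")
```

At light-to-moderate volume the hosted tiers stay cheap; the local option wins once you already own the GPU or Mac, or once privacy rules out the cloud entirely.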

Getting Started Guide

  • Time to setup: 10 minutes (using Ollama).
  • Learning Curve: Low.
  • Steps:
    1. Install Ollama: curl -fsSL https://ollama.com/install.sh | sh
    2. Pull the model: ollama run qwen3.5:9b (downloads ~6.6GB).
    3. Start chatting—it's that simple.
    4. (Optional) Enable thinking mode: Use llama-server with --chat-template-kwargs '{"enable_thinking":true}'.
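Once step 2 completes, the model is also reachable programmatically. A minimal sketch using Ollama's local REST API (this assumes the Ollama server is running on its default port 11434 and that the `qwen3.5:9b` tag from step 2 has been pulled; the request shape follows Ollama's `/api/chat` endpoint):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint

def build_chat_body(prompt: str, model: str = "qwen3.5:9b") -> bytes:
    """Non-streaming single-turn payload in Ollama's /api/chat format."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # return one complete response instead of chunks
    }).encode()

def ask_local(prompt: str, model: str = "qwen3.5:9b") -> str:
    """Send one chat request to the local Ollama server and return the reply text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_chat_body(prompt, model),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]
```

Usage would then be `ask_local("Summarize this README in one sentence.")`; swapping in the 2B or 0.8B tag is just a different `model` argument.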

Pitfalls and Complaints

  1. Ollama tool calling is broken: Ollama sends Hermes-style JSON tool calls, but the model was trained on Qwen3-Coder XML, so the format mapping fails. The mismatch is tracked in an upstream issue.
  2. Thinking mode overthinks: It might spend 7 paragraphs "understanding" a simple question. Better to keep it off for daily tasks.
  3. Mac Mini CPU is slow: A 16GB Mac Mini on pure CPU takes 32 seconds for the first token. You need a GPU or Apple Silicon's Metal acceleration.
  4. MLX Framework KV Cache Crash: Apple Silicon users should watch out for mlx-lm bugs.

Safety and Privacy

  • Data Storage: Completely local, nothing goes to the cloud.
  • License: Apache 2.0, one of the most permissive licenses.
  • Censorship: As a Chinese model, there may be safety filters for certain topics.
  • Identity Confusion: Some reports of the model claiming to be "made by Google" before self-correcting in the Chain of Thought.

Alternatives

| Alternative | Advantage | Disadvantage |
|---|---|---|
| Gemma 3 27B | Google ecosystem, 140+ languages | Much weaker reasoning (GPQA Diamond is ~39 points lower) |
| Llama 4 Scout | Meta ecosystem, huge community | Multimodal isn't as native as Qwen's |
| Phi-4 (Microsoft) | Small and sharp, strong reasoning | Smaller ecosystem, license restrictions |
| Mistral 24B | European roots, stable general ability | No native multimodality |

For Investors

Market Analysis

  • SLM Market Size: $7.76B in 2023 → $20.7B by 2030 (15.1% CAGR).
  • Edge AI Growth: 27.25% CAGR, the fastest-growing AI deployment method.
  • Total Market: Global LLM market ~$100B in 2026, projected $179.9B by 2035.
  • Drivers: Tightening privacy laws + improved edge compute + API cost pressure + offline needs.

Competitive Landscape

| Tier | Players | Positioning |
|---|---|---|
| Top (Closed) | OpenAI, Anthropic, Google | Frontier large models |
| Top (Open) | Alibaba Qwen, Meta Llama | Open-source ecosystem leaders |
| Mid-tier | Mistral, Zhipu GLM | Differentiated positioning |
| Edge Specialists | Google Gemma, Microsoft Phi | Small-model optimization |
| New Entrant | Qwen3.5 Small | Filling the Qwen edge gap |

Timing Analysis

  • Why now?: 2026 is the SLM inflection point, with over 2 billion phones running local SLMs. New architectures like Gated DeltaNet make "small models beating big models" a reality.
  • Tech Maturity: Architectural innovations (DeltaNet + MoE) are proven, not just lab experiments.
  • Market Readiness: Ollama's monthly active users have surpassed 10 million; local AI infrastructure is mature.

Team Background

  • Parent Company: Alibaba Group (NYSE: BABA).
  • AI Investment: $53.2B over 3 years; single-quarter CapEx of 38.6 billion RMB.
  • Team Size: 100+ people, 357 models released in two years.
  • Track Record: World's largest open-source model family, 300M+ downloads.
  • Risk Signal: Technical Lead Junyang Lin resigned on March 3rd.

Financials

  • Alibaba Cloud Q2 Revenue: $5.59B (+34% YoY).
  • Annualized Run Rate: >$22B.
  • AI Product Revenue: Double-digit growth for 8 consecutive quarters.
  • Not a standalone startup: Qwen is a strategic weapon for Alibaba Cloud, not a separate fundraising entity.

Conclusion

The Bottom Line: Qwen3.5 Small is the most significant edge AI release of March 2026. It proves that "9B parameters beating 120B" isn't hype—it's a victory for architectural innovation. For indie hackers, this is the latest version of a "free lunch."

| User Type | Recommendation |
|---|---|
| Developers | Must try. `ollama run qwen3.5:9b`, 10 mins to start, Apache 2.0. Just don't expect it to replace Opus 4.6 for complex logic. |
| Product Managers | Worth watching. The "Small + Multimodal + Edge" combo sets a new SLM benchmark. |
| Bloggers | Great material. Musk's like + the Tech Lead's exit + 9B vs 120B is at least three articles. |
| Early Adopters | Give it a go. Completely free, runs on 6.6GB, but tool calling and thinking mode still have bugs. |
| Investors | Keep tracking. SLM + Edge AI is a certain trend, but Qwen isn't an investable startup; watch Alibaba's overall AI strategy. |

Resource Links

| Resource | Link |
|---|---|
| GitHub | https://github.com/QwenLM/Qwen3.5 |
| HuggingFace (9B) | https://huggingface.co/Qwen/Qwen3.5-9B |
| Ollama | https://ollama.com/library/qwen3.5:9b |
| Tech Blog | https://qwenlm.github.io/blog/qwen3.5/ |
| ProductHunt | https://www.producthunt.com/products/qwen3 |
| VentureBeat Report | https://venturebeat.com/technology/alibabas-small-open-source-qwen3-5-9b-beats-openais-gpt-oss-120b-and-can-run |
| TechCrunch (Lin Resignation) | https://techcrunch.com/2026/03/03/alibabas-qwen-tech-lead-steps-down-after-major-ai-push/ |

2026-03-04 | Trend-Tracker v7.3

One-line Verdict

Qwen3.5 Small is a milestone for Edge AI in 2026, achieving a performance leap through architectural innovation. It is a powerful, free local productivity tool that sets a new benchmark for SLMs.

FAQ

Frequently Asked Questions about Qwen3.5 Small

What is Qwen3.5 Small?
Alibaba's 4 new edge models (0.8B-9B), where the 9B version outperforms 120B giants in multiple benchmarks.

What are its main features?
Native multimodal support, a 262K long-context window, support for 201 languages, and Multi-Token Prediction for faster inference.

How much does it cost?
The open-source local version is completely free; the Alibaba Cloud API is pay-as-you-go; third-party hosting runs roughly $0.05-0.30/M tokens.

Who is it for?
Developers running local AI, edge/embedded AI teams, indie hackers, and privacy-conscious enterprise users.

What are the alternatives?
GPT-OSS-120B, Gemma 3 27B, Llama 4, Phi-4, and Mistral 24B.

Data source: ProductHunt, Mar 4, 2026