📚 More on this topic: Qwen 2.5 Guide · VRAM Requirements · Best Models for Coding · Llama 4 Guide

Qwen3-4B matches Qwen 2.5-72B on benchmarks. Read that again.

A model that fits in 3GB of VRAM competes with one that needs 43GB. That’s not marketing; it’s the actual benchmark data from Alibaba’s technical report, and it reflects a generational leap in what small models can do.

Qwen3 is the strongest open model family for local AI right now. Eight sizes from 0.6B to 235B, two MoE models that punch way above their weight, a /think toggle no other family implements as cleanly, and everything under Apache 2.0. This guide covers every model, what hardware you need, and which one to pick.


The Full Qwen3 Family

Dense Models

| Model | Params | VRAM (Q4) | Ollama | Best For |
|---|---|---|---|---|
| Qwen3-0.6B | 0.6B | ~1GB | ollama run qwen3:0.6b | Phones, Raspberry Pi, edge devices |
| Qwen3-1.7B | 1.7B | ~1.5GB | ollama run qwen3:1.7b | Embedded, Pi, basic classification |
| Qwen3-4B | 4B | ~3GB | ollama run qwen3:4b | Best quality/VRAM ratio in any model family |
| Qwen3-8B | 8B | ~6GB | ollama run qwen3:8b | General-purpose on 8-12GB GPUs |
| Qwen3-14B | 14B | ~9GB | ollama run qwen3:14b | Sweet spot for instruction-following |
| Qwen3-32B | 32B | ~20GB | ollama run qwen3:32b | Best dense model for 24GB GPUs |

MoE Models

| Model | Total / Active | VRAM (Q4) | Ollama | Best For |
|---|---|---|---|---|
| Qwen3-30B-A3B | 30B / 3B active | ~18GB* | ollama run qwen3:30b-a3b | Hidden gem; outperforms QwQ-32B |
| Qwen3-235B-A22B | 235B / 22B active | ~143GB | ollama run qwen3:235b-a22b | Flagship; competes with DeepSeek-R1, o1 |

*The 30B-A3B needs ~18GB at Q4, but Unsloth’s aggressive quants can squeeze it into 8GB. At Q4_K_M you’ll want 16-18GB.
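
If you want to sanity-check any VRAM figure in this guide, the back-of-envelope math below works well. It's a rough approximation, not an official Qwen number, but it lines up with the tables:

# Q4 weights take roughly params (in billions) x 0.5 GB,
# plus ~15-25% on top for KV cache and runtime overhead.
#
# Qwen3-32B:      32 x 0.5 = 16 GB weights  -> ~20 GB total
# Qwen3-30B-A3B:  30 x 0.5 = 15 GB weights  -> ~18 GB total
#
# Note the MoE still loads ALL 30B params into memory; only the
# compute per token drops to the 3B active experts.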

Coding & Vision Variants

| Model | Type | Notes |
|---|---|---|
| Qwen3-Coder 480B-A35B | MoE coding | Agentic coding monster, 256K context, needs 250GB+ |
| Qwen3-Coder 30B-A3B | MoE coding | Practical coding model, fits in consumer hardware |
| Qwen3-VL 4B/8B/32B | Vision | Image understanding at each VRAM tier |
| Qwen3-VL 30B-A3B | Vision MoE | Multimodal with MoE efficiency |

All models are Apache 2.0 licensed: genuinely open for commercial and personal use.


The Killer Feature: /think and /no_think

Every Qwen3 model has a built-in reasoning toggle. No other model family does this as cleanly.

/think mode: chain-of-thought reasoning. The model works through the problem step by step before answering. Use for math, coding, logic puzzles, complex analysis.

/no_think mode: fast response. Skip the reasoning chain and answer directly. Use for chat, simple questions, creative writing, translation.

You: /think What's the time complexity of mergesort and why?

Qwen3: <think>
Mergesort divides the array in half recursively...
Each level of recursion does O(n) work for merging...
There are log(n) levels...
</think>

O(n log n). Here's why: mergesort splits the array in half
at each level (log n levels), and each level merges all n
elements. That gives n * log(n) total operations...

You: /no_think Translate "good morning" to French

Qwen3: Bonjour

The toggle works per-turn in multi-turn conversations. You can use /think for one question and /no_think for the next. The model follows the most recent instruction.

This matters for local AI because you can use one model for both quick chat and deep reasoning. There's no need to swap between a fast model and a thinking model.
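
The toggle also works over Ollama's HTTP API: just prefix the message content, exactly as in the chat examples above. A minimal sketch using the standard /api/chat endpoint (the model tag and prompt are only examples; response fields vary slightly by Ollama version):

curl http://localhost:11434/api/chat -d '{
  "model": "qwen3:8b",
  "messages": [
    {"role": "user", "content": "/no_think Translate \"good morning\" to French"}
  ],
  "stream": false
}'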


Which Model for Which Hardware

This is the question that matters. Here’s the honest breakdown:

4GB VRAM (What can you run?)

| Model | Quant | VRAM Used | Speed | Verdict |
|---|---|---|---|---|
| Qwen3-0.6B | Q4_K_M | ~1GB | 40-60 tok/s | Fast but limited |
| Qwen3-1.7B | Q4_K_M | ~1.5GB | 30-45 tok/s | Better quality, still quick |
| Qwen3-4B | Q4_K_M | ~3GB | 20-35 tok/s | Best you can get at this tier |

Pick: Qwen3-4B. It fits, it’s fast, and its benchmark scores embarrass models 10x its size.

8GB VRAM (What can you run?)

| Model | Quant | VRAM Used | Speed | Verdict |
|---|---|---|---|---|
| Qwen3-4B | Q8_0 | ~5GB | 25-35 tok/s | Higher quality from same model |
| Qwen3-8B | Q4_K_M | ~6GB | 20-30 tok/s | Solid general-purpose |
| Qwen3-30B-A3B | 1.78-bit | ~8GB | 15-25 tok/s | MoE gem, punches way above its weight |

Pick: Qwen3-8B for general use, or Qwen3-30B-A3B (Unsloth quant) if you want MoE magic. The 30B-A3B outperforms QwQ-32B despite activating only 3B parameters per token.

12GB VRAM (What can you run?)

| Model | Quant | VRAM Used | Speed | Verdict |
|---|---|---|---|---|
| Qwen3-8B | Q6_K | ~8GB | 22-30 tok/s | Higher quant, better quality |
| Qwen3-14B | Q4_K_M | ~9GB | 18-25 tok/s | Best instruction-following |

Pick: Qwen3-14B. This is the sweet spot. Strong instruction-following, great at structured output and tool calling, and it fits comfortably with room for context.

16GB VRAM (What can you run?)

| Model | Quant | VRAM Used | Speed | Verdict |
|---|---|---|---|---|
| Qwen3-14B | Q6_K | ~12GB | 18-25 tok/s | Higher quant = better quality |
| Qwen3-30B-A3B | Q4_K_M | ~18GB | Tight fit | MoE model, needs most of the VRAM |
| Qwen3-32B | Q4_K_M | ~20GB | Won’t fit | Over budget, needs 24GB |

Pick: Qwen3-14B at Q6_K. You get the best quality-per-VRAM at this tier. The 30B MoE at Q4 is tight; you’ll sacrifice context window.

24GB VRAM (What can you run?)

| Model | Quant | VRAM Used | Speed | Verdict |
|---|---|---|---|---|
| Qwen3-32B | Q4_K_M | ~20GB | 15-22 tok/s | Best dense model for 24GB |
| Qwen3-30B-A3B | Q4_K_M | ~18GB | 20-30 tok/s | Faster inference, headroom for context |

Pick: Qwen3-32B for best quality, Qwen3-30B-A3B for faster inference. The 32B is a dense model: all 32B params work on every token. The 30B MoE is faster per token (only 3B active), but the 32B edges it on raw quality.
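
Why is the MoE faster per token when both downloads are a similar size? A back-of-envelope compute estimate, using the standard ~2 × params FLOPs-per-token approximation (real local speeds are mostly memory-bandwidth-bound, so treat this as a sketch):

# Dense Qwen3-32B:    ~2 x 32B = 64 GFLOPs per token
# Qwen3-30B-A3B MoE:  ~2 x 3B  =  6 GFLOPs per token (~10x less compute)
# Bandwidth and overhead narrow that in practice, which is why the
# tables show 20-30 vs 15-22 tok/s rather than a 10x gap.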

CPU Only

| Model | Quant | RAM Needed | Speed | Verdict |
|---|---|---|---|---|
| Qwen3-4B | Q4_K_M | ~4GB | 5-8 tok/s | Usable for chat |
| Qwen3-8B | Q4_K_M | ~8GB | 3-5 tok/s | Slow but functional |

CPU inference is always slow, but Qwen3-4B at Q4 is genuinely usable at 5-8 tok/s. That's acceptable for interactive chat on a decent CPU.
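
Two knobs worth trying on CPU, assuming your Ollama version exposes the num_thread and num_ctx options (both are documented Modelfile parameters): set threads near your physical core count (8 below is just an example) and keep the context window modest to save RAM.

curl http://localhost:11434/api/generate -d '{
  "model": "qwen3:4b",
  "prompt": "/no_think Summarize why MoE models are fast, in two sentences.",
  "stream": false,
  "options": { "num_thread": 8, "num_ctx": 4096 }
}'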


Qwen3 vs Qwen 2.5

If you’re on Qwen 2.5, here’s what changed:

| | Qwen 2.5 | Qwen3 |
|---|---|---|
| Architecture | Dense only | Dense + MoE |
| 4B quality | Decent for size | Rivals Qwen 2.5-72B |
| Reasoning toggle | No | /think and /no_think |
| Training data | 18T tokens | 36T tokens |
| Languages | ~29 | 119 |
| MoE options | None | 30B-A3B, 235B-A22B |
| License | Apache 2.0 | Apache 2.0 |
| Tool calling | Good | Better (agentic RL training) |
| Coding | Good | Qwen3-Coder is top-tier |

The biggest upgrade is quality at every size. Qwen3-4B scores 83.7 on MMLU-Redux and 97.0 on MATH-500, while Qwen 2.5-7B scored 74.2 on the original MMLU. Those aren't identical test variants, but the direction is unmistakable: the 4B model, trained on 2x the data with a better architecture, now beats the previous-gen 7B comfortably.

The MoE models are entirely new. Qwen 2.5 had no MoE option. The 30B-A3B is particularly impressive: it gives you 30B-scale knowledge with 3B-scale inference speed.


Benchmarks

Dense Models โ€” Quality by Size

| Model | MMLU-Redux | MATH-500 | Arena-Hard | LiveCodeBench |
|---|---|---|---|---|
| Qwen3-4B | 83.7 | 97.0 | — | — |
| Qwen3-8B | 84.9 | 97.4 | — | — |
| Qwen3-14B | 86.7 | 97.4 | — | — |
| Qwen3-32B | 87.8 | 97.4 | 81.5 | 52.7 |

Flagship Comparisons

| Model | Arena-Hard | AIME'24 | AIME'25 | LiveCodeBench |
|---|---|---|---|---|
| Qwen3-235B-A22B | 95.6 | 85.7 | 81.4 | 70.7 |
| DeepSeek-R1 | 92.3 | 79.8 | 70.0 | 65.9 |
| o1 | 91.5 | 79.2 | — | — |
| Grok-3 | 93.3 | 83.9 | 68.0 | — |

Qwen3-235B-A22B beats DeepSeek-R1 on Arena-Hard and both AIME math benchmarks, and tops o1 everywhere o1 has a reported score. That’s the flagship, though: you need ~143GB of VRAM to run it.


Qwen3-Coder: The Coding Specialist

Two coding-specific models:

Qwen3-Coder 480B-A35B: the agentic coding monster. 480B total params, 35B active, 256K native context. Comparable to Claude Sonnet 4 on agentic coding benchmarks. Needs 250GB+ of VRAM, which is datacenter territory.

Qwen3-Coder 30B-A3B: the practical option. Same MoE architecture as Qwen3-30B-A3B but trained on 7.5T tokens of code-heavy data. Fits on consumer hardware. Comes with an open-source CLI tool (Qwen Code) for agentic coding workflows.

ollama run qwen3-coder:30b-a3b   # Practical coding model

If you’re using local LLMs for coding, the Coder 30B-A3B is the best option that fits on consumer hardware right now.
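
If you'd rather script it than chat interactively, the standard Ollama generate endpoint works with the same tag (the prompt here is just an example):

curl http://localhost:11434/api/generate -d '{
  "model": "qwen3-coder:30b-a3b",
  "prompt": "Write a Python function that validates an ISO 8601 date string.",
  "stream": false
}'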


Getting Started

The fastest path from zero to running Qwen3:

# Install Ollama if you haven't
curl -fsSL https://ollama.com/install.sh | sh

# Pick your model based on VRAM
ollama run qwen3:4b        # 4-8GB VRAM: start here
ollama run qwen3:8b        # 8-12GB VRAM
ollama run qwen3:14b       # 12-16GB VRAM: sweet spot
ollama run qwen3:32b       # 24GB VRAM
ollama run qwen3:30b-a3b   # 16-24GB VRAM: MoE gem

Try the reasoning toggle:

>>> /think Explain why quicksort is O(n log n) average but O(n²) worst case
>>> /no_think What's the capital of France?

If you’re coming from Qwen 2.5, the upgrade is worth it at every size. If you’re coming from Llama 3, Qwen3 now beats it on most benchmarks at equivalent parameter counts. If you’re on a budget GPU and want the absolute best model for your VRAM, Qwen3 is the answer.