Best Qwen 3.5 Models Ranked: Every Size, Every GPU, Every Quant
More on this topic: Qwen 3 Complete Guide | Qwen 3.5 Mac: MLX vs Ollama | VRAM Requirements | Best Local LLMs for Mac | llama.cpp vs Ollama vs vLLM
Alibaba dropped three Qwen 3.5 models on February 24, 2026, and the local AI community lost its mind. A 35B model that runs at 44 tok/s on a $450 GPU. A 27B dense model that matches DeepSeek-V3.2 on reasoning. A 122B MoE that beats GPT-5 mini on tool use by 30%. All Apache 2.0. All runnable on hardware you can buy today.
This guide covers all three models, what hardware they need, which quantization to pick, and the gotchas nobody tells you about until you’re staring at garbage output.
The three models at a glance
| Model | Total params | Active per token | Architecture | The pitch |
|---|---|---|---|---|
| 35B-A3B | 35B | 3B | MoE (256 experts, 9 active) | Fast. Fits 16GB VRAM. |
| 27B | 27B | 27B | Dense (all params active) | Smartest model under 230B. Single GPU at Q4. |
| 122B-A10B | 122B | 10B | MoE (256 experts, 9 active) | Multi-GPU only. Beats GPT-5 mini on tool use. |
All three share the same architecture innovation: Gated Delta Networks. Three out of every four layers use linear attention (O(n) scaling), with every fourth layer using full quadratic attention. The result is lower KV cache memory and faster decoding than standard transformers. They also share a 262K native context window (extendable to 1M via YaRN), native multimodal (text + image + video), and a 248K vocabulary covering 201 languages.
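The layer pattern is easy to sketch. Only the 3:1 linear-to-full ratio comes from the architecture description above; the 8-layer slice shown is arbitrary:

```python
# Layer pattern sketch: three linear-attention layers, then one
# full quadratic-attention layer, repeating.
def layer_kind(i: int) -> str:
    """Every fourth layer is full quadratic attention; the rest are linear."""
    return "full" if (i + 1) % 4 == 0 else "linear"

print([layer_kind(i) for i in range(8)])
```

Because only the full-attention layers grow a KV cache with context length, the hybrid layout is where the memory savings come from.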
There’s also a 397B-A17B flagship from the initial release, but that needs 192GB+ memory and is a different conversation.
Qwen 3.5-35B-A3B: the star of the show
This is the model that changed the math for 16GB GPU owners.
35 billion total parameters across 256 experts, but only 3 billion activate per token (8 routed + 1 shared). You’re loading a 35B model into memory but computing through a 3B model on every forward pass. The speed reflects that: on an RTX 5060 Ti with 16GB VRAM, llama-bench measured 44.3 tok/s at 100K context.
That’s not a cherry-picked number at short context. That’s a hundred thousand tokens of conversation history, generating at speeds that feel instant.
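The memory-vs-compute split is just a ratio, using the parameter counts above:

```python
# Ratio behind "load 35B, compute through 3B" (figures from the article).
total_params, active_params = 35e9, 3e9   # 8 routed + 1 shared experts
compute_fraction = active_params / total_params
print(f"~{compute_fraction:.1%} of parameters do work on each forward pass")
```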
VRAM requirements: 35B-A3B
| Quantization | File size | Memory needed | Fits on |
|---|---|---|---|
| IQ3_XS | 14.5 GB | ~16 GB | RTX 4060 Ti 16GB (tight) |
| Q3_K_M | 16.1 GB | ~18 GB | 16GB GPU + some offload |
| Q4_K_M | 21.2 GB | ~22 GB | RTX 3090, 4090, 5090 |
| Q5_K_M | 24.8 GB | ~26 GB | RTX 3090 with limited context |
| Q6_K | 28.7 GB | ~30 GB | RTX 4090/5090 |
| Q8_0 | 36.9 GB | ~38 GB | 48GB+ GPU or Mac 48GB+ |
| BF16 | 69.4 GB | ~70 GB | Multi-GPU or Mac 96GB+ |
The GQA design (only 2 KV heads) keeps the KV cache tiny. Even at 262K context with Q4 weights, total VRAM stays around 25 GB. That’s the whole model plus full context fitting on a single RTX 3090.
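You can sanity-check that with rough cache arithmetic. Only the 2 KV heads and the 262K window come from the figures above; the layer count, head dimension, and full-attention layer count below are illustrative assumptions, not published values:

```python
# Rough KV cache estimate under stated assumptions.
def kv_cache_gb(ctx_len, n_kv_heads=2, head_dim=128,
                n_full_attn_layers=12, bytes_per_elem=2):
    """K and V caches exist only for the full-attention layers;
    linear-attention layers carry fixed-size state instead."""
    elems = 2 * ctx_len * n_kv_heads * head_dim * n_full_attn_layers
    return elems * bytes_per_elem / 1e9

print(f"FP16 cache at 262K: ~{kv_cache_gb(262_144):.1f} GB")
print(f"Q8 cache at 262K:   ~{kv_cache_gb(262_144, bytes_per_elem=1):.1f} GB")
```

A cache in the low single-digit gigabytes is consistent with "Q4 weights plus full context in ~25 GB".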
Speed benchmarks: 35B-A3B
| Hardware | Quant | Context | Generation (tok/s) | Prompt (tok/s) |
|---|---|---|---|---|
| RTX 5090 32GB | Q4_K_XL | 512 | 194 | 7,026 |
| RTX 5090 32GB | Q4_K | 262K | 97.3 | 2,003 |
| RTX 5080 16GB | Q4_K_M | 4K | ~75 | – |
| RTX 5060 Ti 16GB | Q4 | 100K | 44.3 | – |
| RTX 3090 24GB | Q4_K | 4K | 111.2 | 2,622 |
| RTX 3090 24GB | Q4_K | 131K | 79.4 | 1,288 |
| AMD R9700 (Vulkan) | Q4_K_XL | 512 | 127.4 | 2,713 |
| Strix Halo | Q8 | 4K | 38.5 | 960 |
| Tesla V100 32GB | Q5_K_XL | 128 | 38.4 | 570 |
| Mac M4 Max 64GB | Q4 MLX | – | ~70 | – |
| CPU (DDR5) | Q4 | – | ~5-6 | – |
The RTX 3090 at 111 tok/s is worth lingering on. That’s a $700-800 used card doing over a hundred tokens per second with a model that scores 69.2 on SWE-bench Verified. For context, Qwen3-32B (the previous generation dense model) only hit around 30-35 tok/s on the same hardware because all 32B parameters activated every token.
MoE is real.
For Mac users, there’s a dedicated MLX vs Ollama comparison with speed benchmarks across every Apple Silicon tier.
Qwen 3.5-27B: the smartest model you can fit on one card
The 27B is the only dense model in the Qwen 3.5 family. All 27 billion parameters fire on every forward pass. No expert routing, no MoE overhead, no activation-ratio tricks. It is slower per token than the 35B-A3B, but the per-token quality is higher because every parameter contributes to every prediction.
How much higher? Artificial Analysis gave it an Intelligence Index score of 42 out of 51, ranking it #1 among all open-weight models in the 4B-40B class. That matches DeepSeek-V3.2 on reasoning.
How 27B compares to 35B-A3B
| Benchmark | 27B dense | 35B-A3B MoE | Winner |
|---|---|---|---|
| GPQA Diamond | 85.5 | 84.2 | 27B |
| MMLU-Pro | 86.1 | 85.3 | 27B |
| LiveCodeBench v6 | 80.7 | 74.6 | 27B |
| SWE-bench Verified | 72.4 | 69.2 | 27B |
| IFEval | 95.0 | 91.9 | 27B |
| CodeForces | 1,899 | 2,028 | 35B |
| BFCL-V4 (tool use) | 68.5 | 67.3 | 27B |
The 27B wins on every reasoning and coding benchmark except competitive programming (CodeForces). The 35B-A3B wins on speed by 3-5x. Pick based on your bottleneck: if you’re waiting on the model to think harder, use the 27B. If you’re waiting on the model to type faster, use the 35B-A3B.
VRAM requirements: 27B dense
| Quantization | File size | Memory needed | Fits on |
|---|---|---|---|
| IQ4_XS | 14.7 GB | ~16 GB | RTX 4060 Ti 16GB (best fit) |
| Q4_K_S | 15.6 GB | ~17 GB | 16GB GPU with room for short context |
| Q4_K_M | 16.5 GB | ~18 GB | 16GB GPU (tight) or 24GB GPU |
| Q5_K_M | 19.4 GB | ~21 GB | RTX 3090 24GB |
| Q6_K | 22.7 GB | ~24 GB | RTX 3090 24GB (limited context) |
| Q8_0 | 28.6 GB | ~30 GB | RTX 4090/5090 or Mac 32GB+ |
| BF16 | 53.8 GB | ~54 GB | Multi-GPU or Mac 64GB+ |
16GB warning: Q4_K_M at 16.5 GB technically fits a 16GB card, but there’s almost no room for KV cache. Practical context will be limited to 4K-8K tokens before you OOM. For 16GB GPUs, use IQ4_XS (14.7 GB) or Q4_K_S (15.6 GB) and accept shorter context. On a 24GB card, Q4_K_M runs with full 131K context at around 24 GB total.
Speed benchmarks: 27B dense
| Hardware | Quant | Context | Generation (tok/s) | Prompt (tok/s) |
|---|---|---|---|---|
| RTX 3090 24GB | Q4_K | 4K | 33.5 | 1,104 |
| RTX 3090 24GB | Q4_K | 86K | 27.5 | 599 |
About 3x slower than the 35B-A3B on the same GPU. This is the MoE vs dense tradeoff in one table: the 35B-A3B hits 111 tok/s where the 27B hits 33.5 tok/s on identical hardware. Both produce good output. The 27B produces slightly better output on average, at one-third the speed.
For coding tasks where quality matters more than throughput – debugging a tricky function, writing a complex SQL query, reviewing a PR – the 27B is worth the wait. For chat, brainstorming, and tool-calling agents where you want snappy responses, the 35B-A3B wins.
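The tradeoff is easier to feel in wall-clock terms. Using the RTX 3090 numbers above:

```python
# Wall-clock feel of the ~3x generation-speed gap for a 1,000-token
# response (generation only, ignoring prompt processing).
times = {name: 1000 / tps
         for name, tps in [("35B-A3B MoE", 111.2), ("27B dense", 33.5)]}
for name, secs in times.items():
    print(f"{name}: ~{secs:.0f}s for a 1,000-token response")
```

Nine seconds versus thirty: fine for a code review you read carefully, painful for rapid-fire chat.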
Qwen 3.5-122B-A10B: multi-GPU territory
The 122B is the model for people who have more hardware than patience. 122 billion total parameters, 10 billion active per token (8 routed + 1 shared from 256 experts). Apache 2.0. It ties GPT-5 mini on SWE-bench (72.0) and beats it on tool use by 30% (BFCL-V4: 72.2 vs 55.5).
VRAM requirements: 122B MoE
| Quantization | File size | Memory needed | Fits on |
|---|---|---|---|
| Q3_K_M | ~59 GB | ~62 GB | Mac 64GB tight, 3x 3090 with offload |
| Q4_K_M | 74.4 GB | ~76 GB | Mac 96GB+, 4x 3090, 2x 4090 |
| Q5_K_M | 87.1 GB | ~90 GB | Mac 128GB, multi-GPU |
| Q6_K | 100.8 GB | ~106 GB | Mac 128GB+ |
| Q8_0 | 129.9 GB | ~132 GB | Mac 192GB or 2x A100 80GB |
| BF16 | 244 GB | ~245 GB | Multi-GPU cluster |
Three RTX 3090s (72GB total) can run the Q3-level quant with full GPU offload, or Q4_K_M with partial system RAM spillover. A 64GB Mac runs the model at Q3 via MLX. A 96GB+ Mac runs Q4_K_M comfortably.
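A quick sanity check on those file sizes is to divide total bits by parameter count. (This treats a GB as 10^9 bytes, which is an assumption about how the sizes were reported.)

```python
# Effective bits per weight implied by the file sizes in the table.
def bits_per_weight(file_gb: float, params: float = 122e9) -> float:
    return file_gb * 8e9 / params

for quant, gb in [("Q3_K_M", 59), ("Q4_K_M", 74.4), ("Q8_0", 129.9)]:
    print(f"{quant}: ~{bits_per_weight(gb):.1f} bits/weight")
```

The Q4_K_M figure lands near 4.9 bits/weight, which is typical for that quant family: the "4-bit" label undersells it slightly because some tensors are kept at higher precision.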
Benchmarks: 122B vs the competition
| Benchmark | 122B-A10B | GPT-5 mini | Claude Sonnet 4.5 |
|---|---|---|---|
| SWE-bench Verified | 72.0 | 72.0 | 62.0 |
| BFCL-V4 (tool use) | 72.2 | 55.5 | – |
| Terminal Bench 2 | 49.4 | 31.9 | – |
| GPQA Diamond | 86.6 | 85.7 | 83.4 |
| MMLU-Pro | 86.7 | – | – |
| CodeForces | 2,100 | – | – |
This is a model that competes with commercial APIs while running on hardware you own. The tool use gap (72.2 vs 55.5) is why agent builders are paying attention.
For local inference speed, the DGX Spark (128GB unified, NVFP4) benchmarked at 8-15 tok/s depending on mode. Consumer multi-GPU numbers are still sparse since the model is four days old.
Which model should you run?
| Your hardware | Model | Quant | What you get |
|---|---|---|---|
| RTX 3060 12GB | 35B-A3B | IQ3_XS | Fits, decent quality, ~25-30 tok/s |
| RTX 4060 Ti 16GB | 35B-A3B | Q4_K_M | Sweet spot. 44 tok/s at 100K context |
| RTX 4060 Ti 16GB | 27B dense | IQ4_XS | Smarter, slower, short context |
| RTX 3090 24GB | 35B-A3B | Q4_K_M | 111 tok/s. Full 262K context. |
| RTX 3090 24GB | 27B dense | Q5_K_M | Best reasoning quality on one card |
| RTX 4090/5090 | 35B-A3B | Q6_K-Q8 | Overkill speed, high quality |
| 2-3x 3090/4090 | 122B-A10B | Q3-Q4 | API-competitive output, your hardware |
| Mac 16GB | 35B-A3B | Q4 (MLX) | Tight but works. MLX route preferred. |
| Mac 32-48GB | 27B or 35B | Q4-Q8 | Both comfortable. Pick by use case. |
| Mac 64GB+ | 122B-A10B | Q4 (MLX) | Runs Q4 comfortably |
| CPU only (DDR5) | 35B-A3B | Q4 | ~5-6 tok/s. Usable for async tasks. |
If you own one GPU and want one model, get the 35B-A3B at Q4_K_M. It fits everywhere from 16GB to 32GB, runs fast, and the quality gap vs the 27B is small outside of hard reasoning tasks.
The overthinking problem
Qwen 3.5 defaults to thinking mode. Every response starts with a <think>...</think> block where the model reasons through the problem before answering. For complex math or multi-step coding, this helps. For “what time zone is Tokyo in?”, the model burns hundreds of tokens second-guessing itself before stating the obvious.
Unlike Qwen 3, Qwen 3.5 does not support /think and /nothink commands in prompts. The model cards say this explicitly. You cannot flip thinking on and off per-message the way you could with Qwen 3.
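If you only need to hide the reasoning rather than prevent it, a blunt fallback is to strip the block client-side after generation. A minimal sketch:

```python
import re

# Hide the <think> block client-side. Note this does NOT save any
# tokens: the model still generated (and you still waited for) them.
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_thinking(text: str) -> str:
    return THINK_RE.sub("", text).strip()

print(strip_thinking("<think>JST is UTC+9...</think>Tokyo is in JST (UTC+9)."))
```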
How to disable thinking
llama.cpp:
```shell
llama-server -m qwen3.5-35b-a3b-q4_k_m.gguf \
  --chat-template-kwargs '{"enable_thinking": false}' \
  -c 131072 -ngl 99
```
Ollama: Thinking is handled at the template level. As of v0.17.4, there’s no clean per-request toggle. Workaround: create a custom Modelfile with a system prompt that says “Do not use chain-of-thought reasoning. Answer directly.” This reduces (but doesn’t eliminate) thinking tokens.
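That Modelfile might look something like this (the base tag and system prompt wording are illustrative, not official):

```shell
# Sketch only: adjust the FROM tag to the model you pulled.
cat > Modelfile <<'EOF'
FROM qwen3.5:35b-a3b
SYSTEM """Do not use chain-of-thought reasoning. Answer directly."""
EOF

ollama create qwen35-direct -f Modelfile
ollama run qwen35-direct
```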
vLLM / OpenAI-compatible API:
```python
from openai import OpenAI

# Point the client at your vLLM (or other OpenAI-compatible) endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3.5-35B-A3B",
    messages=[{"role": "user", "content": "What time zone is Tokyo in?"}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
```
Token budget approach: If you want thinking but capped, the thinking_budget API parameter limits how many reasoning tokens the model generates before it’s forced to answer. Set it to 200-500 tokens for most tasks. NVIDIA NIM implements this as nvext.max_thinking_tokens.
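Where your serving stack doesn't expose a budget parameter, the same idea can be approximated client-side by counting streamed tokens inside the think block and force-closing it at the cap. A sketch of the accounting logic (simplified: it assumes `<think>` and `</think>` arrive as standalone stream items, which real tokenizers don't guarantee):

```python
def apply_thinking_budget(tokens, budget=300):
    """Cap reasoning at `budget` tokens: once the cap is hit, emit a
    closing </think> and drop the rest of the reasoning tokens.
    `tokens` is an iterable of streamed token strings."""
    out, in_think, dropping, spent = [], False, False, 0
    for tok in tokens:
        if tok == "<think>":
            in_think, spent = True, 0
            out.append(tok)
        elif tok == "</think>":
            if not dropping:            # block closed naturally, under budget
                out.append(tok)
            in_think = dropping = False
        elif in_think:
            spent += 1
            if spent <= budget:
                out.append(tok)
            elif not dropping:
                out.append("</think>")  # force the answer to start
                dropping = True          # discard remaining reasoning
        else:
            out.append(tok)
    return out

print(apply_thinking_budget(
    ["<think>", "step1", "step2", "step3", "</think>", "42"], budget=2))
```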
The quant situation
Q4_K_M is the sweet spot
For most setups, Q4_K_M is the right quant. It balances file size, quality, and speed. The quality gap between Q4_K_M and Q8_0 exists but is small for general use.
Unsloth GGUF bug (now fixed)
Unsloth’s initial “Dynamic 2.0” GGUF quants (UD-Q2_K_XL, UD-Q3_K_XL, UD-Q4_K_XL) had a bug: they applied MXFP4 precision to attention tensors where it caused quality degradation. Symptoms included garbled output and repetition loops, especially on the 122B model.
Fixed on February 27, 2026. Unsloth retired MXFP4 from all XL quant recipes. If you downloaded Qwen 3.5 GGUFs before that date, re-download them. The standard Q4_K_M quants from bartowski or lmstudio-community were never affected.
KV cache at Q8: free VRAM
Quantizing the KV cache from FP16 to Q8_0 halves the VRAM used for context with essentially no quality loss. Perplexity increase is 0.0043. That’s nothing.
llama.cpp:
```shell
llama-server -m model.gguf -ctk q8_0 -ctv q8_0 -c 131072 -ngl 99
```
Ollama:
```shell
OLLAMA_KV_CACHE_TYPE=q8_0 OLLAMA_FLASH_ATTENTION=1 ollama serve
```
Flash Attention must be enabled for KV cache quantization. On the 35B-A3B, this saves roughly 2-3 GB at long context – enough to squeeze in an extra 50K tokens of conversation history on a 24GB card.
llama.cpp --fit: stop guessing layer counts
llama.cpp added automatic VRAM fitting in PR #16653 (merged December 2025). The --fit flag is on by default in recent builds. It does virtual test allocations and iteratively adjusts layer offloading until the model maximizes your GPU utilization.
For MoE models specifically, --fit prioritizes keeping dense layers (attention, embedding) in VRAM and spills sparse expert weights to system RAM first. This is the right tradeoff: expert routing means only ~3% of expert weights activate per token, so the RAM latency penalty is small for those tensors.
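The arithmetic behind that claim, using the expert counts from earlier in this guide:

```python
# 8 routed + 1 shared of 256 experts are touched per token,
# consistent with the ~3% figure above.
active, total = 9, 256
print(f"~{active / total:.1%} of expert weights read per token")
```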
If you’re manually setting --n-gpu-layers, you’re disabling --fit. Unless you have a specific reason, let the automation handle it.
Ollama tool calling: fixed, mostly
v0.17.3 (February 27, 2026): Fixed parsing of tool calls emitted during thinking mode. Before this, if Qwen 3.5 decided to call a tool while still inside a <think> block, the tool call was silently dropped.
v0.17.4 (February 27, 2026): Added stable indices for parallel tool calls. Also added official Qwen 3.5 model tags to the Ollama library.
Still broken: The renderer side of multi-turn tool calling has issues – prompts sent back to the model can contain unclosed <think> tags, corrupting subsequent turns. Penalty sampling (repeat_penalty, presence_penalty) is also silently ignored by the Go runner, which matters because Qwen 3.5’s official sampling parameters include presence_penalty=1.5 to prevent repetition loops.
If you’re building agents with Qwen 3.5 tool calling, test with llama.cpp or vLLM first. Ollama works for basic tool use but gets unreliable in multi-turn agentic loops.
How to run it
Ollama (simplest)
```shell
# 35B-A3B (recommended for most users)
ollama pull qwen3.5:35b-a3b
ollama run qwen3.5:35b-a3b

# 27B dense
ollama pull qwen3.5:27b
ollama run qwen3.5:27b

# 122B MoE (needs ~76GB)
ollama pull qwen3.5:122b
ollama run qwen3.5:122b
```
llama.cpp (more control)
```shell
# Download from HuggingFace
huggingface-cli download bartowski/Qwen_Qwen3.5-35B-A3B-GGUF \
  --include "*Q4_K_M*" \
  --local-dir ./qwen35

# Run with KV cache quantization + auto-fit
llama-server \
  -m ./qwen35/Qwen3.5-35B-A3B-Q4_K_M.gguf \
  -c 131072 \
  -ctk q8_0 -ctv q8_0 \
  -ngl 99
```
Mac (MLX for speed, Ollama for ecosystem)
MLX runs the 35B-A3B roughly 2x faster than Ollama on Apple Silicon. See the MLX vs Ollama benchmark guide for setup instructions and speed numbers across every chip tier.
```shell
# MLX route (fastest on Mac)
pip install mlx-lm
mlx_lm.generate --model mlx-community/Qwen3.5-35B-A3B-4bit \
  --prompt "Explain MoE architecture" --max-tokens 500
```
Known issues (February 2026)
Repetition in thinking mode: Without presence_penalty=1.5, the model tends to loop during extended reasoning. Set this in your inference config. Ollama silently ignores penalty parameters as of v0.17.4 – use llama.cpp or vLLM if you need reliable penalty sampling.
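Since server defaults can't be trusted to apply it, one option is to pin the penalty in every request payload yourself. A sketch (the model name is a placeholder for your deployment; only the presence_penalty value comes from the recommendation above):

```python
def build_request(messages, model="qwen3.5-35b-a3b"):
    """Build a chat-completions payload that pins the sampling
    parameters per request instead of trusting server defaults."""
    return {
        "model": model,
        "messages": messages,
        "presence_penalty": 1.5,  # official recommendation cited above
    }

payload = build_request([{"role": "user", "content": "hello"}])
print(payload["presence_penalty"])
```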
Vision model assertion in llama.cpp: An assertion in the multimodal code path prevents KV cache reuse, forcing full prompt reprocessing on every turn. If you’re text-only, don’t pass --mmproj. A fix PR exists but hasn’t merged.
27B CUDA eval bug: Issue #19860 reports CUDA errors during evaluation. Make sure you’re on the latest llama.cpp build.
Qwen 3.5 is NOT Qwen 3. The model architecture is different (Gated DeltaNet vs standard transformer). Chat templates differ. The /think and /nothink toggles from Qwen 3 do not work. Don’t assume your Qwen 3 configs transfer cleanly.
Bottom line
The 35B-A3B is the model that matters for most local AI builders. It runs on a $450 GPU at speeds that feel like a cloud API, scores within 5% of the 27B dense on most benchmarks, and fits more context than you’ll probably use. The 27B is the choice when quality per token matters more than throughput. The 122B is for multi-GPU setups that want output matching commercial APIs without the monthly bill.
Q4_K_M, KV cache at Q8, and let --fit handle the layer allocation. That’s the setup. The rest is just choosing your model.