Best Qwen 3.5 Setup: Which Model Fits Your GPU (Complete Cheat Sheet)
Qwen 3.5 dropped on February 24, 2026, and it changes the local AI math. Four models spanning 27B to 397B parameters, all Apache 2.0 licensed, all natively multimodal (text + image + video), and the MoE variants run faster than models a fraction of their size.
The 35B-A3B hits 194 tok/s on an RTX 5090. The 27B dense model scores 72.4 on SWE-bench Verified, matching GPT-5 mini. The 122B-A10B beats GPT-5 mini by 30% on tool-use benchmarks while running on a Mac Studio.
This is the complete guide to running them locally: which model fits your GPU, which quantization to pick, and where each one actually excels.
Qwen 3.6 update (April 2026)
Qwen 3.6-35B-A3B dropped in mid-April, and it’s the model r/LocalLLaMA has been talking about for a week straight. Same architecture as 3.5-35B-A3B — MoE with 3B active parameters, 262K context, Apache 2.0 — but the benchmarks moved in the right direction across the board.
Per the Unsloth GGUF card, here’s the head-to-head against the 3.5 version of the same-sized model:
| Benchmark | Qwen3.5-35B-A3B | Qwen3.6-35B-A3B |
|---|---|---|
| SWE-bench Verified | 70.0 | 73.4 |
| SWE-bench Multilingual | 60.3 | 67.2 |
| SWE-bench Pro | 44.6 | 49.5 |
| Terminal-Bench 2.0 | 40.5 | 51.5 |
| MCPMark (tool use) | 27.0 | 37.0 |
| GPQA | 84.2 | 86.0 |
| AIME 2026 | 91.0 | 92.7 |
| MMLU-Pro | 85.3 | 85.2 |
Terminal-Bench 2.0 jumping from 40.5 to 51.5, and MCPMark from 27.0 to 37.0, are the numbers that matter for local coding agents. This is the model people are swapping in under OpenCode and Codex-compatible harnesses. The developer-role support is built in — it’s a deliberate target for agentic coding, not a by-product.
Why people are switching from 3.5-122B and even Opus 4.7: the 3.5-122B-A10B is the better model on paper, but it needs 70GB+ to run at Q4. The 3.6-35B-A3B gets most of the agent-coding win in a footprint you can run on a laptop. Simon Willison’s post — “Qwen3.6-35B-A3B on my laptop drew me a better pelican than Claude Opus 4.7” — is half joke, half the point. His own caveat: “I very much doubt that a 21GB quantized version of their latest model is more powerful or useful than Anthropic’s latest proprietary release.” Treat it as “close enough on narrow tasks that you should test it yourself,” not “open weights now beat frontier.”
Hardware envelope (per Unsloth’s how-to-run guide):
| Quant | Memory needed |
|---|---|
| UD-IQ1_M | 10 GB |
| UD-Q3_K_M | 17 GB |
| UD-Q4_K_M | 23 GB |
| UD-Q5_K_M | 26.5 GB |
| Q8_0 | 37 GB |
| BF16 | 70 GB |
A 24GB GPU runs UD-Q4_K_M comfortably. On a 16GB card, drop to UD-Q3_K_M. On an 8GB laptop GPU with 64-96GB system RAM, the Unsloth guide confirms SSD/RAM offloading works — r/LocalLLaMA users report RTX 4060 laptop + 96GB configs running at acceptable token rates for agentic workflows. Unsloth’s own Mean KL Divergence testing claims their Dynamic GGUFs sit on “the SOTA Pareto frontier,” ranking first in 21 of 22 sizes tested against BF16 reference distributions — attributed to Unsloth, not independently verified.
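If you're squeezing the model onto a small GPU, llama.cpp's tensor-override flag can keep the attention layers on the GPU while pinning the MoE expert weights in system RAM. A minimal sketch, assuming the expert tensors follow the ffn_*_exps naming used by earlier Qwen MoE GGUFs (verify against your file if the pattern matches nothing):

```bash
# Offload MoE expert tensors to system RAM; attention stays on the GPU.
# The -ot regex is an assumption based on prior Qwen MoE GGUF tensor names.
./llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M \
  --jinja -ngl 99 \
  -ot ".ffn_.*_exps.=CPU"
```

Expect slower generation than an all-VRAM run, but this is the trick behind the 8GB-GPU-plus-96GB-RAM configs mentioned above.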
The Ollama gotcha. As of this writing, Qwen 3.6 GGUFs don’t load in Ollama. Per the Unsloth docs: “Currently no Qwen3.6 GGUF works in Ollama due to separate mmproj vision files.” If you want 3.6 today, use llama.cpp, LM Studio, vLLM, or SGLang. Ollama support will follow — this is a format issue, not a model issue.
Qwen3.6-Max-Preview. Alibaba also released Qwen3.6-Max-Preview on Qwen Chat on April 20, 2026. It scores 52 on the Artificial Analysis Intelligence Index (#3 of 203 models tested), the highest among Chinese models and competitive with frontier closed systems. Honest flag: this one is closed-weights, not open source. It’s only accessible via Qwen Chat and API. So while it’s newsworthy for the category, it doesn’t change what you can run locally — the open 3.6-35B-A3B is still the one that matters for this guide.
Should you upgrade from 3.5-35B-A3B? Yes, if you can tolerate switching off Ollama for now. The benchmark deltas are real and mostly one-sided. If you’re running an Ollama-based setup today and can’t break your tooling, stay on 3.5 for a few more weeks — it’s still a strong model, and 3.6 support will land in Ollama soon.
The rest of this guide still covers the Qwen 3.5 lineup in full, because 3.5 is still in heavy production use, drives most of the search traffic landing on this page, and the 27B dense and 122B-A10B variants don’t have direct 3.6 equivalents yet.
The lineup
| Model | Total Params | Active Params | Architecture | Context | GGUF Q4 Size |
|---|---|---|---|---|---|
| Qwen3.5-27B | 27B | 27B (dense) | Dense + Hybrid Attention | 262K | ~17 GB |
| Qwen3.5-35B-A3B | 35B | 3B | MoE + Hybrid Attention | 262K | ~22 GB |
| Qwen3.5-122B-A10B | 122B | 10B | MoE + Hybrid Attention | 262K | ~70 GB |
| Qwen3.5-397B-A17B | 397B | 17B | MoE + Hybrid Attention | 262K | ~214 GB |
All four models support 262K context natively (1M via YaRN extension), 201 languages, thinking/non-thinking modes, and multi-token prediction for speculative decoding. FP8 weights are available for every size.
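If you actually need the 1M window, llama.cpp exposes YaRN through rope-scaling flags. A hedged sketch: the scale factor of 4 is just 262K x 4 ≈ 1M, so check the Qwen 3.5 model card for the officially recommended YaRN values before relying on it:

```bash
# Extend the native 262K context to ~1M via YaRN (assumed scale factor 4).
# --yarn-orig-ctx must match the native training context.
# At this length the KV cache dominates memory; budget accordingly.
./llama-server -hf Qwen/Qwen3.5-35B-A3B-GGUF:Q4_K_M \
  --jinja -ngl 99 \
  -c 1048576 \
  --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 262144
```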
The architecture is new. Qwen 3.5 uses a hybrid of Gated DeltaNet (linear attention) and full attention in a 3:1 ratio: three DeltaNet layers for every one full attention layer. The linear attention layers scale near-linearly with sequence length, which is why these models handle 262K context without the speed cliff you'd expect.
The 35B-A3B: the one most people should run
The 35B-A3B is the successor to the community favorite Qwen3-30B-A3B. It’s a Mixture of Experts model with 35 billion total parameters but only 3 billion active per token. That means it runs at small-model speeds while drawing from large-model knowledge.
Real-world speeds:
| GPU | Quantization | Token Generation | Prompt Processing |
|---|---|---|---|
| RTX 5090 (CUDA) | Q4_K_XL | 194 tok/s | 7,026 tok/s |
| AMD R9700 (Vulkan) | Q4_K_XL | 127 tok/s | 2,713 tok/s |
| DGX Spark | Q5 (UD-Q5_K_XL) | 58.6 tok/s | 1,861 tok/s |
| DGX Spark | Q8 (UD-Q8_K_XL) | 35.9 tok/s | 1,733 tok/s |
| Tesla V100 32GB | GGUF | 38.4 tok/s | 570 tok/s |
194 tok/s on an RTX 5090. That’s faster than most 7B models ran a year ago. Even the V100 (a card you can find used for $300-400) manages 38 tok/s.
The benchmarks are strong for a model this fast:
| Benchmark | 35B-A3B | GPT-5 mini | Claude Sonnet 4.5 |
|---|---|---|---|
| MMLU-Pro | 85.3 | 83.7 | 80.8 |
| GPQA Diamond | 84.2 | 82.8 | 80.1 |
| SWE-bench Verified | 69.2 | 72.0 | 62.0 |
| BFCL-V4 (Tool Use) | 67.3 | 55.5 | 54.8 |
| BrowseComp | 61.0 | 48.1 | 41.1 |
| TAU2-Bench (Agentic) | 81.2 | – | – |
That BFCL-V4 score deserves attention. 67.3 vs GPT-5 mini’s 55.5 on function calling and tool use. For anyone building local AI agents, this is the new default model to test.
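If you want to sanity-check that tool-use strength yourself, llama-server's OpenAI-compatible endpoint accepts tool definitions when the server is started with --jinja. A minimal sketch; get_weather is a made-up function for illustration, not part of any benchmark:

```bash
# Ask the model to emit a function call (server started with --jinja).
curl http://localhost:8080/v1/chat/completions -d '{
  "model": "qwen3.5-35b-a3b",
  "messages": [{"role": "user", "content": "What is the weather in Berlin?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }]
}'
```

A model scoring well on BFCL should come back with a tool_calls entry carrying {"city": "Berlin"} rather than answering in prose.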
The vision capabilities are also native. The 35B-A3B scores 81.4 on MMMU and 91.5 on MMBench, performing close to models 7x its size on visual benchmarks. You don’t need a separate vision model anymore.
The caveat: the 35B-A3B scores lower than the 27B dense model on all coding benchmarks. SWE-bench 69.2 vs 72.4. LiveCodeBench 74.6 vs 80.7. Early community reports also flag that it produces broken diffs and hallucinates APIs on real repo work. If sustained coding is your primary use case, read the 27B section below.
The 27B dense: the coder’s pick
The 27B is the dense model in the family, replacing the older Qwen3-32B. Every parameter is active on every token, which means slower generation but deeper reasoning per token than the MoE variants.
| Benchmark | 27B | 35B-A3B | GPT-5 mini |
|---|---|---|---|
| SWE-bench Verified | 72.4 | 69.2 | 72.0 |
| LiveCodeBench v6 | 80.7 | 74.6 | 80.5 |
| Terminal-Bench 2 | 41.6 | 40.5 | 31.9 |
| HMMT Feb 2025 (Math) | 92.0 | 89.0 | 89.2 |
72.4 on SWE-bench Verified matches GPT-5 mini exactly. Terminal-Bench 2 at 41.6 crushes GPT-5 mini’s 31.9. This is a competitive coding model at a size that fits on a single 24GB GPU at Q4.
At Q8 quantization, the 27B needs about 30GB. That puts it in A6000 (48GB) or Mac M-series territory for the highest quality quant. At Q4_K_M, it fits on a 4090 with room to spare at ~17GB.
After llama.cpp PR #19866 fixed multi-GPU graph splits, users report the 27B running across an RTX 3090 + RTX 3070 at over 700 tok/s prompt processing and 20+ tok/s generation using -ts 85,15. Multi-GPU setups are viable if you have the cards.
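For reference, that setup reduces to a single flag. A sketch, assuming a Qwen/Qwen3.5-27B-GGUF repo name (check the actual GGUF source you use):

```bash
# Split tensors ~85/15 across two CUDA devices (e.g. 3090 + 3070).
# Requires a build that includes the PR #19866 graph-split fix.
./llama-server -hf Qwen/Qwen3.5-27B-GGUF:Q4_K_M \
  --jinja -ngl 99 -ts 85,15
```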
Pick the 27B over the 35B-A3B if your primary task is coding and you want the densest reasoning per token on a 24GB GPU. Pick the 35B-A3B if you need speed, tool use, or mixed workloads.
The 122B-A10B: the Mac Studio model
The 122B-A10B sits between the consumer models and the flagship. 122 billion total parameters, 10 billion active, built for machines with 48GB+ of unified memory.
| Benchmark | 122B-A10B | GPT-5 mini | Claude Sonnet 4.5 |
|---|---|---|---|
| MMLU-Pro | 86.7 | 83.7 | 80.8 |
| SWE-bench Verified | 72.0 | 72.0 | 62.0 |
| Terminal-Bench 2 | 49.4 | 31.9 | 18.7 |
| BFCL-V4 (Tool Use) | 72.2 | 55.5 | 54.8 |
| BrowseComp | 63.8 | 48.1 | 41.1 |
Terminal-Bench 2 at 49.4 vs GPT-5 mini’s 31.9 is not a close race. BFCL-V4 at 72.2 vs 55.5 is a 30% margin on tool use. If you have the hardware, this model outperforms GPT-5 mini on most tasks while running entirely on your machine.
At Q4 quantization, the 122B needs ~70GB. That means a Mac Studio with 96GB+ unified memory, a system with dual GPUs totaling 80GB+, or server hardware like the DGX Spark. Not consumer-friendly, but if you already have an M4 Max or Ultra Mac, this is the model that justifies the investment.
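On a Mac, llama.cpp picks the Metal backend automatically, so the run command is the same as on CUDA. A sketch, with the GGUF repo name assumed (check your actual source):

```bash
# Apple Silicon: Metal is used by default; -ngl 99 keeps all layers on the GPU
# side of unified memory. Leave headroom above the ~70GB weights for context.
./llama-server -hf Qwen/Qwen3.5-122B-A10B-GGUF:Q4_K_M \
  --jinja -ngl 99 -c 32768
```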
The 397B-A17B flagship
The flagship runs at 45 tok/s on 8xH100s with 8.6x faster decoding than Qwen3-Max. At Q4 quantization it needs ~214GB, which is more than even 192GB of unified memory can hold. Realistically that means dedicated server hardware or a top-spec Mac Ultra; a 128GB Strix Halo box would need far more aggressive quants (Q2 or below).
It beats GPT-5.2 on instruction following (IFBench 76.5 vs 75.4, the highest score of any model tested) and MultiChallenge (67.6 vs 57.9). It trails GPT-5.2 on AIME 2026 (91.3 vs 96.7) and SWE-bench (76.4 vs 80.0). Competitive with the best frontier models, but you need serious hardware.
For most local AI users, this is an aspirational model. The 122B or 35B-A3B cover 95% of use cases at a fraction of the hardware cost.
Which model on which GPU
| Your Hardware | VRAM | Best Qwen 3.5 Model | Quantization | Expected Speed |
|---|---|---|---|---|
| RTX 3060 / 4060 (8GB) | 8 GB | 35B-A3B | Q2-Q3 (tight) | Usable but slow |
| RTX 3060 12GB | 12 GB | 35B-A3B | Q4_K_M | ~30-40 tok/s |
| RTX 5060 Ti / 5080 (16GB) | 16 GB | 35B-A3B | Q4_K_L or Q4_K_XL | ~40-60 tok/s |
| RTX 4090 / 3090 (24GB) | 24 GB | 27B at Q4 or 35B-A3B at Q8 | Q4_K_M / Q8 | 20-60 tok/s |
| A6000 / dual GPU (48GB) | 48 GB | 27B at Q8 or 122B-A10B at Q4 | Q8 / Q4_K_M | 15-35 tok/s |
| Mac M4 Max (64GB) | 64 GB | 122B-A10B | Q4_K_M | Varies |
| Mac Ultra / Strix Halo (128GB+) | 128 GB+ | 397B-A17B | Q2-Q3 (Q4 needs ~214 GB) | Server-class |
The 16GB tier is the sweet spot in 2026. The RTX 5060 Ti 16GB and RTX 5080 16GB both run the 35B-A3B at Q4 with modest spillover into system RAM (the Q4 file is ~22GB, so a few expert layers get offloaded), and the MoE architecture means you're getting 35B-class knowledge from a model that only activates 3B parameters per token. Check exact VRAM figures with the VRAM Calculator and our VRAM requirements guide.
On a 4090 or 3090, you have a real choice. The 27B at Q4_K_M (~17GB) leaves room for other tools and gives you the strongest coding model in the family. The 35B-A3B at Q8 (~22GB, higher quality quant) gives you faster generation and better tool use. If you do both coding and agentic work, keep both in Ollama and swap between them.
Quantization: which quant matters
Unsloth’s benchmarks on the 35B-A3B show real quality differences between quants:
| Quantization | Top-1 Token Agreement | Notes |
|---|---|---|
| Q4_K_L | 89% | Best quality retention at 4-bit |
| MXFP4 | Good (PPL +1.38) | New format, fast |
| UD-Q4_K_XL | 79.4% | Lowest quality at 4-bit |
Q4_K_L retains the highest quality. If your GPU can fit it, prefer Q4_K_L over Q4_K_XL.
For the 397B flagship, Unsloth reports UD-Q4_K_XL stays within 1 point of accuracy on most benchmarks despite reducing the file by ~500GB. At that scale, aggressive quantization hurts less.
Other findings worth knowing:
- 8-bit KV cache improves output quality when running 4-bit model quants (see the flag sketch after this list)
- Q3_K_XL reportedly beats Q4 on some benchmarks (Unsloth finding), though this needs broader validation
- FP8 weights are available officially for all sizes, giving you a clean middle ground between full precision and GGUF quants
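For the KV-cache bullet above, the relevant llama.cpp flags are the cache-type overrides. A sketch; note that a quantized V cache needs flash attention, and the -fa flag syntax has shifted across builds (older builds take a bare -fa):

```bash
# 4-bit weights with an 8-bit KV cache.
# -ctk / -ctv set the K and V cache types; -fa on enables flash attention,
# which quantized V caches require on most builds.
./llama-server -hf Qwen/Qwen3.5-35B-A3B-GGUF:Q4_K_M \
  --jinja -ngl 99 -fa on \
  -ctk q8_0 -ctv q8_0
```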
How to run it
Ollama (simplest):
```bash
ollama run qwen3.5:35b
# or for the 27B:
ollama run qwen3.5:27b-q4_K_M
```
Requires Ollama v0.9.0 or higher. The multimodal support (images) works out of the box.
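To exercise the vision path, you can pass a base64-encoded image through Ollama's API. A minimal sketch (photo.png is a placeholder; on macOS use base64 -i instead of -w0):

```bash
# Describe a local image via the Ollama generate endpoint.
IMG=$(base64 -w0 photo.png)
curl http://localhost:11434/api/generate -d "{
  \"model\": \"qwen3.5:35b\",
  \"prompt\": \"Describe this image.\",
  \"images\": [\"$IMG\"],
  \"stream\": false
}"
```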
Note on Qwen 3.6: as of April 2026, Qwen 3.6 GGUFs don’t load in Ollama because of a separate mmproj vision file. Per Unsloth, the fix is upstream work in progress. For 3.6 today, use llama.cpp, LM Studio, vLLM, or SGLang.
llama.cpp (most control):
```bash
./llama-server -hf Qwen/Qwen3.5-35B-A3B-GGUF:Q4_K_M \
  --jinja --reasoning-format deepseek -ngl 99
```
Build from source or use a release after b5092. If you need multi-GPU, make sure you have the PR #19866 fix (merged Feb 24) or a build from after that date.
Disable thinking by default (saves tokens on simple tasks):
```bash
./llama-server -hf Qwen/Qwen3.5-35B-A3B-GGUF:Q4_K_M \
  --jinja --chat-template-kwargs '{"enable_thinking": false}' -ngl 99
```
You can also add /think or /nothink to individual messages to toggle per-request.
Recommended sampling parameters (from Qwen; the coding preset is shown as server flags after this list):
- General thinking mode: temperature 1.0, top_p 0.95, top_k 20, presence_penalty 1.5
- Coding mode: temperature 0.6, top_p 0.95, top_k 20, presence_penalty 0.0
- Non-thinking instruct: temperature 0.7, top_p 0.8, top_k 20, presence_penalty 1.5
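Applied to llama-server, the coding preset looks like this; these set server-side defaults, and per-request API parameters override them:

```bash
# Qwen's recommended coding-mode sampling as server defaults.
./llama-server -hf Qwen/Qwen3.5-35B-A3B-GGUF:Q4_K_M \
  --jinja -ngl 99 \
  --temp 0.6 --top-p 0.95 --top-k 20 --presence-penalty 0.0
```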
Known issues (day 2)
The Qwen 3.5 models dropped 24 hours before this section was written, so expect rough edges. Several of the issues below may already be fixed in newer builds.
llama.cpp crashes on some multi-GPU configurations when running the 27B (issue #19860, illegal memory access on dual 3090s). PR #19866 (merged Feb 24) fixes most graph split issues, but build from latest main to be safe.
There’s a 35% speed regression vs Qwen3 on CUDA. Issue #19894 shows the 35B-A3B at 38 tok/s on Tesla V100 vs the older 30B-A3B at 59 tok/s. CPU cores hit full load during generation, which suggests the DeltaNet architecture needs further CUDA optimization. This should improve in coming builds.
GGUF vision loading is broken (issue #19857, fails on vision projector weights). If you need multimodal now, use the HuggingFace weights with vLLM or SGLang.
The thinking model crashes llama-cli in some configurations (issue #19869). Workaround: --chat-template-kwargs '{"enable_thinking": false}'.
Some users report repetition/looping. Fix it with --presence-penalty 1.5 (or up to 2.0).
Qwen 3.5 vs Qwen 3: what changed
| | Qwen 3 | Qwen 3.5 |
|---|---|---|
| Dense model | 32B | 27B (denser, better benchmarks) |
| Small MoE | 30B-A3B | 35B-A3B (5B more total params) |
| Medium MoE | – | 122B-A10B (new tier) |
| Large MoE | 235B-A22B | 397B-A17B |
| Architecture | Standard attention | Hybrid DeltaNet + attention (3:1) |
| Multimodal | Separate VL models | Native in all models |
| Context | 128K | 262K (1M via YaRN) |
| FP8 weights | Community only | Official |
| Vocabulary | 152K tokens | 250K tokens |
The 35B-A3B beats the previous flagship Qwen3-235B on language, vision, and agent benchmarks despite being about 7x smaller in total parameters. The architectural shift to DeltaNet is the reason: it scales better with context length and lets the MoE models pack more capability per active parameter.
The bottom line
As of April 2026, the pick depends on your tooling.
If you can use llama.cpp, LM Studio, vLLM, or SGLang: run Qwen 3.6-35B-A3B at UD-Q4_K_M on a 24GB GPU, or UD-Q3_K_M on a 16GB card. Stronger benchmarks than 3.5 across SWE-bench, Terminal-Bench, MCPMark, and GPQA. Best default for agentic coding and tool use.
If you’re locked into Ollama: stay on Qwen 3.5-35B-A3B until the mmproj/vision loading issue gets resolved. It’s still a strong model. 3.6 support will land in Ollama soon.
If your primary task is sustained coding work on a 24GB GPU: the Qwen 3.5-27B dense model is still the deeper-per-token choice. 3.6 doesn’t yet ship a 27B dense variant.
If you have 48GB+ unified memory: the Qwen 3.5-122B-A10B remains the best open-weight option for heavy tool-use workflows. Nothing in the 3.6 lineup replaces it yet.
Try it:

```bash
# Qwen 3.6 via llama.cpp
./llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M --jinja -ngl 99

# Qwen 3.5 via Ollama (if 3.6 tooling isn't ready for you)
ollama run qwen3.5:35b
```