
Alibaba dropped three Qwen 3.5 models on February 24, 2026, and the local AI community lost its mind. A 35B model that runs at 44 tok/s on a $450 GPU. A 27B dense model that matches DeepSeek-V3.2 on reasoning. A 122B MoE that beats GPT-5 mini on tool use by 30%. All Apache 2.0. All runnable on hardware you can buy today.

This guide covers all three models, what hardware they need, which quantization to pick, and the gotchas nobody tells you about until you’re staring at garbage output.


The three models at a glance

| Model | Total params | Active per token | Architecture | The pitch |
|---|---|---|---|---|
| 35B-A3B | 35B | 3B | MoE (256 experts, 9 active) | Fast. Fits 16GB VRAM. |
| 27B | 27B | 27B | Dense (all params active) | Smartest model under 230B. Single GPU at Q4. |
| 122B-A10B | 122B | 10B | MoE (256 experts, 9 active) | Multi-GPU only. Beats GPT-5 mini on tool use. |

All three share the same architecture innovation: Gated Delta Networks. Three out of every four layers use linear attention (O(n) scaling), with every fourth layer using full quadratic attention. The result is lower KV cache memory and faster decoding than standard transformers. They also share a 262K native context window (extendable to 1M via YaRN), native multimodal (text + image + video), and a 248K vocabulary covering 201 languages.
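The 3:1 pattern is easy to picture in code. A toy sketch of the layer schedule — the layer count here is illustrative, not the real config, and the memory argument assumes only the full-attention layers carry a growing KV cache, which is what the lower-cache claim implies:

```python
# Illustrative sketch of a 3:1 hybrid attention schedule.
# The 48-layer count is hypothetical; real configs differ per model.
def layer_schedule(n_layers: int, full_every: int = 4) -> list[str]:
    """Every `full_every`-th layer uses full quadratic attention;
    the rest use linear (Gated DeltaNet-style) attention."""
    return [
        "full" if (i + 1) % full_every == 0 else "linear"
        for i in range(n_layers)
    ]

sched = layer_schedule(48)
print(sched[:8])            # ['linear', 'linear', 'linear', 'full', ...]
# Only the full-attention layers keep a KV cache that grows with context,
# so cache memory scales with n_layers / 4 rather than n_layers.
print(sched.count("full"))  # 12 of 48 layers
```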

There’s also a 397B-A17B flagship from the initial release, but that needs 192GB+ memory and is a different conversation.


Qwen 3.5-35B-A3B: the star of the show

This is the model that changed the math for 16GB GPU owners.

35 billion total parameters across 256 experts, but only 3 billion activate per token (8 routed + 1 shared). You’re loading a 35B model into memory but computing through a 3B model on every forward pass. The speed reflects that: on an RTX 5060 Ti with 16GB VRAM, llama-bench measured 44.3 tok/s at 100K context.

That’s not a cherry-picked number at short context. That’s a hundred thousand tokens of conversation history, generating at speeds that feel instant.

VRAM requirements: 35B-A3B

| Quantization | File size | Memory needed | Fits on |
|---|---|---|---|
| IQ3_XS | 14.5 GB | ~16 GB | RTX 4060 Ti 16GB (tight) |
| Q3_K_M | 16.1 GB | ~18 GB | 16GB GPU + some offload |
| Q4_K_M | 21.2 GB | ~22 GB | RTX 3090, 4090, 5090 |
| Q5_K_M | 24.8 GB | ~26 GB | RTX 3090 with limited context |
| Q6_K | 28.7 GB | ~30 GB | RTX 4090/5090 |
| Q8_0 | 36.9 GB | ~38 GB | 48GB+ GPU or Mac 48GB+ |
| BF16 | 69.4 GB | ~70 GB | Multi-GPU or Mac 96GB+ |

The GQA design (only 2 KV heads) keeps the KV cache tiny. Even at 262K context with Q4 weights, total VRAM stays around 25 GB. That’s the whole model plus full context fitting on a single RTX 3090.
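That claim sanity-checks with back-of-envelope math. A rough KV cache estimator — the 2 KV heads come from the model design described above, while the 128-dim heads and 12 full-attention layers are illustrative assumptions:

```python
def kv_cache_gb(ctx_tokens: int, kv_heads: int = 2, head_dim: int = 128,
                kv_layers: int = 12, bytes_per_elem: float = 2.0) -> float:
    """K and V tensors, per full-attention layer, per KV head, per token."""
    total_bytes = 2 * kv_layers * kv_heads * head_dim * ctx_tokens * bytes_per_elem
    return total_bytes / 1024**3

print(round(kv_cache_gb(262_144), 1))                       # FP16 cache at 262K
print(round(kv_cache_gb(262_144, bytes_per_elem=1.0), 1))   # halved at Q8_0
```

Under these assumptions, ~3 GB of FP16 cache at 262K context on top of 21.2 GB of Q4 weights lands right around the 25 GB figure above.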

Speed benchmarks: 35B-A3B

| Hardware | Quant | Context | Generation (tok/s) | Prompt (tok/s) |
|---|---|---|---|---|
| RTX 5090 32GB | Q4_K_XL | 512 | 194 | 7,026 |
| RTX 5090 32GB | Q4_K | 262K | 97.3 | 2,003 |
| RTX 5080 16GB | Q4_K_M | 4K | ~75 | — |
| RTX 5060 Ti 16GB | Q4 | 100K | 44.3 | — |
| RTX 3090 24GB | Q4_K | 4K | 111.2 | 2,622 |
| RTX 3090 24GB | Q4_K | 131K | 79.4 | 1,288 |
| AMD R9700 (Vulkan) | Q4_K_XL | 512 | 127.4 | 2,713 |
| Strix Halo | Q8 | 4K | 38.5 | 960 |
| Tesla V100 32GB | Q5_K_XL | 128 | 38.4 | 570 |
| Mac M4 Max 64GB | Q4 (MLX) | — | ~70 | — |
| CPU (DDR5) | Q4 | — | ~5-6 | — |

The RTX 3090 at 111 tok/s is worth lingering on. That’s a $700-800 used card doing over a hundred tokens per second with a model that scores 69.2 on SWE-bench Verified. For context, Qwen3-32B (the previous generation dense model) only hit around 30-35 tok/s on the same hardware because all 32B parameters activated every token.

MoE is real.
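The speedup is mostly memory-bandwidth arithmetic: decode speed is bounded by how many weight bytes each token must read. A rough ceiling calculator — the 936 GB/s figure is the RTX 3090 spec, and ~0.6 bytes/weight approximates Q4_K_M:

```python
def decode_ceiling_toks(active_params_b: float, bytes_per_weight: float,
                        bandwidth_gbs: float) -> float:
    """Upper bound on tok/s if decoding were purely weight-read bound."""
    gb_read_per_token = active_params_b * bytes_per_weight
    return bandwidth_gbs / gb_read_per_token

BW_3090 = 936    # GB/s, RTX 3090 memory bandwidth spec
Q4_BYTES = 0.6   # ~4.85 bits per weight for Q4_K_M

print(round(decode_ceiling_toks(3, Q4_BYTES, BW_3090)))   # 35B-A3B: 3B active
print(round(decode_ceiling_toks(27, Q4_BYTES, BW_3090)))  # 27B dense
```

The dense 27B's measured 33.5 tok/s sits at roughly 60% of its ~58 tok/s ceiling, while the MoE's 111 tok/s is far below its ~520 tok/s ceiling — which suggests routing, attention, and cache traffic, not weight reads, become the bottleneck once the active parameter count shrinks.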

For Mac users, there’s a dedicated MLX vs Ollama comparison with speed benchmarks across every Apple Silicon tier.


Qwen 3.5-27B: the smartest model you can fit on one card

The 27B is the only dense model in the Qwen 3.5 family. All 27 billion parameters fire on every forward pass. No expert routing, no MoE overhead, no activation-ratio tricks. It is slower per token than the 35B-A3B, but the per-token quality is higher because every parameter contributes to every prediction.

How much higher? Artificial Analysis gave it an Intelligence Index score of 42 out of 51, ranking it #1 among all open-weight models in the 4B-40B class. That matches DeepSeek-V3.2 on reasoning.

How 27B compares to 35B-A3B

| Benchmark | 27B dense | 35B-A3B MoE | Winner |
|---|---|---|---|
| GPQA Diamond | 85.5 | 84.2 | 27B |
| MMLU-Pro | 86.1 | 85.3 | 27B |
| LiveCodeBench v6 | 80.7 | 74.6 | 27B |
| SWE-bench Verified | 72.4 | 69.2 | 27B |
| IFEval | 95.0 | 91.9 | 27B |
| CodeForces | 1,899 | 2,028 | 35B |
| BFCL-V4 (tool use) | 68.5 | 67.3 | 27B |

The 27B wins on every reasoning and coding benchmark except competitive programming (CodeForces). The 35B-A3B wins on speed by 3-5x. Pick based on your bottleneck: if you’re waiting on the model to think harder, use the 27B. If you’re waiting on the model to type faster, use the 35B-A3B.

VRAM requirements: 27B dense

| Quantization | File size | Memory needed | Fits on |
|---|---|---|---|
| IQ4_XS | 14.7 GB | ~16 GB | RTX 4060 Ti 16GB (best fit) |
| Q4_K_S | 15.6 GB | ~17 GB | 16GB GPU with room for short context |
| Q4_K_M | 16.5 GB | ~18 GB | 16GB GPU (tight) or 24GB GPU |
| Q5_K_M | 19.4 GB | ~21 GB | RTX 3090 24GB |
| Q6_K | 22.7 GB | ~24 GB | RTX 3090 24GB (limited context) |
| Q8_0 | 28.6 GB | ~30 GB | RTX 4090/5090 or Mac 32GB+ |
| BF16 | 53.8 GB | ~54 GB | Multi-GPU or Mac 64GB+ |

16GB warning: Q4_K_M at 16.5 GB technically fits a 16GB card, but there’s almost no room for KV cache. Practical context will be limited to 4K-8K tokens before you OOM. For 16GB GPUs, use IQ4_XS (14.7 GB) or Q4_K_S (15.6 GB) and accept shorter context. On a 24GB card, Q4_K_M runs with full 131K context at around 24 GB total.
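Before downloading, you can sanity-check a quant + context combination: file size plus an FP16 KV cache estimate plus runtime overhead. A sketch — the ~12 KB/token cache figure and the 1 GB overhead are illustrative assumptions, not measured values:

```python
def fits(file_gb: float, ctx_tokens: int, vram_gb: float,
         kv_bytes_per_token: int = 12_288, overhead_gb: float = 1.0) -> bool:
    """True if weights + FP16 KV cache + runtime overhead fit in VRAM."""
    kv_gb = ctx_tokens * kv_bytes_per_token / 1024**3
    return file_gb + kv_gb + overhead_gb <= vram_gb

# 27B Q4_K_M (16.5 GB) on a 16GB card: no room even at 8K context
print(fits(16.5, 8_192, 16.0))   # False
# IQ4_XS (14.7 GB) at 8K context: fits, barely
print(fits(14.7, 8_192, 16.0))   # True
```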

Speed benchmarks: 27B dense

| Hardware | Quant | Context | Generation (tok/s) | Prompt (tok/s) |
|---|---|---|---|---|
| RTX 3090 24GB | Q4_K | 4K | 33.5 | 1,104 |
| RTX 3090 24GB | Q4_K | 86K | 27.5 | 599 |

About 3x slower than the 35B-A3B on the same GPU. This is the MoE vs dense tradeoff in one table: the 35B-A3B hits 111 tok/s where the 27B hits 33.5 tok/s on identical hardware. Both produce good output. The 27B produces slightly better output on average, at one-third the speed.

For coding tasks where quality matters more than throughput – debugging a tricky function, writing a complex SQL query, reviewing a PR – the 27B is worth the wait. For chat, brainstorming, and tool-calling agents where you want snappy responses, the 35B-A3B wins.


Qwen 3.5-122B-A10B: multi-GPU territory

The 122B is the model for people who have more hardware than patience. 122 billion total parameters, 10 billion active per token (8 routed + 1 shared from 256 experts). Apache 2.0. It ties GPT-5 mini on SWE-bench (72.0) and beats it on tool use by 30% (BFCL-V4: 72.2 vs 55.5).

VRAM requirements: 122B MoE

| Quantization | File size | Memory needed | Fits on |
|---|---|---|---|
| Q3_K_M | ~59 GB | ~62 GB | Mac 64GB tight, 3x 3090 with offload |
| Q4_K_M | 74.4 GB | ~76 GB | Mac 96GB+, 4x 3090, 2x 4090 |
| Q5_K_M | 87.1 GB | ~90 GB | Mac 128GB, multi-GPU |
| Q6_K | 100.8 GB | ~106 GB | Mac 128GB+ |
| Q8_0 | 129.9 GB | ~132 GB | Mac 192GB or 2x A100 80GB |
| BF16 | 244 GB | ~245 GB | Multi-GPU cluster |

Three RTX 3090s (72GB total) can run the Q3-level quant with full GPU offload, or Q4_K_M with partial system RAM spillover. A 64GB Mac runs the model at Q3 via MLX. A 96GB+ Mac runs Q4_K_M comfortably.

Benchmarks: 122B vs the competition

| Benchmark | 122B-A10B | GPT-5 mini | Claude Sonnet 4.5 |
|---|---|---|---|
| SWE-bench Verified | 72.0 | 72.0 | 62.0 |
| BFCL-V4 (tool use) | 72.2 | 55.5 | — |
| Terminal Bench 2 | 49.4 | 31.9 | — |
| GPQA Diamond | 86.6 | 85.7 | 83.4 |
| MMLU-Pro | 86.7 | — | — |
| CodeForces | 2,100 | — | — |

This is a model that competes with commercial APIs while running on hardware you own. The tool use gap (72.2 vs 55.5) is why agent builders are paying attention.

For local inference speed, the DGX Spark (128GB unified, NVFP4) benchmarked at 8-15 tok/s depending on mode. Consumer multi-GPU numbers are still sparse since the model is four days old.


Which model should you run?

| Your hardware | Model | Quant | What you get |
|---|---|---|---|
| RTX 3060 12GB | 35B-A3B | IQ3_XS | Fits, decent quality, ~25-30 tok/s |
| RTX 4060 Ti 16GB | 35B-A3B | Q4_K_M | Sweet spot. 44 tok/s at 100K context |
| RTX 4060 Ti 16GB | 27B dense | IQ4_XS | Smarter, slower, short context |
| RTX 3090 24GB | 35B-A3B | Q4_K_M | 111 tok/s. Full 262K context. |
| RTX 3090 24GB | 27B dense | Q5_K_M | Best reasoning quality on one card |
| RTX 4090/5090 | 35B-A3B | Q6_K-Q8 | Overkill speed, high quality |
| 2-3x 3090/4090 | 122B-A10B | Q3-Q4 | API-competitive output, your hardware |
| Mac 16GB | 35B-A3B | Q4 (MLX) | Tight but works. MLX route preferred. |
| Mac 32-48GB | 27B or 35B | Q4-Q8 | Both comfortable. Pick by use case. |
| Mac 64GB+ | 122B-A10B | Q4 (MLX) | Runs Q4 comfortably |
| CPU only (DDR5) | 35B-A3B | Q4 | ~5-6 tok/s. Usable for async tasks. |

If you own one GPU and want one model, get the 35B-A3B at Q4_K_M. It fits everywhere from 16GB to 32GB, runs fast, and the quality gap vs the 27B is small outside of hard reasoning tasks.


The overthinking problem

Qwen 3.5 defaults to thinking mode. Every response starts with a <think>...</think> block where the model reasons through the problem before answering. For complex math or multi-step coding, this helps. For “what time zone is Tokyo in?”, the model burns hundreds of tokens second-guessing itself before stating the obvious.

Unlike Qwen 3, Qwen 3.5 does not support /think and /nothink commands in prompts. The model cards say this explicitly. You cannot flip thinking on and off per-message the way you could with Qwen 3.

How to disable thinking

llama.cpp:

```bash
llama-server -m qwen3.5-35b-a3b-q4_k_m.gguf \
  --chat-template-kwargs '{"enable_thinking": false}' \
  -c 131072 -ngl 99
```

Ollama: Thinking is handled at the template level. As of v0.17.4, there’s no clean per-request toggle. Workaround: create a custom Modelfile with a system prompt that says “Do not use chain-of-thought reasoning. Answer directly.” This reduces (but doesn’t eliminate) thinking tokens.
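If your runtime offers no clean toggle, stripping the block in post-processing is a workable fallback — it hides the reasoning but does not save the generation time. A minimal sketch, assuming Qwen-style `<think>...</think>` markers:

```python
import re

def strip_thinking(text: str) -> str:
    """Remove a <think>...</think> block from model output."""
    # DOTALL lets the reasoning span multiple lines; the second pass
    # handles an unclosed <think> left by a truncated generation.
    cleaned = re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL)
    cleaned = re.sub(r"<think>.*\Z", "", cleaned, flags=re.DOTALL)
    return cleaned.strip()

raw = "<think>User asks about Tokyo...\nJST is UTC+9.</think>\nTokyo is in JST (UTC+9)."
print(strip_thinking(raw))   # Tokyo is in JST (UTC+9).
```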

vLLM / OpenAI-compatible API:

```python
from openai import OpenAI

# Point the client at your local vLLM server ("EMPTY" is vLLM's
# conventional placeholder API key).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3.5-35B-A3B",
    messages=[{"role": "user", "content": "What time zone is Tokyo in?"}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}}
)
```

Token budget approach: If you want thinking but capped, the thinking_budget API parameter limits how many reasoning tokens the model generates before it’s forced to answer. Set it to 200-500 tokens for most tasks. NVIDIA NIM implements this as nvext.max_thinking_tokens.


The quant situation

Q4_K_M is the sweet spot

For most setups, Q4_K_M is the right quant. It balances file size, quality, and speed. The quality gap between Q4_K_M and Q8_0 exists but is small for general use.

Unsloth GGUF bug (now fixed)

Unsloth’s initial “Dynamic 2.0” GGUF quants (UD-Q2_K_XL, UD-Q3_K_XL, UD-Q4_K_XL) had a bug: they applied MXFP4 precision to attention tensors where it caused quality degradation. Symptoms included garbled output and repetition loops, especially on the 122B model.

Fixed on February 27, 2026. Unsloth retired MXFP4 from all XL quant recipes. If you downloaded Qwen 3.5 GGUFs before that date, re-download them. The standard Q4_K_M quants from bartowski or lmstudio-community were never affected.

KV cache at Q8: free VRAM

Quantizing the KV cache from FP16 to Q8_0 halves the VRAM used for context with essentially no quality loss. Perplexity increase is 0.0043. That’s nothing.

llama.cpp:

```bash
llama-server -m model.gguf -ctk q8_0 -ctv q8_0 -c 131072 -ngl 99
```

Ollama:

```bash
OLLAMA_KV_CACHE_TYPE=q8_0 OLLAMA_FLASH_ATTENTION=1 ollama serve
```

Flash Attention must be enabled for KV cache quantization. On the 35B-A3B, this saves roughly 2-3 GB at long context – enough to squeeze in an extra 50K tokens of conversation history on a 24GB card.


llama.cpp --fit: stop guessing layer counts

llama.cpp added automatic VRAM fitting in PR #16653 (merged December 2025). The --fit flag is on by default in recent builds. It does virtual test allocations and iteratively adjusts layer offloading until the model maximizes your GPU utilization.

For MoE models specifically, --fit prioritizes keeping dense layers (attention, embedding) in VRAM and spills sparse expert weights to system RAM first. This is the right tradeoff: expert routing means only ~3% of expert weights activate per token, so the RAM latency penalty is small for those tensors.
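That priority order can be sketched as a greedy fill — a deliberate simplification, since the real implementation uses virtual test allocations rather than a static size table, and the tensor names and sizes below are made up:

```python
# Greedy sketch of MoE-aware VRAM fitting: dense tensors are placed first,
# sparse expert tensors spill to system RAM. Sizes in GB are illustrative.
tensors = [
    ("embedding", 1.2, "dense"),
    ("attention", 3.5, "dense"),
    ("experts",  16.5, "sparse"),
]

def place(tensors, vram_gb):
    placement, used = {}, 0.0
    # Dense tensors are read every token; experts are only ~3% active,
    # so they pay the smallest penalty for living in system RAM.
    for name, size, kind in sorted(tensors, key=lambda t: t[2] == "sparse"):
        if used + size <= vram_gb:
            placement[name], used = "vram", used + size
        else:
            placement[name] = "ram"
    return placement

print(place(tensors, vram_gb=8.0))
# {'embedding': 'vram', 'attention': 'vram', 'experts': 'ram'}
```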

If you’re manually setting --n-gpu-layers, you’re disabling --fit. Unless you have a specific reason, let the automation handle it.


Ollama tool calling: fixed, mostly

v0.17.3 (February 27, 2026): Fixed parsing of tool calls emitted during thinking mode. Before this, if Qwen 3.5 decided to call a tool while still inside a <think> block, the tool call was silently dropped.

v0.17.4 (February 27, 2026): Added stable indices for parallel tool calls. Also added official Qwen 3.5 model tags to the Ollama library.

Still broken: The renderer side of multi-turn tool calling has issues – prompts sent back to the model can contain unclosed <think> tags, corrupting subsequent turns. Penalty sampling (repeat_penalty, presence_penalty) is also silently ignored by the Go runner, which matters because Qwen 3.5’s official sampling parameters include presence_penalty=1.5 to prevent repetition loops.

If you’re building agents with Qwen 3.5 tool calling, test with llama.cpp or vLLM first. Ollama works for basic tool use but gets unreliable in multi-turn agentic loops.
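When you do serve via llama.cpp or vLLM, pass the penalty explicitly in each request body. A minimal sketch of an OpenAI-compatible payload — temperature and top_p here are illustrative; only presence_penalty=1.5 comes from the official sampling parameters:

```python
# Sampling config following Qwen 3.5's documented anti-repetition setting.
sampling = {
    "temperature": 0.7,        # illustrative, not an official default
    "top_p": 0.95,             # illustrative, not an official default
    "presence_penalty": 1.5,   # the documented fix for repetition loops
}

# Merge into an OpenAI-compatible chat completion request body.
request = {
    "model": "qwen3.5-35b-a3b",
    "messages": [{"role": "user", "content": "Run the tests and summarize."}],
    **sampling,
}
print(request["presence_penalty"])   # 1.5
```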


How to run it

Ollama (simplest)

```bash
# 35B-A3B (recommended for most users)
ollama pull qwen3.5:35b-a3b
ollama run qwen3.5:35b-a3b

# 27B dense
ollama pull qwen3.5:27b
ollama run qwen3.5:27b

# 122B MoE (needs ~76GB)
ollama pull qwen3.5:122b
ollama run qwen3.5:122b
```

llama.cpp (more control)

```bash
# Download from HuggingFace
huggingface-cli download bartowski/Qwen_Qwen3.5-35B-A3B-GGUF \
  --include "*Q4_K_M*" \
  --local-dir ./qwen35

# Run with KV cache quantization + auto-fit
llama-server \
  -m ./qwen35/Qwen3.5-35B-A3B-Q4_K_M.gguf \
  -c 131072 \
  -ctk q8_0 -ctv q8_0 \
  -ngl 99
```

Mac (MLX for speed, Ollama for ecosystem)

MLX runs the 35B-A3B roughly 2x faster than Ollama on Apple Silicon. See the MLX vs Ollama benchmark guide for setup instructions and speed numbers across every chip tier.

```bash
# MLX route (fastest on Mac)
pip install mlx-lm
mlx_lm.generate --model mlx-community/Qwen3.5-35B-A3B-4bit \
  --prompt "Explain MoE architecture" --max-tokens 500
```

Known issues (February 2026)

Repetition in thinking mode: Without presence_penalty=1.5, the model tends to loop during extended reasoning. Set this in your inference config. Ollama silently ignores penalty parameters as of v0.17.4 – use llama.cpp or vLLM if you need reliable penalty sampling.

Vision model assertion in llama.cpp: An assertion in the multimodal path prevents KV cache reuse, forcing full prompt reprocessing on every turn. If you're running text-only, don't pass --mmproj. A fix PR exists but hasn't merged.

27B CUDA eval bug: Issue #19860 reports CUDA errors during evaluation. Make sure you’re on the latest llama.cpp build.

Qwen 3.5 is NOT Qwen 3. The model architecture is different (Gated DeltaNet vs standard transformer). Chat templates differ. The /think and /nothink toggles from Qwen 3 do not work. Don't assume your Qwen 3 configs transfer cleanly.


Bottom line

The 35B-A3B is the model that matters for most local AI builders. It runs on a $450 GPU at speeds that feel like a cloud API, scores within 5% of the 27B dense on most benchmarks, and fits more context than you’ll probably use. The 27B is the choice when quality per token matters more than throughput. The 122B is for multi-GPU setups that want output matching commercial APIs without the monthly bill.

Q4_K_M, KV cache at Q8, and let --fit handle the layer allocation. That’s the setup. The rest is just choosing your model.