📚 More on this topic: Qwen 3.5 397B Guide · Qwen 3.5 9B Setup Guide · Qwen 3.5 Mac: MLX vs Ollama · Qwen Models Guide · VRAM Requirements · Running LLMs on Mac

Qwen 3.5 shipped four model sizes. The 397B flagship gets the headlines, but it needs 192GB+ of memory. Most people don’t have that.

The three Qwen 3.5 models that run on consumer hardware: 27B dense, 35B-A3B MoE, and 122B-A10B MoE. Same architecture — hybrid attention, 262K native context, built-in vision, Apache 2.0. The difference is how much memory they need and how fast they generate tokens.

Updated April 2026: Qwen 3.6-35B-A3B has shipped. Same MoE shape as 3.5-35B-A3B — 35B total, 3B active, 256 experts — but the agentic coding scores jumped and the llama.cpp throughput regressed. Full comparison in the “Qwen 3.6 dropped” section below. If you’re on a 24GB card and running agents, skip there first.


Qwen 3.6 dropped — and here’s what it means

Qwen 3.6-35B-A3B landed in April 2026. On paper it’s the same architecture as 3.5-35B-A3B: 35 billion total parameters, 3 billion active per token, 256 experts (8 routed + 1 shared per token), 40 layers, 262K native context, multimodal, Apache 2.0.

What actually changed, from the Qwen team’s own numbers and the Unsloth model card:

  • Agentic coding got a real jump. SWE-bench Verified moved from 69.2 → 73.4. Terminal-Bench 2.0 went from 40.5 → 51.5. QwenWebBench frontend generation went from 978 → 1397. If you run agents, Aider, or OpenClaw-style coding loops, that’s a meaningful step up.
  • Thinking preservation across turns. 3.6 was trained to keep <think> traces in context between messages. More consistent multi-turn reasoning and better KV cache reuse in agent workflows. The /think and /nothink shortcut switches from 3.5 are gone — you toggle reasoning through chat_template_kwargs now (see the thinking mode section below).
  • Ollama doesn’t support 3.6 yet. As of April 2026, the only practical path to run 3.6-35B-A3B locally is llama.cpp or a downstream like LM Studio. If Ollama is your whole stack, stick with 3.5 until support lands.
  • ~30% slower than 3.5 in current llama.cpp. An independent 24GB RTX 3090 benchmark puts 3.6-35B-A3B at 101.7 tok/s short-prompt and 80.9 tok/s long-prompt at UD-Q4_K_XL, versus ~142 tok/s for 3.5 on the same card at 65K context. This is a Gated DeltaNet implementation gap, not a fundamental regression — the same shape llama.cpp showed at 3.5 launch before it caught up.
  • Pure coding accuracy isn’t universally better. Community HumanEval+ runs at Q6 have been reported lower on 3.6 than 3.5 — roughly 90% vs 94%. SWE-bench rewards long-horizon agent behavior; HumanEval rewards tight single-shot code. The priorities clearly shifted.
  • CUDA 13.2 quirk. r/LocalLLaMA users report low-bit 3.6 quants producing gibberish on CUDA 13.2 in some configurations. If you see garbage tokens, pin CUDA 13.1 or bump to a newer llama.cpp.

Bottom line for a local user: if you’re running agents, you probably want 3.6. If you want maximum tok/s on well-understood code snippets and Ollama compatibility, 3.5-35B-A3B still wins on throughput for another llama.cpp release or two. Both fit on the same 24GB GPU at Q4.


MoE vs Dense: what changed with Qwen 3.6

Qwen 3.6 only ships as 35B-A3B right now. There is no 3.6-27B dense. So the real fork for a local user is: stay on Qwen 3.5-27B dense, or jump to Qwen 3.6-35B-A3B MoE? That’s a MoE-vs-dense question, not a version bump.

MoE rewards you on speed-per-parameter and breadth of knowledge. A MoE routes each token to a small subset of experts, and different tokens hit different experts. That’s great when the task spans many domains — code, math, languages, reasoning — because the model can specialize without paying full compute per token. The 35B-A3B activates 3B parameters per forward pass, so generation runs at 3B-model speed even though 35B of weights are in VRAM.

Dense rewards you on predictability and strict rule-following. Every parameter works on every token. In some r/LocalLLaMA threads, people running long-horizon agent harnesses with tight global rules have reported that dense models obey system prompts more reliably than MoE — if your tokens bounce between experts, global-rule compliance can drift. Nobody’s published hard numbers, but it’s a pattern enough people have hit to take seriously if you’re building an agent rig.

Memory is the other tradeoff. MoE is worse per effective parameter: you load 35B of weights to get 3B of work per token. On a 16GB card you can’t fit 35B at a usable quant without CPU offload, but 27B dense at Q3 fits cleanly. On a 24GB card the fight flips — MoE wins because 3B-active throughput is a big deal and the quality ceiling is higher. Pick MoE (3.5 or 3.6 35B-A3B) if you have 24GB+ and want speed. Pick 27B dense if you have 16GB, or if your agent hates surprises.


The three models at a glance

|  | 3.5 27B Dense | 3.5 35B-A3B MoE | 3.6 35B-A3B MoE | 3.5 122B-A10B MoE |
| --- | --- | --- | --- | --- |
| Total parameters | ~28B | ~36B | ~36B | ~125B |
| Active per token | 28B (all) | 3B | 3B | 10B |
| Expert count | None (dense) | 256 (8 routed + 1 shared) | 256 (8 routed + 1 shared) | MoE |
| Context window | 262K | 262K | 262K | 262K |
| Vision | Yes | Yes | Yes | Yes |
| License | Apache 2.0 | Apache 2.0 | Apache 2.0 | Apache 2.0 |
| VRAM at Q4 (short context) | ~17 GB | ~22 GB | ~23 GB | ~70 GB |
| Generation speed (RTX 3090, Q4) | ~34 tok/s | ~111 tok/s | ~101 tok/s | Multi-GPU only |
| Ollama tag available | Yes | Yes | Not yet | Yes |

The speed gap between the 27B dense and the 35B MoEs isn’t a typo. MoE models only compute through their active parameters per token. The 35B-A3B activates 3B parameters each forward pass — less than the 27B’s full 28B — so it generates tokens faster despite having more total weight loaded.

The trade-off: MoE models use more memory for their quality level. You’re loading 36B of weights to get 3B of compute. The 27B loads 28B and uses all of it.

3.5 vs 3.6 at the same MoE shape: 3.6 trades about 30% of current llama.cpp throughput for better agentic coding scores and thinking-trace preservation. See the Qwen 3.6 section above for the full comparison.


VRAM requirements

By model and quantization (at 4K context)

| Model | Q8_0 | Q6_K | Q4_K_M | Q3_K_M | Q2_K |
| --- | --- | --- | --- | --- | --- |
| 3.5 27B (dense) | ~30 GB | ~23 GB | ~17 GB | ~14 GB | ~11 GB |
| 3.5 35B-A3B (MoE) | ~38 GB | ~30 GB | ~22 GB | ~18 GB | ~14 GB |
| 3.6 35B-A3B (MoE) | ~38 GB | ~30 GB | ~23 GB | ~17 GB | ~13 GB |
| 3.5 122B-A10B (MoE) | ~130 GB | ~100 GB | ~70 GB | ~57 GB | ~45 GB |

3.6 file sizes per the bartowski GGUF repo: Q4_K_M 21.4 GB, Q5_K_S 24.2 GB, Q6_K 30.1 GB, Q8_0 36.9 GB. Add ~1-2 GB for KV cache overhead at short context; more at 128K+.
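A quick sanity check before committing to a download: compare the GGUF file size on disk against your card’s memory, leaving the ~1-2 GB of headroom mentioned above. A rough Linux/NVIDIA sketch (the path matches the download example in the How to run section; the 2 GB figure is a short-context assumption):

# Rough fit check: GGUF file size plus ~2 GB for KV cache and overhead
MODEL=Qwen3.6-35B/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf
SIZE_GB=$(du -BG "$MODEL" | cut -f1 | tr -d 'G')
echo "Model file: ${SIZE_GB} GB -> budget roughly $((SIZE_GB + 2)) GB at short context"

# Compare against what the card actually has
nvidia-smi --query-gpu=memory.total --format=csv,noheader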

How context length scales VRAM (Q4_K_M)

| Model | 4K ctx | 32K ctx | 128K ctx | 262K ctx |
| --- | --- | --- | --- | --- |
| 27B | ~17 GB | ~19 GB | ~27 GB | ~33 GB |
| 35B-A3B | ~22 GB | ~23 GB | ~24 GB | ~25 GB |
| 122B-A10B | ~70 GB | ~75 GB | ~85 GB | ~95 GB |

Notice how the 35B-A3B barely grows with context. Going from 4K to 262K only adds ~3GB. That’s the hybrid attention system at work — 75% of the layers use Gated DeltaNet (a linear attention variant), which doesn’t store traditional KV pairs. The 27B dense also uses hybrid attention, but its wider attention layers mean a bigger KV cache per full-attention layer.

On a 24GB GPU, the 35B-A3B at Q4 fits for conversations up to roughly 200K tokens. Beyond that, you’ll start hitting the VRAM ceiling. Drop to 128K context if you’re seeing OOM errors — still plenty for most workflows.
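If you do hit that ceiling, the fix is a single flag on the server line. A sketch based on the llama-server invocation in the How to run section (the file name is an assumption carried over from that download example):

# Same launch as the How to run section, context capped at 128K instead of 262K
llama-server \
  -m Qwen3.5-35B/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf \
  -ngl 999 -fa on -c 131072 \
  --cache-type-k q8_0 --cache-type-v q8_0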

Check what fits your setup: VRAM calculator


Speed benchmarks

RTX 3090 (24GB)

| Model | Quant | Prompt (tok/s) | Generation (tok/s) |
| --- | --- | --- | --- |
| 3.5 35B-A3B | Q4_K_M | ~35 | 111 |
| 3.6 35B-A3B | UD-Q4_K_XL | n/a | ~101 (short) / ~81 (long) |
| 3.6 35B-A3B | UD-Q3_K_M | n/a | ~120 (community report) |
| 3.6 35B-A3B | Q5_K_XL | n/a | ~75 @ 10K ctx / ~65 @ 120K ctx |
| 3.5 27B | Q4_K_M | ~25 | 34 |

3.6 numbers from aminrj’s 24GB llama.cpp benchmark post and community reports on the Qwen 3.6 HF discussions. Exact tok/s depends on context size, KV cache quant (bf16 vs q8_0), and whether flash attention is on.
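Your numbers will differ, so it is worth measuring directly. llama.cpp ships llama-bench for exactly this; a minimal sketch with illustrative prompt and generation lengths, using the GGUF from the How to run section:

# Measure prompt processing (-p) and generation (-n) throughput yourself,
# flash attention on, all layers on the GPU
llama-bench \
  -m Qwen3.6-35B/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
  -ngl 999 -fa 1 \
  -p 512,8192 -n 128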

RTX 5090 (32GB)

| Model | Quant | Generation (tok/s) |
| --- | --- | --- |
| 3.5 35B-A3B | Q4_K_M | 165 |
| 3.6 35B-A3B | Q4_K_M | No widely-posted community numbers yet; expect 3.5 minus the ~30% llama.cpp gap |
| 3.5 27B | Q4_K_M | ~50 |

Apple Silicon (estimated from memory bandwidth)

| Model | Quant | Generation (tok/s) | Hardware |
| --- | --- | --- | --- |
| 35B-A3B | Q4_K_M | ~40-50 | M4 Max 128GB (MLX) |
| 27B | Q4_K_M | ~25-30 | M4 Max 128GB (MLX) |
| 122B-A10B | Q4_K_M | ~10-15 | M4 Max 128GB (MLX) |

The 35B MoE at 111 tok/s on an RTX 3090 is faster than most 7B models on the same card. MoE with a 3B active count makes that possible. Mac users can squeeze even more speed out of the 35B with the MLX backend — see our MLX vs Ollama Qwen 3.5 benchmarks for the full comparison.
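If you want to drive MLX directly instead of going through LM Studio, the mlx-lm package has a one-line generate command. A sketch only: the model repo name below is an assumption, so check mlx-community on Hugging Face for the actual conversion.

# MLX path on Apple Silicon (model repo name is an assumption)
pip install mlx-lm
mlx_lm.generate \
  --model mlx-community/Qwen3.5-35B-A3B-4bit \
  --prompt "Explain dense vs MoE models in two sentences." \
  --max-tokens 200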

One caveat: early llama.cpp builds show the 35B-A3B running ~35% slower than its predecessor, the Qwen3-30B-A3B, on CUDA. This appears to be an implementation issue with the new Gated DeltaNet layers, not a fundamental regression. Expect this to improve as llama.cpp adds optimizations. On the MLX backend (Mac), this issue doesn’t apply. Qwen 3.6 inherits the same Gated DeltaNet and shows the same ~30% llama.cpp gap vs 3.5 today — same root cause, same expected trajectory as patches land.


35B-A3B: the model most people should run

Three billion active parameters per token means generation runs at roughly 3B speeds. But the model draws from 36B total parameters across 256 experts, giving it a much larger knowledge base. Each token routes to 8 of the 256 experts plus 1 shared expert. Different tokens hit different combinations, so the model uses all its capacity across a conversation — just not all at once per token.

Benchmark scores (thinking mode on)

| Benchmark | Score | What it tests |
| --- | --- | --- |
| MMLU-Pro | 85.3 | Broad knowledge |
| GPQA Diamond | 84.2 | Graduate-level science |
| SWE-bench Verified | 69.2 | Real-world software engineering |
| AIME 2025 | 78.0 | Math competition |
| LiveCodeBench v6 | 78.1 | Coding |

MMLU-Pro 85.3 and GPQA Diamond 84.2 are scores you’d expect from a much larger dense model. Getting them at 111 tok/s on a consumer GPU is hard to ignore.

Who should pick it

Anyone with a 24GB GPU. The Q4 quantization fits with room for long conversations. If you have an RTX 3090, 4090, or 5090, this is the Qwen 3.5 model to run.

On Mac, it works well on any machine with 48GB+ unified memory. The speed benefit of MoE applies on Apple Silicon too, and the extra memory means you can push context higher.

Where MoE doesn’t help

MoE models load more weight per quality level than dense models. The 35B-A3B at Q4 uses ~22GB to get what is effectively 3B of compute per token. A dense 27B at Q4 uses 17GB and all 28B of parameters work on every token.

If your tasks are simple — summarization, translation, short Q&A — the 27B may give comparable quality at lower memory cost. MoE pulls ahead when you need the depth of hundreds of specialized experts: complex reasoning, diverse coding, multilingual work.


27B dense: when less memory is what you have

The 27B is the straightforward option. All parameters active, all the time. No expert routing, no inactive weights sitting in VRAM.

Pick it when:

  • You have 16GB VRAM. The 35B MoE doesn’t fit at Q4. The 27B fits at Q3 (~14GB) with room for moderate context.
  • You want maximum quality per byte loaded. Every gigabyte of the 27B works on every token. No inactive experts.
  • You run long contexts regularly. Starting from a lower memory base (17GB vs 22GB) gives you more room for KV cache growth.
  • Predictable latency matters. Dense models have consistent per-token timing. MoE can vary slightly depending on which experts activate.

The 27B won’t match the 35B-A3B’s benchmark scores. More total parameters — even when most are inactive per token — does translate to broader capability. But for everyday local inference on constrained hardware, the 27B does the job.


122B-A10B: when you want more quality

The 122B-A10B sits between consumer models and the 397B flagship. Ten billion active parameters per token, 125B total, ~70GB at Q4.

It runs on:

  • Mac Studio M4 Max 128GB — fits at Q4 with ~58GB free for the OS and apps
  • Mac Studio M3 Ultra 256GB — fits at Q8 for higher quality
  • 2x RTX 3090 (48GB combined) — fits at Q3 with tensor parallelism
  • 1x H100 80GB — fits at Q4 in VRAM

Ten billion active parameters (vs 3B for the 35B model) means more compute per token and higher quality on hard tasks — complex coding, detailed document analysis, long-form reasoning. If you have the memory and you’re pushing into quality-sensitive work, the 122B is the upgrade from the 35B.

Most people don’t need it. The 35B-A3B covers the majority of local inference use cases at a fraction of the memory cost.


What’s new in Qwen 3.5 (shared across all sizes)

Hybrid attention

Three quarters of the layers use Gated DeltaNet (a linear attention mechanism) and one quarter use standard full attention. This is why the KV cache barely grows with context length and decoding is 8-19x faster than Qwen 3.

262K native context

Base training covers 262K tokens. YaRN scaling extends it to 1M, though quality degrades beyond the native window.
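In llama.cpp, the YaRN extension is exposed as rope-scaling flags. A hedged sketch (the scale factor of 4 is simply 1M divided by the 262K native window, not an official recommendation):

# Stretch beyond the 262K native window with YaRN; expect quality to degrade
llama-server -m model.gguf -ngl 999 -fa on \
  --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 262144 \
  -c 1000000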

Built-in vision

All Qwen 3.5 models are multimodal from training, not language models with a vision encoder bolted on afterward. For local inference with GGUF files, you still need the mmproj file alongside the model — same two-file setup as Qwen2.5-VL.
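In practice that means handing llama.cpp both files. A sketch using the multimodal CLI (file names here are assumptions; use whatever the quant repo you download from actually ships):

# Vision needs the main GGUF plus the matching mmproj projector file
llama-mtmd-cli \
  -m Qwen3.5-35B-A3B-Q4_K_M.gguf \
  --mmproj mmproj-Qwen3.5-35B-A3B-F16.gguf \
  --image invoice.png \
  -p "What is the total on this invoice?"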

Thinking mode on by default

Every Qwen 3.5 model generates chain-of-thought reasoning before answering. Better quality on hard tasks, but costs extra tokens and time. More on how to toggle it below.


Thinking mode: on or off

Qwen 3.5 generates internal reasoning tokens before every response. On by default. Costs extra context and time, but improves accuracy on complex problems.

Keep it on for:

  • Math, logic, coding
  • Multi-step reasoning
  • Tasks where correctness matters more than speed

Turn it off for:

  • Simple Q&A, translation, summarization
  • Chat where latency matters
  • Tasks where extra reasoning doesn’t change the answer

How to toggle

Ollama:

/set parameter enable_thinking false

llama.cpp:

llama-cli -m model.gguf \
  --chat-template-kwargs '{"enable_thinking": false}'

LM Studio: Look for a thinking mode toggle in model settings. If it’s not exposed, set enable_thinking to false in the inference parameters.

API calls (OpenAI-compatible):

{
  "messages": [...],
  "chat_template_kwargs": {"enable_thinking": false}
}
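Against a local llama-server, that body goes straight to the OpenAI-compatible endpoint. A sketch assuming the server’s default port:

# Disable thinking for a single request via chat_template_kwargs
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Summarize: the cache is full"}],
    "chat_template_kwargs": {"enable_thinking": false}
  }'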

With thinking off, the model responds faster but scores lower on reasoning benchmarks. With thinking on, responses include hidden reasoning tokens (2-5x longer internally) but measurably better results on hard problems.


How to run

Ollama (Qwen 3.5 only)

# 35B MoE (recommended for 24GB GPUs)
ollama run qwen3.5:35b-a3b

# 27B dense (for tighter cards)
ollama run qwen3.5:27b

# 122B MoE (needs 70GB+)
ollama run qwen3.5:122b-a10b

Ollama selects a quantization automatically based on your available memory. Tag names may vary — check the Ollama library for current tags.

Ollama does not yet support Qwen 3.6-35B-A3B. Use llama.cpp or LM Studio (below) until that lands.

llama.cpp

Download GGUF files from Unsloth (recommended quants), bartowski, or lmstudio-community:

# Qwen 3.5-35B-A3B at Q4
huggingface-cli download unsloth/Qwen3.5-35B-A3B-GGUF \
  --local-dir Qwen3.5-35B \
  --include "*UD-Q4_K_XL*"

# Qwen 3.6-35B-A3B at Q4 (the current recommended pick for 24GB)
huggingface-cli download unsloth/Qwen3.6-35B-A3B-GGUF \
  --local-dir Qwen3.6-35B \
  --include "*UD-Q4_K_XL*"

# Run Qwen 3.6 with llama-server
llama-server \
  -m Qwen3.6-35B/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
  -ngl 999 -fa on -c 65536 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --jinja --reasoning-format deepseek

Use -ngl 999 to offload all layers to GPU. With MoE models, another option is to omit -ngl entirely and let llama.cpp auto-tune the expert-tensor placement based on --ctx-size — this often outperforms a blanket full offload when your GGUF is close to your VRAM ceiling. Worth testing both on your card.

If the GGUF is bigger than your VRAM, llama.cpp will shard the MoE experts across CPU RAM and disk. This is how 8GB + 64GB RAM builds run 3.6-35B-A3B at all — community reports of this working exist, but exact tok/s depend heavily on RAM speed and the specific experts your prompt routes to. Expect single-digit to low-teen tok/s at best, not the RTX 3090 numbers.
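One way to set that up explicitly rather than leaving it to the auto-tuner: keep attention and shared weights on the GPU and push the routed expert tensors to system RAM. A sketch assuming a recent llama.cpp build with the --n-cpu-moe flag (older builds use --override-tensor patterns instead); the layer count is something to tune for your card:

# Low-VRAM setup: MoE expert tensors for all 40 layers stay in system RAM
llama-server \
  -m Qwen3.6-35B/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
  -ngl 999 --n-cpu-moe 40 \
  -fa on -c 32768 \
  --cache-type-k q8_0 --cache-type-v q8_0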

LM Studio

Search for “Qwen3.5” or “Qwen3.6” in the model browser. Community quants are available for 3.5 (all three sizes) and 3.6-35B-A3B. On Mac, confirm you’re using the MLX backend (Settings → Runtime) for best performance. 3.6 MLX support was added shortly after the llama.cpp PR landed — check for the MLX variant if you’re on Apple Silicon.


Quantization picks

Unsloth recommends their UD-Q4_K_XL format for the best quality-to-size ratio. Their testing shows Q3 and Q4 produce “effectively similar quality” on Qwen 3.5, so you can drop to Q3 if you need memory savings without a major quality hit.

| Your VRAM | 27B pick | 35B-A3B pick |
| --- | --- | --- |
| 12 GB | Q3_K_M (tight, short context) | Too large |
| 16 GB | Q3_K_M (comfortable) | Too large |
| 24 GB | Q6_K or Q8_0 | Q4_K_M |
| 32 GB | Q8_0 | Q6_K |
| 48 GB+ | FP16 | Q8_0 |

For the 122B-A10B: Q4_K_M on a 128GB Mac, Q3_K_M if you’re tighter on memory.

Avoid going below Q2 on any model. Quality at IQ2 and below degrades noticeably, especially for vision and reasoning tasks.


Pick by hardware

| Your setup | Recommended model | Why |
| --- | --- | --- |
| 8GB VRAM + 64GB RAM | 3.5 or 3.6 35B-A3B at Q3-Q4 with llama.cpp expert offload, or 9B for speed | 35B-A3B runs, but slow (community reports: llama-server with CPU-RAM experts, single-digit to low-teen tok/s). The 9B will feel much better interactively. |
| 12GB VRAM (RTX 3060) | 3.5-27B at Q3, or 9B at Q4 | 35B-A3B needs offload at every quant that fits. 27B Q3 fits entirely. 9B is the sanity-preserving pick. |
| 16GB VRAM (RTX 4060 Ti / 4070 Ti Super) | 3.5-27B at Q3-Q4, or 3.5-35B-A3B at Q3 with light offload | 3.6 runs too, but the IQ3/Q3 file at 15-17GB is tight against context. Dense 27B is the predictable choice here, especially if you care about agent rule-following. |
| 24GB VRAM (RTX 3090 / 4090) | 3.5-35B-A3B at Q4 for pure speed; 3.6-35B-A3B at UD-Q4_K_XL or Q5_K_S for agentic coding | 3.5 gives you 111 tok/s and Ollama support. 3.6 gives you ~101 tok/s and SWE-bench 73.4. Pick based on use case, not the VRAM spec. |
| 32GB VRAM (RTX 5090) | 3.5-35B-A3B at Q6, or 3.6-35B-A3B at Q5-Q6 | Higher quant, more context room. 3.5 hits ~165 tok/s here; 3.6 community numbers at this quant are still coming in. |
| 2x RTX 3090 (48GB) | 3.5-122B-A10B at Q3, or 3.5/3.6 35B-A3B at Q8 | Best quality at this tier is the 122B. If you’d rather run the 35B-A3B at full quality, Q8 fits with context headroom. |
| Mac 48-64GB | 3.5 or 3.6 35B-A3B at Q4-Q6 via MLX | No llama.cpp Gated DeltaNet gap on MLX. 3.6 should land closer to 3.5’s Mac throughput than its CUDA throughput. |
| Mac M4 Max 128GB | 3.5-122B-A10B at Q4 | Fits with headroom. Real quality step over the 35B. |
| Mac M3/M4 Ultra 256GB | 3.5-122B-A10B at Q8 | Best quality, or run 35B-A3B + another model at the same time. |

The 35B-A3B on a 24GB card is the value play. 3.5 if you care about maximum throughput today. 3.6 if you’re running agents or OpenClaw-style coding loops and want the SWE-bench / Terminal-Bench jump. Both fit at Q4 with room for long context.


Qwen 3.6 vs Qwen 3.5 vs Qwen 3: should you switch?

From Qwen 3 (Qwen3-30B-A3B or Qwen3-32B dense): move to 3.5 at minimum. Better benchmarks across the board, 262K context (up from 131K), and native vision. Same architecture family as 3.5 and 3.6, so the same VRAM math applies.

From Qwen 3.5-35B-A3B to Qwen 3.6-35B-A3B: worth it if you run agents, worth holding if you don’t. The agentic coding numbers jumped (SWE-bench 69.2 → 73.4, Terminal-Bench 40.5 → 51.5) and 3.6 preserves <think> traces across turns — useful for long agent runs. You pay with about 30% of current llama.cpp throughput and no Ollama support yet. See the Qwen 3.6 section above for details.

From Qwen 3.5-27B dense to anything MoE: only if you have 24GB+ and you’re not fighting global-rule drift in an agent harness. Dense still has a real case for rigid-rule workflows. For everything else, MoE’s speed-per-quality is the right bet.

The same llama.cpp Gated DeltaNet implementation gap that made 3.5 look slower than Qwen3-30B-A3B at launch is now making 3.6 look slower than 3.5. On MLX (Mac), no slowdown.


Bottom line

For most people with a 24GB GPU, a 35B-A3B MoE is the answer. Pick 3.5 if you want maximum tok/s and Ollama compatibility. Pick 3.6 if you run agents, Aider, or coding loops where SWE-bench 73.4 matters more than throughput. Both fit at Q4 with room for long context.

On a 16GB card, the 27B dense at Q3 is still the sane entry. Slower, but every byte of loaded weight is active and global-rule compliance is more predictable than with MoE.

On an 8GB card with 64GB+ system RAM, llama.cpp’s MoE expert offload can run 35B-A3B at 3.5 or 3.6 — but expect single-digit to low-teen tok/s. The 9B will feel better for interactive use.

Mac Studio owners with 128GB should look at the 122B-A10B at Q4 — a real quality step up without needing the 397B’s 192GB minimum.

Thinking mode is on by default across the Qwen 3.5 / 3.6 lineup. Turn it off for speed, leave it on when accuracy matters. Vision and Apache 2.0 licensing come standard.

If you have the VRAM for a 35B MoE, start there. Which one depends on whether you’re building an agent or chasing tok/s.