📚 More on this topic: Qwen 3.5 397B Guide · Qwen 3.5 9B Setup Guide · Qwen 3.5 Mac: MLX vs Ollama · Qwen Models Guide · VRAM Requirements · Running LLMs on Mac

Qwen 3.5 shipped four model sizes. The 397B flagship gets the headlines, but it needs 192GB+ of memory. Most people don’t have that.

The three Qwen 3.5 models that run on consumer hardware: 27B dense, 35B-A3B MoE, and 122B-A10B MoE. Same architecture — hybrid attention, 262K native context, built-in vision, Apache 2.0. The difference is how much memory they need and how fast they generate tokens.

Updated April 2026: Qwen 3.6-35B-A3B has shipped. Same MoE shape as 3.5-35B-A3B — 35B total, 3B active, 256 experts — but the agentic coding scores jumped and the llama.cpp throughput regressed. Full comparison in the “Qwen 3.6 dropped” section below. If you’re on a 24GB card and running agents, skip there first.


Qwen 3.6 dropped — and here’s what it means

Qwen 3.6-35B-A3B landed in April 2026. On paper it’s the same architecture as 3.5-35B-A3B: 35 billion total parameters, 3 billion active per token, 256 experts (8 routed + 1 shared per token), 40 layers, 262K native context, multimodal, Apache 2.0.

What actually changed, from the Qwen team’s own numbers and the Unsloth model card:

  • Agentic coding got a real jump. SWE-bench Verified moved from 69.2 → 73.4. Terminal-Bench 2.0 went from 40.5 → 51.5. QwenWebBench frontend generation went from 978 → 1397. If you run agents, Aider, or OpenClaw-style coding loops, that’s a meaningful step up.
  • Thinking preservation across turns. 3.6 was trained to keep <think> traces in context between messages. More consistent multi-turn reasoning and better KV cache reuse in agent workflows. The /think and /nothink shortcut switches from 3.5 are gone — you toggle reasoning through chat_template_kwargs now (see the thinking mode section below).
  • Ollama doesn’t support 3.6 yet. As of April 2026, the only practical path to run 3.6-35B-A3B locally is llama.cpp or a downstream like LM Studio. If Ollama is your whole stack, stick with 3.5 until support lands.
  • ~30% slower than 3.5 in current llama.cpp. An independent 24GB RTX 3090 benchmark puts 3.6-35B-A3B at 101.7 tok/s short-prompt and 80.9 tok/s long-prompt at UD-Q4_K_XL, versus ~142 tok/s for 3.5 on the same card at 65K context. This is a Gated DeltaNet implementation gap, not a fundamental regression — the same shape llama.cpp showed at 3.5 launch before it caught up.
  • Pure coding accuracy isn’t universally better. Community HumanEval+ runs at Q6 have been reported lower on 3.6 than 3.5 — roughly 90% vs 94%. SWE-bench rewards long-horizon agent behavior; HumanEval rewards tight single-shot code. The priorities clearly shifted.
  • CUDA 13.2 quirk. r/LocalLLaMA users report low-bit 3.6 quants producing gibberish on CUDA 13.2 in some configurations. If you see garbage tokens, pin CUDA 13.1 or bump to a newer llama.cpp.

Bottom line for a local user: if you’re running agents, you probably want 3.6. If you want maximum tok/s on well-understood code snippets and Ollama compatibility, 3.5-35B-A3B still wins on throughput for another llama.cpp release or two. Both fit on the same 24GB GPU at Q4.


MoE vs Dense: what changed with Qwen 3.6

Qwen 3.6 only ships as 35B-A3B right now. There is no 3.6-27B dense. So the real fork for a local user is: stay on Qwen 3.5-27B dense, or jump to Qwen 3.6-35B-A3B MoE? That’s a MoE-vs-dense question, not a version bump.

MoE rewards you on speed-per-parameter and breadth of knowledge. A MoE routes each token to a small subset of experts, and different tokens hit different experts. That’s great when the task spans many domains — code, math, languages, reasoning — because the model can specialize without paying full compute per token. The 35B-A3B activates 3B parameters per forward pass, so generation runs at 3B-model speed even though 35B of weights are in VRAM.

Dense rewards you on predictability and strict rule-following. Every parameter works on every token. In some r/LocalLLaMA threads, people running long-horizon agent harnesses with tight global rules have reported that dense models obey system prompts more reliably than MoE — if your tokens bounce between experts, global-rule compliance can drift. Nobody’s published hard numbers, but it’s a pattern enough people have hit to take seriously if you’re building an agent rig.

Memory is the other tradeoff. MoE is worse per effective parameter: you load 35B of weights to get 3B of work per token. On a 16GB card you can’t fit 35B at a usable quant without CPU offload, but 27B dense at Q3 fits cleanly. On a 24GB card the fight flips — MoE wins because 3B-active throughput is a big deal and the quality ceiling is higher. Pick MoE (3.5 or 3.6 35B-A3B) if you have 24GB+ and want speed. Pick 27B dense if you have 16GB, or if your agent hates surprises.


The three models at a glance

|  | 3.5 27B Dense | 3.5 35B-A3B MoE | 3.6 35B-A3B MoE | 3.5 122B-A10B MoE |
| --- | --- | --- | --- | --- |
| Total parameters | ~28B | ~36B | ~36B | ~125B |
| Active per token | 28B (all) | 3B | 3B | 10B |
| Expert count | None (dense) | 256 (8 routed + 1 shared) | 256 (8 routed + 1 shared) | MoE |
| Context window | 262K | 262K | 262K | 262K |
| Vision | Yes | Yes | Yes | Yes |
| License | Apache 2.0 | Apache 2.0 | Apache 2.0 | Apache 2.0 |
| VRAM at Q4 (short context) | ~17 GB | ~22 GB | ~23 GB | ~70 GB |
| Generation speed (RTX 3090, Q4) | ~34 tok/s | ~111 tok/s | ~101 tok/s | Multi-GPU only |
| Ollama tag available | Yes | Yes | Not yet | Yes |

The speed gap between the 27B dense and the 35B MoEs isn’t a typo. MoE models only compute through their active parameters per token. The 35B-A3B activates 3B parameters each forward pass — less than the 27B’s full 28B — so it generates tokens faster despite having more total weight loaded.

The trade-off: MoE models use more memory for their quality level. You’re loading 36B of weights to get 3B of compute. The 27B loads 28B and uses all of it.

3.5 vs 3.6 at the same MoE shape: 3.6 trades about 30% of current llama.cpp throughput for better agentic coding scores and thinking-trace preservation. See the Qwen 3.6 section above for the full comparison.


VRAM requirements

By model and quantization (at 4K context)

| Model | Q8_0 | Q6_K | Q4_K_M | Q3_K_M | Q2_K |
| --- | --- | --- | --- | --- | --- |
| 3.5 27B (dense) | ~30 GB | ~23 GB | ~17 GB | ~14 GB | ~11 GB |
| 3.5 35B-A3B (MoE) | ~38 GB | ~30 GB | ~22 GB | ~18 GB | ~14 GB |
| 3.6 35B-A3B (MoE) | ~38 GB | ~30 GB | ~23 GB | ~17 GB | ~13 GB |
| 3.5 122B-A10B (MoE) | ~130 GB | ~100 GB | ~70 GB | ~57 GB | ~45 GB |

3.6 file sizes per the bartowski GGUF repo: Q4_K_M 21.4 GB, Q5_K_S 24.2 GB, Q6_K 30.1 GB, Q8_0 36.9 GB. Add ~1-2 GB for KV cache overhead at short context; more at 128K+.
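A quick sanity check before committing to a download: compare the GGUF file size on disk against your card’s memory, leaving the ~1-2 GB of headroom mentioned above. A rough Linux/NVIDIA sketch (the path matches the download example in the How to run section; the 2 GB figure is a short-context assumption):

# Rough fit check: GGUF file size plus ~2 GB for KV cache and overhead
MODEL=Qwen3.6-35B/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf
SIZE_GB=$(du -BG "$MODEL" | cut -f1 | tr -d 'G')
echo "Model file: ${SIZE_GB} GB -> budget roughly $((SIZE_GB + 2)) GB at short context"

# Compare against what the card actually has
nvidia-smi --query-gpu=memory.total --format=csv,noheader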

How context length scales VRAM (Q4_K_M)

| Model | 4K ctx | 32K ctx | 128K ctx | 262K ctx |
| --- | --- | --- | --- | --- |
| 27B | ~17 GB | ~19 GB | ~27 GB | ~33 GB |
| 35B-A3B | ~22 GB | ~23 GB | ~24 GB | ~25 GB |
| 122B-A10B | ~70 GB | ~75 GB | ~85 GB | ~95 GB |

Notice how the 35B-A3B barely grows with context. Going from 4K to 262K only adds ~3GB. That’s the hybrid attention system at work — 75% of the layers use Gated DeltaNet (a linear attention variant), which doesn’t store traditional KV pairs. The 27B dense also uses hybrid attention, but its wider attention layers mean a bigger KV cache per full-attention layer.

On a 24GB GPU, the 35B-A3B at Q4 fits for conversations up to roughly 200K tokens. Beyond that, you’ll start hitting the VRAM ceiling. Drop to 128K context if you’re seeing OOM errors — still plenty for most workflows.
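If you do hit that ceiling, the fix is a single flag on the server line. A sketch based on the llama-server invocation in the How to run section (the file name is an assumption carried over from that download example):

# Same launch as the How to run section, context capped at 128K instead of 262K
llama-server \
  -m Qwen3.5-35B/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf \
  -ngl 999 -fa on -c 131072 \
  --cache-type-k q8_0 --cache-type-v q8_0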

Check what fits your setup: VRAM calculator


Speed benchmarks

RTX 3090 (24GB)

| Model | Quant | Prompt (tok/s) | Generation (tok/s) |
| --- | --- | --- | --- |
| 3.5 35B-A3B | Q4_K_M | ~35 | 111 |
| 3.6 35B-A3B | UD-Q4_K_XL | n/a | ~101 (short) / ~81 (long) |
| 3.6 35B-A3B | UD-Q3_K_M | n/a | ~120 (community report) |
| 3.6 35B-A3B | Q5_K_XL | n/a | ~75 @ 10K ctx / ~65 @ 120K ctx |
| 3.5 27B | Q4_K_M | ~25 | 34 |

3.6 numbers from aminrj’s 24GB llama.cpp benchmark post and community reports on the Qwen 3.6 HF discussions. Exact tok/s depends on context size, KV cache quant (bf16 vs q8_0), and whether flash attention is on.
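Your numbers will differ, so it is worth measuring directly. llama.cpp ships llama-bench for exactly this; a minimal sketch with illustrative prompt and generation lengths, using the GGUF from the How to run section:

# Measure prompt processing (-p) and generation (-n) throughput yourself,
# flash attention on, all layers on the GPU
llama-bench \
  -m Qwen3.6-35B/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
  -ngl 999 -fa 1 \
  -p 512,8192 -n 128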

RTX 5090 (32GB)

| Model | Quant | Generation (tok/s) |
| --- | --- | --- |
| 3.5 35B-A3B | Q4_K_M | 165 |
| 3.6 35B-A3B | Q4_K_M | No widely-posted community numbers yet; expect 3.5 minus the ~30% llama.cpp gap |
| 3.5 27B | Q4_K_M | ~50 |

Apple Silicon (estimated from memory bandwidth)

| Model | Quant | Generation (tok/s) | Hardware |
| --- | --- | --- | --- |
| 35B-A3B | Q4_K_M | ~40-50 | M4 Max 128GB (MLX) |
| 27B | Q4_K_M | ~25-30 | M4 Max 128GB (MLX) |
| 122B-A10B | Q4_K_M | ~10-15 | M4 Max 128GB (MLX) |

The 35B MoE at 111 tok/s on an RTX 3090 is faster than most 7B models on the same card. MoE with a 3B active count makes that possible. Mac users can squeeze even more speed out of the 35B with the MLX backend — see our MLX vs Ollama Qwen 3.5 benchmarks for the full comparison.
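If you want to drive MLX directly instead of going through LM Studio, the mlx-lm package has a one-line generate command. A sketch only: the model repo name below is an assumption, so check mlx-community on Hugging Face for the actual conversion.

# MLX path on Apple Silicon (model repo name is an assumption)
pip install mlx-lm
mlx_lm.generate \
  --model mlx-community/Qwen3.5-35B-A3B-4bit \
  --prompt "Explain dense vs MoE models in two sentences." \
  --max-tokens 200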

One caveat: early llama.cpp builds show the 35B-A3B running ~35% slower than its predecessor, the Qwen3-30B-A3B, on CUDA. This appears to be an implementation issue with the new Gated DeltaNet layers, not a fundamental regression. Expect this to improve as llama.cpp adds optimizations. On the MLX backend (Mac), this issue doesn’t apply. Qwen 3.6 inherits the same Gated DeltaNet and shows the same ~30% llama.cpp gap vs 3.5 today — same root cause, same expected trajectory as patches land.


35B-A3B: the model most people should run

Three billion active parameters per token means generation runs at roughly 3B speeds. But the model draws from 36B total parameters across 256 experts, giving it a much larger knowledge base. Each token routes to 8 of the 256 experts plus 1 shared expert. Different tokens hit different combinations, so the model uses all its capacity across a conversation — just not all at once per token.

Benchmark scores (thinking mode on)

| Benchmark | Score | What it tests |
| --- | --- | --- |
| MMLU-Pro | 85.3 | Broad knowledge |
| GPQA Diamond | 84.2 | Graduate-level science |
| SWE-bench Verified | 69.2 | Real-world software engineering |
| AIME 2025 | 78.0 | Math competition |
| LiveCodeBench v6 | 78.1 | Coding |

MMLU-Pro 85.3 and GPQA Diamond 84.2 are scores you’d expect from a much larger dense model. Getting them at 111 tok/s on a consumer GPU is hard to ignore.

Who should pick it

Anyone with a 24GB GPU. The Q4 quantization fits with room for long conversations. If you have an RTX 3090, 4090, or 5090, this is the Qwen 3.5 model to run.

On Mac, it works well on any machine with 48GB+ unified memory. The speed benefit of MoE applies on Apple Silicon too, and the extra memory means you can push context higher.

Where MoE doesn’t help

MoE models load more weight per quality level than dense models. The 35B-A3B at Q4 uses ~22GB to get what is effectively 3B of compute per token. A dense 27B at Q4 uses 17GB and all 28B of parameters work on every token.

If your tasks are simple — summarization, translation, short Q&A — the 27B may give comparable quality at lower memory cost. MoE pulls ahead when you need the depth of hundreds of specialized experts: complex reasoning, diverse coding, multilingual work.


27B dense: when less memory is what you have

The 27B is the straightforward option. All parameters active, all the time. No expert routing, no inactive weights sitting in VRAM.

Pick it when:

  • You have 16GB VRAM. The 35B MoE doesn’t fit at Q4. The 27B fits at Q3 (~14GB) with room for moderate context.
  • You want maximum quality per byte loaded. Every gigabyte of the 27B works on every token. No inactive experts.
  • You run long contexts regularly. Starting from a lower memory base (17GB vs 22GB) gives you more room for KV cache growth.
  • Predictable latency matters. Dense models have consistent per-token timing. MoE can vary slightly depending on which experts activate.

The 27B won’t match the 35B-A3B’s benchmark scores. More total parameters — even when most are inactive per token — does translate to broader capability. But for everyday local inference on constrained hardware, the 27B does the job.


122B-A10B: when you want more quality

The 122B-A10B sits between consumer models and the 397B flagship. Ten billion active parameters per token, 125B total, ~70GB at Q4.

It runs on:

  • Mac Studio M4 Max 128GB — fits at Q4 with ~58GB free for the OS and apps
  • Mac Studio M3 Ultra 256GB — fits at Q8 for higher quality
  • 2x RTX 3090 (48GB combined) — fits at Q3 with tensor parallelism
  • 1x H100 80GB — fits at Q4 in VRAM

Ten billion active parameters (vs 3B for the 35B model) means more compute per token and higher quality on hard tasks — complex coding, detailed document analysis, long-form reasoning. If you have the memory and you’re pushing into quality-sensitive work, the 122B is the upgrade from the 35B.

Most people don’t need it. The 35B-A3B covers the majority of local inference use cases at a fraction of the memory cost.


What’s new in Qwen 3.5 (shared across all sizes)

Hybrid attention

Three quarters of the layers use Gated DeltaNet (a linear attention mechanism) and one quarter use standard full attention. This is why the KV cache barely grows with context length and decoding is 8-19x faster than Qwen 3.

262K native context

Base training covers 262K tokens. YaRN scaling extends it to 1M, though quality degrades beyond the native window.
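In llama.cpp, the YaRN extension is exposed as rope-scaling flags. A hedged sketch (the scale factor of 4 is simply 1M divided by the 262K native window, not an official recommendation):

# Stretch beyond the 262K native window with YaRN; expect quality to degrade
llama-server -m model.gguf -ngl 999 -fa on \
  --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 262144 \
  -c 1000000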

Built-in vision

All Qwen 3.5 models are multimodal from training, not language models with a vision encoder bolted on afterward. For local inference with GGUF files, you still need the mmproj file alongside the model — same two-file setup as Qwen2.5-VL.
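In practice that means handing llama.cpp both files. A sketch using the multimodal CLI (file names here are assumptions; use whatever the quant repo you download from actually ships):

# Vision needs the main GGUF plus the matching mmproj projector file
llama-mtmd-cli \
  -m Qwen3.5-35B-A3B-Q4_K_M.gguf \
  --mmproj mmproj-Qwen3.5-35B-A3B-F16.gguf \
  --image invoice.png \
  -p "What is the total on this invoice?"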

Thinking mode on by default

Every Qwen 3.5 model generates chain-of-thought reasoning before answering. Better quality on hard tasks, but costs extra tokens and time. More on how to toggle it below.


Thinking mode: on or off

Qwen 3.5 generates internal reasoning tokens before every response. On by default. Costs extra context and time, but improves accuracy on complex problems.

Keep it on for:

  • Math, logic, coding
  • Multi-step reasoning
  • Tasks where correctness matters more than speed

Turn it off for:

  • Simple Q&A, translation, summarization
  • Chat where latency matters
  • Tasks where extra reasoning doesn’t change the answer

How to toggle

Ollama:

/set parameter enable_thinking false

llama.cpp:

llama-cli -m model.gguf \
  --chat-template-kwargs '{"enable_thinking": false}'

LM Studio: Look for a thinking mode toggle in model settings. If it’s not exposed, set enable_thinking to false in the inference parameters.

API calls (OpenAI-compatible):

{
  "messages": [...],
  "chat_template_kwargs": {"enable_thinking": false}
}
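Against a local llama-server, that body goes straight to the OpenAI-compatible endpoint. A sketch assuming the server’s default port:

# Disable thinking for a single request via chat_template_kwargs
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Summarize: the cache is full"}],
    "chat_template_kwargs": {"enable_thinking": false}
  }'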

With thinking off, the model responds faster but scores lower on reasoning benchmarks. With thinking on, responses include hidden reasoning tokens (2-5x longer internally) but measurably better results on hard problems.


How to run

Ollama (Qwen 3.5 only)

# 35B MoE (recommended for 24GB GPUs)
ollama run qwen3.5:35b-a3b

# 27B dense (for tighter cards)
ollama run qwen3.5:27b

# 122B MoE (needs 70GB+)
ollama run qwen3.5:122b-a10b

Ollama selects a quantization automatically based on your available memory. Tag names may vary — check the Ollama library for current tags.

Ollama does not yet support Qwen 3.6-35B-A3B. Use llama.cpp or LM Studio (below) until that lands.

llama.cpp

Download GGUF files from Unsloth (recommended quants), bartowski, or lmstudio-community:

# Qwen 3.5-35B-A3B at Q4
huggingface-cli download unsloth/Qwen3.5-35B-A3B-GGUF \
  --local-dir Qwen3.5-35B \
  --include "*UD-Q4_K_XL*"

# Qwen 3.6-35B-A3B at Q4 (the current recommended pick for 24GB)
huggingface-cli download unsloth/Qwen3.6-35B-A3B-GGUF \
  --local-dir Qwen3.6-35B \
  --include "*UD-Q4_K_XL*"

# Run Qwen 3.6 with llama-server
llama-server \
  -m Qwen3.6-35B/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
  -ngl 999 -fa on -c 65536 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --jinja --reasoning-format deepseek

Use -ngl 999 to offload all layers to GPU. With MoE models, another option is to omit -ngl entirely and let llama.cpp auto-tune the expert-tensor placement based on --ctx-size — this often outperforms a blanket full offload when your GGUF is close to your VRAM ceiling. Worth testing both on your card.

If the GGUF is bigger than your VRAM, llama.cpp will shard the MoE experts across CPU RAM and disk. This is how 8GB + 64GB RAM builds run 3.6-35B-A3B at all — community reports of this working exist, but exact tok/s depend heavily on RAM speed and the specific experts your prompt routes to. Expect single-digit to low-teen tok/s at best, not the RTX 3090 numbers.
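One way to set that up explicitly rather than leaving it to the auto-tuner: keep attention and shared weights on the GPU and push the routed expert tensors to system RAM. A sketch assuming a recent llama.cpp build with the --n-cpu-moe flag (older builds use --override-tensor patterns instead); the layer count is something to tune for your card:

# Low-VRAM setup: MoE expert tensors for all 40 layers stay in system RAM
llama-server \
  -m Qwen3.6-35B/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
  -ngl 999 --n-cpu-moe 40 \
  -fa on -c 32768 \
  --cache-type-k q8_0 --cache-type-v q8_0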

LM Studio

Search for “Qwen3.5” or “Qwen3.6” in the model browser. Community quants are available for 3.5 (all three sizes) and 3.6-35B-A3B. On Mac, confirm you’re using the MLX backend (Settings → Runtime) for best performance. 3.6 MLX support was added shortly after the llama.cpp PR landed — check for the MLX variant if you’re on Apple Silicon.


Quantization picks

Unsloth recommends their UD-Q4_K_XL format for the best quality-to-size ratio. Their testing shows Q3 and Q4 produce “effectively similar quality” on Qwen 3.5, so you can drop to Q3 if you need memory savings without a major quality hit.

| Your VRAM | 27B pick | 35B-A3B pick |
| --- | --- | --- |
| 12 GB | Q3_K_M (tight, short context) | Too large |
| 16 GB | Q3_K_M (comfortable) | Too large |
| 24 GB | Q6_K or Q8_0 | Q4_K_M |
| 32 GB | Q8_0 | Q6_K |
| 48 GB+ | FP16 | Q8_0 |

For the 122B-A10B: Q4_K_M on a 128GB Mac, Q3_K_M if you’re tighter on memory.

Avoid going below Q2 on any model. Quality at IQ2 and below degrades noticeably, especially for vision and reasoning tasks.


Pick by hardware

| Your setup | Recommended model | Why |
| --- | --- | --- |
| 8GB VRAM + 64GB RAM | 3.5 or 3.6 35B-A3B at Q3-Q4 with llama.cpp expert offload, or 9B for speed | 35B-A3B runs, but slow (community reports: llama-server with CPU-RAM experts, single-digit to low-teen tok/s). The 9B will feel much better interactively. |
| 12GB VRAM (RTX 3060) | 3.5-27B at Q3, or 9B at Q4 | 35B-A3B needs offload at every quant that fits. 27B Q3 fits entirely. 9B is the sanity-preserving pick. |
| 16GB VRAM (RTX 4060 Ti / 4070 Ti Super) | 3.5-27B at Q3-Q4, or 3.5-35B-A3B at Q3 with light offload | 3.6 runs too, but the IQ3/Q3 file at 15-17GB is tight against context. Dense 27B is the predictable choice here, especially if you care about agent rule-following. |
| 24GB VRAM (RTX 3090 / 4090) | 3.5-35B-A3B at Q4 for pure speed; 3.6-35B-A3B at UD-Q4_K_XL or Q5_K_S for agentic coding | 3.5 gives you 111 tok/s and Ollama support. 3.6 gives you ~101 tok/s and SWE-bench 73.4. Pick based on use case, not the VRAM spec. |
| 32GB VRAM (RTX 5090) | 3.5-35B-A3B at Q6, or 3.6-35B-A3B at Q5-Q6 | Higher quant, more context room. 3.5 hits ~165 tok/s here; 3.6 community numbers at this quant are still coming in. |
| 2x RTX 3090 (48GB) | 3.5-122B-A10B at Q3, or 3.5/3.6 35B-A3B at Q8 | Best quality at this tier is the 122B. If you’d rather run the 35B-A3B at full quality, Q8 fits with context headroom. |
| Mac 48-64GB | 3.5 or 3.6 35B-A3B at Q4-Q6 via MLX | No llama.cpp Gated DeltaNet gap on MLX. 3.6 should land closer to 3.5’s Mac throughput than its CUDA throughput. |
| Mac M4 Max 128GB | 3.5-122B-A10B at Q4 | Fits with headroom. Real quality step over the 35B. |
| Mac M3/M4 Ultra 256GB | 3.5-122B-A10B at Q8 | Best quality, or run 35B-A3B + another model at the same time. |

The 35B-A3B on a 24GB card is the value play. 3.5 if you care about maximum throughput today. 3.6 if you’re running agents or OpenClaw-style coding loops and want the SWE-bench / Terminal-Bench jump. Both fit at Q4 with room for long context.


Qwen 3.6 vs Qwen 3.5 vs Qwen 3: should you switch?

From Qwen 3 (Qwen3-30B-A3B or Qwen3-32B dense): move to 3.5 at minimum. Better benchmarks across the board, 262K context (up from 131K), and native vision. Same architecture family as 3.5 and 3.6, so the same VRAM math applies.

From Qwen 3.5-35B-A3B to Qwen 3.6-35B-A3B: worth it if you run agents, worth holding if you don’t. The agentic coding numbers jumped (SWE-bench 69.2 → 73.4, Terminal-Bench 40.5 → 51.5) and 3.6 preserves <think> traces across turns — useful for long agent runs. You pay with about 30% of current llama.cpp throughput and no Ollama support yet. See the Qwen 3.6 section above for details.

From Qwen 3.5-27B dense to anything MoE: only if you have 24GB+ and you’re not fighting global-rule drift in an agent harness. Dense still has a real case for rigid-rule workflows. For everything else, MoE’s speed-per-quality is the right bet.

The same llama.cpp Gated DeltaNet implementation gap that made 3.5 look slower than Qwen3-30B-A3B at launch is now making 3.6 look slower than 3.5. On MLX (Mac), no slowdown.


Bottom line

For most people with a 24GB GPU, a 35B-A3B MoE is the answer. Pick 3.5 if you want maximum tok/s and Ollama compatibility. Pick 3.6 if you run agents, Aider, or coding loops where SWE-bench 73.4 matters more than throughput. Both fit at Q4 with room for long context.

On a 16GB card, the 27B dense at Q3 is still the sane entry. Slower, but every byte of loaded weight is active and global-rule compliance is more predictable than with MoE.

On an 8GB card with 64GB+ system RAM, llama.cpp’s MoE expert offload can run 35B-A3B at 3.5 or 3.6 — but expect single-digit to low-teen tok/s. The 9B will feel better for interactive use.

Mac Studio owners with 128GB should look at the 122B-A10B at Q4 — a real quality step up without needing the 397B’s 192GB minimum.

Thinking mode is on by default across the Qwen 3.5 / 3.6 lineup. Turn it off for speed, leave it on when accuracy matters. Vision and Apache 2.0 licensing come standard.

If you have the VRAM for a 35B MoE, start there. Which one depends on whether you’re building an agent or chasing tok/s.