Best Way to Run Qwen 3.6 35B MoE Locally: VRAM, Speed, Setup

📚 More on this topic: Qwen 3.6 Complete Guide · MoE Models Explained · Best Local Coding Models · VRAM Requirements · llama.cpp vs Ollama vs vLLM

If you have 24GB VRAM and you’ve been running Qwen 3.6-27B dense, here’s the question. Would you trade for the MoE 35B-A3B?

The honest answer is “it depends, and the dependencies are not what you’d guess.” More total parameters. Fewer active. Different speed profile. Different tool-use behavior. And the DFlash 2x speedup that landed yesterday for the 27B dense does not work on the MoE.

This piece is the hardware-reality view of the 35B MoE. What runs where, how fast, and the setup commands that actually work in late April 2026.

Image: Qwen 3.6-27B dense vs 35B-A3B MoE comparison chart on RTX 3090, tok/s, VRAM, active parameters

What about low-VRAM hardware? (May 2026)

The hardware table above starts at a 12GB RTX 3060. The actual floor is lower than that. YouTube creator Codacus demonstrated Qwen 3.6-35B-A3B running on an 8-year-old GTX 1060 with 6GB VRAM, an Intel i3-8100 (4 cores, no hyperthreading), and 24GB DDR4 RAM at 17 tok/s with 256K context — production-stable for week-long uptime.

Five flags get you there, all mainline llama.cpp (no fork required):

--n-cpu-moe N (N = layer count). Pins MoE expert blocks to CPU while keeping attention on GPU. The breakthrough flag — moves the GTX 1060 from 3 → 10 tok/s by itself.
--no-mmap. Forces the full model into RAM upfront, avoiding page faults during inference. 10 → 13.5 tok/s.
Tune --n-cpu-moe down (41 → 35). Pulls some experts back onto GPU using spare VRAM. 13.5 → 17 tok/s, but context window shrinks 100K → 64K.
TurboQuant KV cache quantization (Google DeepMind’s random-rotation method). K=4 bits, V=3 bits, asymmetric because the model uses 8:1 grouped-query attention. Context goes 64K → 256K with no speed loss and negligible quality drop. Same 17 tok/s.
mlock + Docker IPC_LOCK + --mlock. Stops the kernel from paging memory back to disk. Same 17 tok/s, but survives week-long uptime without slow degradation.

What didn’t work: speculative decoding with Qwen 3.5 0.8B as drafter dropped throughput to 11 tok/s despite a 65% accept rate. MoE batched verification pulls from up to 64 different experts per layer (memory thrash), and 30 of 40 layers are state-space layers that can’t parallelize across a draft window. Spec decode does not help this MoE on current methods. If you’re running the Qwen 3.6-27B dense instead, that’s where DFlash applies — see DFlash on RTX 3090 and the DFlash vs MTP head-to-head.

This isn’t an alternative to the setup above. It’s the floor showing what’s possible when you’re hardware-constrained.

What “35B-A3B” actually means

A3B is the part that matters. The model holds 35 billion parameters in memory. For each token it predicts, it routes through 8 of 256 experts plus 1 shared expert. That’s roughly 3 billion parameters worth of compute per forward pass. You pay memory like a 35B and compute like a 3B. That asymmetry is the whole MoE pitch.

Two consequences that change your hardware decision:

More VRAM than 27B dense. All 35B of weights have to live somewhere. UD-Q4_K_M is 22.1 GB versus ~17 GB for the 27B Q4. Same VRAM ballpark, but tighter on a 24GB card once you add KV cache.
Faster generation than a 27B dense at the same quant. 3B-active means the GPU does roughly 3B’s worth of math per token. On an RTX 3090 the 35B-A3B benchmarks ~100 tok/s while the 27B dense lives around 35 tok/s. About 3x faster, despite being a bigger model on disk.

If MoE architecture is new to you, the MoE models explained primer covers the routing logic and why the 256-expert design works for local inference. The short version: experts are small (512-dim intermediate per expert in 3.6), routing is cheap, and the math works out in your favor whenever you’re bottlenecked on memory bandwidth. Most consumer GPUs are.

Hardware reality

Numbers below are from the Unsloth GGUF model card, Amine Raji’s RTX 3090 benchmark, and r/LocalLLaMA week-one reports. All on llama.cpp builds b8954 or newer with --flash-attn on, --cache-type-k q8_0 --cache-type-v q8_0, and 65K context unless noted. Tok/s is generation speed, batch=1, greedy.

Hardware	Recommended Quant	File Size	Tok/s (gen)	–cpu-moe?
RTX 3090 (24GB)	UD-Q4_K_XL	22.4 GB	~101	No
RTX 4090 (24GB)	UD-Q4_K_XL or UD-Q5_K_S	22.4 / 24.9 GB	~120-140	No
RTX 5090 (32GB)	UD-Q5_K_M or UD-Q6_K	26.5 / 29.3 GB	~160-200	No
RTX 5070 Ti (16GB) + 32GB RAM	UD-Q4_K_M	22.1 GB	~25-35	Yes
Dual RTX 5060 Ti (32GB total)	UD-Q4_K_M or UD-Q5_K_S	22.1 / 24.9 GB	~50-70	Optional
RTX 3060 12GB + 64GB RAM	UD-Q3_K_M	16.6 GB	~12-18	Yes
Apple M3 Ultra (96-192GB)	Q6_K MLX	29.3 GB	~35-45	N/A
Apple M2 Pro 32GB	UD-Q3_K_M	16.6 GB	~18-22	N/A

Caveats. RTX 4090 numbers are extrapolated from 3090 throughput plus the typical 4090 advantage on llama.cpp; community posts back this up roughly. RTX 5090 numbers are from a few r/LocalLLaMA reports running early Blackwell builds. Mac numbers are MLX, not GGUF. Metal MoE handling improved a lot in MLX 0.21+.

The 5070 Ti 16GB + 32GB RAM thread on r/LocalLLaMA is the one that surprised people. With --cpu-moe parking the 8 routed experts in system RAM and keeping the shared expert plus attention on GPU, a 16GB card runs the full Q4 model. Slow, but real, and the quality is the full Q4 model rather than a more aggressive quant. For a chat workload that’s a meaningful trade against running UD-IQ2 on-GPU for speed.

–cpu-moe explained, briefly

--cpu-moe is a llama.cpp flag (landed in b8954-era builds) that tells the engine to keep MoE expert weights in CPU RAM and stream them across PCIe per token. The shared expert and attention layers stay on GPU. For a 3B-active MoE the per-token data movement is small enough that the speed hit isn’t fatal. You’re still doing 3B-active math, just with a hop across the bus to fetch which 8 experts to use.

It only makes sense when:

The model doesn’t fit fully in VRAM (16GB cards, 12GB cards)
You have enough system RAM to hold the expert weights (32GB+ for Q4, 64GB+ for Q6)
You can tolerate ~3x slower generation than fully on-GPU

For a 24GB+ card the answer is don’t use it. Full GPU is always faster.

35B MoE vs 27B dense on the same hardware

Here’s the trade table on a single RTX 3090 with 24GB VRAM, Q4 class quants, 64K context:

Metric	Qwen 3.6-27B dense (Q4_K_M)	Qwen 3.6-35B-A3B (UD-Q4_K_XL)
File size	~17 GB	22.4 GB
VRAM used (with KV)	~21 GB	~24 GB
Generation speed	~25-35 tok/s	~80-101 tok/s
Generation speed with DFlash	60-78 tok/s (2x)	not supported
SWE-bench Verified	77.2	73.4
Terminal-Bench 2.0	59.3	51.5
Tool-call reliability (week-one reports)	strong	mixed

So on raw tok/s the MoE wins by 3x without DFlash. With DFlash on the dense, the gap shrinks to roughly even. On agentic-coding scores the dense wins. On general chat or RAG workloads the MoE is the better experience.

The interesting wrinkle: the MoE leaves zero KV-cache headroom on a 3090 at UD-Q4_K_XL with 65K context. Push to 128K and you’ll need UD-Q3_K_M or KV cache offload. The 27B dense gives you more context for the same card.

The Mac path: MLX, not GGUF

Mac users running this on llama.cpp Metal will get it working but slower than they should. The Metal kernels for the Gated DeltaNet layers are still catching up to CUDA. MLX is the right path on Apple Silicon. The Qwen MLX vs Ollama guide walks through the install, and the MLX team shipped 35B-A3B quants in the first week.

For RAM-poor Macs, look for the 3-bit MLX quants in the baa-ai and mlx-community repos. A 3-bit 35B-A3B lands around 14-16 GB and runs on a 16GB M1 Pro. Quality is noticeably below Q4, but it’s the only way to get 35B onto a base Mac. The 27B dense at 4-bit MLX is a better quality-per-byte trade if your Mac has 24GB+ unified memory. See the best local LLMs for Mac writeup for the broader picture.

Setup walkthrough

The fastest path that works today is llama.cpp built from source plus the Unsloth UD quants. Three steps.

1. Build llama.cpp from a recent commit

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j

You need a build at b8954 or later for --cpu-moe and the latest Gated DeltaNet kernels. Pin CUDA 13.1 or earlier. Multiple reports of low-bit 3.6 quants producing gibberish on CUDA 13.2. The issue is in the cuBLAS path, not llama.cpp itself. Amine Raji flagged it explicitly. Unsloth’s docs flag it. If you see garbage tokens, this is the first thing to check.

2. Pull the GGUF + mmproj

The 35B-A3B is multimodal. There are two files: the main GGUF and an mmproj vision sidecar. llama.cpp won’t load the model without mmproj on recent builds, even for text-only inference.

huggingface-cli download unsloth/Qwen3.6-35B-A3B-GGUF \
  Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf mmproj-F16.gguf \
  --local-dir ./qwen36-35b

UD-Q4_K_XL is the Unsloth Dynamic 4-bit “extra large” quant. It’s the recommended balance per their model card. UD-Q3_K_M is the 16GB-card alternative. The full quant table is on the Unsloth GGUF page.

3. Run llama-server

./build/bin/llama-server \
  --model ./qwen36-35b/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
  --mmproj ./qwen36-35b/mmproj-F16.gguf \
  --alias "qwen36-35b-a3b" \
  --ctx-size 65536 \
  --n-gpu-layers 99 \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 20 \
  --presence-penalty 1.5 \
  --min-p 0.00 \
  --port 8001

For a 16GB card add --cpu-moe and drop --n-gpu-layers to fit attention + shared expert in VRAM. For instruct (no thinking) mode, swap to --temp 0.7 --top-p 0.8 and pass --chat-template-kwargs '{"enable_thinking":false}'.

LM Studio works too. Point it at the same GGUF and set the mmproj field in the model config. Ollama is broken for 3.6 GGUFs as of late April. The mmproj sidecar isn’t supported in the Ollama loader yet. Stay on Qwen 3.5-35B-A3B in Ollama until that lands. The depth comparison is in llama.cpp vs Ollama vs vLLM.

When to pick the MoE over the 27B dense

Pick 35B-A3B if:

You want fast everyday chat and RAG workloads on a 24GB card. 100 tok/s feels different from 30 tok/s.
You only have 16GB VRAM and 32GB+ system RAM. CPU-MoE offload makes the full Q4 reachable. The 27B dense with KV offload is slower and worse.
You’re on Apple Silicon with 64GB+ unified memory. MLX handles MoE well now.
You want broader world knowledge in the weights. 35B of stored capacity matters for trivia and niche-domain work.

Pick 27B dense if:

You code primarily and tool-use reliability matters. The 35B-A3B has community reports of dropping tool calls on long agent loops.
You need the DFlash 2x speedup. It’s NVIDIA-only, sm_86+, and only supports the dense.
You want strict instruction-following under heavy system prompts. MoE routing drift bites here.
You want more KV headroom on a 24GB card at long context.

The full Qwen 3.6 guide walks through the dense side in detail and covers the Max-Preview cloud-only variant for completeness.

Honest limits

A few rough edges that are real, week-one of late April 2026:

Tool calling regressions. r/LocalLLaMA threads include reports of the 35B-A3B repeating failed tool calls without reading back context, and skipping tool calls entirely on multi-turn loops. The 27B dense is steadier here. If you’re wiring this into Claude Code, OpenCode, or an MCP harness, test before you commit.
Web search broken in some llama.cpp builds. The web-search tool path through llama-server has been reported broken on builds between b8954 and b8967 specifically for the 35B-A3B. Pin to a known-good build or compile head-of-tree. Issue is open on GitHub.
No DFlash support for MoE. The block-diffusion speculative decoding speedup that doubles 27B dense throughput on a 3090 does not extend to the 35B-A3B. The DFlash team hasn’t said when or whether it will. The MoE routing changes the draft-verify math.
CUDA 13.2 gibberish. Already covered. Pin 13.1, or use a build that has the cuBLAS workaround merged.
Speculative decoding via llama.cpp’s standard path doesn’t help. PR #19493 added MoE-aware speculative decoding, but per community testing the routing overhead eats the win on the 35B-A3B specifically. MTP through vLLM and SGLang is the better path if you need additional speed.

None of these are fatal. Most will be fixed in days, not months. But they’re current as of writing and worth knowing.

Bottom line

The 35B MoE is the right pick for fast general-purpose local AI on a 24GB card or with smart RAM offload on a 16GB card. The 27B dense is the right pick for coding and agentic work on the same hardware, especially with DFlash. Both are Apache 2.0, both are 262K context, both are natively multimodal.

If you only download one, ask yourself which workload dominates. Code all day? 27B dense. Chat, RAG, summarization, multilingual? 35B-A3B. If you have the disk for both, keep both. They cover different jobs and switching takes a llama-server restart.