
Alibaba dropped three Qwen 3.5 models on February 24, 2026, and the local AI community lost its mind. A 35B model that runs at 44 tok/s on a $450 GPU. A 27B dense model that matches DeepSeek-V3.2 on reasoning. A 122B MoE that beats GPT-5 mini on tool use by 30%. All Apache 2.0. All runnable on hardware you can buy today.

This guide covers all three models, what hardware they need, which quantization to pick, and the gotchas nobody tells you about until you’re staring at garbage output.


The three models at a glance

| Model | Total params | Active per token | Architecture | The pitch |
|---|---|---|---|---|
| 35B-A3B | 35B | 3B | MoE (256 experts, 9 active) | Fast. Fits 16GB VRAM. |
| 27B | 27B | 27B | Dense (all params active) | Smartest model under 230B. Single GPU at Q4. |
| 122B-A10B | 122B | 10B | MoE (256 experts, 9 active) | Multi-GPU only. Beats GPT-5 mini on tool use. |

All three share the same architecture innovation: Gated Delta Networks. Three out of every four layers use linear attention (O(n) scaling), with every fourth layer using full quadratic attention. The result is lower KV cache memory and faster decoding than standard transformers. They also share a 262K native context window (extendable to 1M via YaRN), native multimodal (text + image + video), and a 248K vocabulary covering 201 languages.
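The 3:1 pattern is easy to picture in code. A toy sketch of the layer schedule — the layer count here is illustrative, not the real config, and the memory argument assumes only the full-attention layers carry a growing KV cache, which is what the lower-cache claim implies:

```python
# Illustrative sketch of a 3:1 hybrid attention schedule.
# The 48-layer count is hypothetical; real configs differ per model.
def layer_schedule(n_layers: int, full_every: int = 4) -> list[str]:
    """Every `full_every`-th layer uses full quadratic attention;
    the rest use linear (Gated DeltaNet-style) attention."""
    return [
        "full" if (i + 1) % full_every == 0 else "linear"
        for i in range(n_layers)
    ]

sched = layer_schedule(48)
print(sched[:8])            # ['linear', 'linear', 'linear', 'full', ...]
# Only the full-attention layers keep a KV cache that grows with context,
# so cache memory scales with n_layers / 4 rather than n_layers.
print(sched.count("full"))  # 12 of 48 layers
```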

There’s also a 397B-A17B flagship from the initial release, but that needs 192GB+ memory and is a different conversation.


Qwen 3.5-35B-A3B: the star of the show

This is the model that changed the math for 16GB GPU owners.

35 billion total parameters across 256 experts, but only 3 billion activate per token (8 routed + 1 shared). You’re loading a 35B model into memory but computing through a 3B model on every forward pass. The speed reflects that: on an RTX 5060 Ti with 16GB VRAM, llama-bench measured 44.3 tok/s at 100K context.

That’s not a cherry-picked number at short context. That’s a hundred thousand tokens of conversation history, generating at speeds that feel instant.

VRAM requirements: 35B-A3B

| Quantization | File size | Memory needed | Fits on |
|---|---|---|---|
| IQ3_XS | 14.5 GB | ~16 GB | RTX 4060 Ti 16GB (tight) |
| Q3_K_M | 16.1 GB | ~18 GB | 16GB GPU + some offload |
| Q4_K_M | 21.2 GB | ~22 GB | RTX 3090, 4090, 5090 |
| Q5_K_M | 24.8 GB | ~26 GB | RTX 3090 with limited context |
| Q6_K | 28.7 GB | ~30 GB | RTX 4090/5090 |
| Q8_0 | 36.9 GB | ~38 GB | 48GB+ GPU or Mac 48GB+ |
| BF16 | 69.4 GB | ~70 GB | Multi-GPU or Mac 96GB+ |

The GQA design (only 2 KV heads) keeps the KV cache tiny. Even at 262K context with Q4 weights, total VRAM stays around 25 GB. That’s the whole model plus full context fitting on a single RTX 3090.
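That claim sanity-checks with back-of-envelope math. A rough KV cache estimator — the 2 KV heads come from the model design described above, while the 128-dim heads and 12 full-attention layers are illustrative assumptions:

```python
def kv_cache_gb(ctx_tokens: int, kv_heads: int = 2, head_dim: int = 128,
                kv_layers: int = 12, bytes_per_elem: float = 2.0) -> float:
    """K and V tensors, per full-attention layer, per KV head, per token."""
    total_bytes = 2 * kv_layers * kv_heads * head_dim * ctx_tokens * bytes_per_elem
    return total_bytes / 1024**3

print(round(kv_cache_gb(262_144), 1))                       # FP16 cache at 262K
print(round(kv_cache_gb(262_144, bytes_per_elem=1.0), 1))   # halved at Q8_0
```

Under these assumptions, ~3 GB of FP16 cache at 262K context on top of 21.2 GB of Q4 weights lands right around the 25 GB figure above.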

Speed benchmarks: 35B-A3B

| Hardware | Quant | Context | Generation (tok/s) | Prompt (tok/s) |
|---|---|---|---|---|
| RTX 5090 32GB | Q4_K_XL | 512 | 194 | 7,026 |
| RTX 5090 32GB | Q4_K | 262K | 97.3 | 2,003 |
| RTX 5080 16GB | Q4_K_M | 4K | ~75 | — |
| RTX 5060 Ti 16GB | Q4 | 100K | 44.3 | — |
| RTX 3090 24GB | Q4_K | 4K | 111.2 | 2,622 |
| RTX 3090 24GB | Q4_K | 131K | 79.4 | 1,288 |
| AMD R9700 (Vulkan) | Q4_K_XL | 512 | 127.4 | 2,713 |
| Strix Halo | Q8 | 4K | 38.5 | 960 |
| Tesla V100 32GB | Q5_K_XL | 128 | 38.4 | 570 |
| Mac M4 Max 64GB | Q4 (MLX) | — | ~70 | — |
| CPU (DDR5) | Q4 | — | ~5-6 | — |

The RTX 3090 at 111 tok/s is worth lingering on. That’s a $700-800 used card doing over a hundred tokens per second with a model that scores 69.2 on SWE-bench Verified. For context, Qwen3-32B (the previous generation dense model) only hit around 30-35 tok/s on the same hardware because all 32B parameters activated every token.

MoE is real.
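The speedup is mostly memory-bandwidth arithmetic: decode speed is bounded by how many weight bytes each token must read. A rough ceiling calculator — the 936 GB/s figure is the RTX 3090 spec, and ~0.6 bytes/weight approximates Q4_K_M:

```python
def decode_ceiling_toks(active_params_b: float, bytes_per_weight: float,
                        bandwidth_gbs: float) -> float:
    """Upper bound on tok/s if decoding were purely weight-read bound."""
    gb_read_per_token = active_params_b * bytes_per_weight
    return bandwidth_gbs / gb_read_per_token

BW_3090 = 936    # GB/s, RTX 3090 memory bandwidth spec
Q4_BYTES = 0.6   # ~4.85 bits per weight for Q4_K_M

print(round(decode_ceiling_toks(3, Q4_BYTES, BW_3090)))   # 35B-A3B: 3B active
print(round(decode_ceiling_toks(27, Q4_BYTES, BW_3090)))  # 27B dense
```

The dense 27B's measured 33.5 tok/s sits at roughly 60% of its ~58 tok/s ceiling, while the MoE's 111 tok/s is far below its ~520 tok/s ceiling — which suggests routing, attention, and cache traffic, not weight reads, become the bottleneck once the active parameter count shrinks.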

For Mac users, there’s a dedicated MLX vs Ollama comparison with speed benchmarks across every Apple Silicon tier.


Qwen 3.5-27B: the smartest model you can fit on one card

The 27B is the only dense model in the Qwen 3.5 family. All 27 billion parameters fire on every forward pass. No expert routing, no MoE overhead, no activation-ratio tricks. It is slower per token than the 35B-A3B, but the per-token quality is higher because every parameter contributes to every prediction.

How much higher? Artificial Analysis gave it an Intelligence Index score of 42 out of 51, ranking it #1 among all open-weight models in the 4B-40B class. That matches DeepSeek-V3.2 on reasoning.

How 27B compares to 35B-A3B

| Benchmark | 27B dense | 35B-A3B MoE | Winner |
|---|---|---|---|
| GPQA Diamond | 85.5 | 84.2 | 27B |
| MMLU-Pro | 86.1 | 85.3 | 27B |
| LiveCodeBench v6 | 80.7 | 74.6 | 27B |
| SWE-bench Verified | 72.4 | 69.2 | 27B |
| IFEval | 95.0 | 91.9 | 27B |
| CodeForces | 1,899 | 2,028 | 35B |
| BFCL-V4 (tool use) | 68.5 | 67.3 | 27B |

The 27B wins on every reasoning and coding benchmark except competitive programming (CodeForces). The 35B-A3B wins on speed by 3-5x. Pick based on your bottleneck: if you’re waiting on the model to think harder, use the 27B. If you’re waiting on the model to type faster, use the 35B-A3B.

VRAM requirements: 27B dense

| Quantization | File size | Memory needed | Fits on |
|---|---|---|---|
| IQ4_XS | 14.7 GB | ~16 GB | RTX 4060 Ti 16GB (best fit) |
| Q4_K_S | 15.6 GB | ~17 GB | 16GB GPU with room for short context |
| Q4_K_M | 16.5 GB | ~18 GB | 16GB GPU (tight) or 24GB GPU |
| Q5_K_M | 19.4 GB | ~21 GB | RTX 3090 24GB |
| Q6_K | 22.7 GB | ~24 GB | RTX 3090 24GB (limited context) |
| Q8_0 | 28.6 GB | ~30 GB | RTX 4090/5090 or Mac 32GB+ |
| BF16 | 53.8 GB | ~54 GB | Multi-GPU or Mac 64GB+ |

16GB warning: Q4_K_M at 16.5 GB technically fits a 16GB card, but there’s almost no room for KV cache. Practical context will be limited to 4K-8K tokens before you OOM. For 16GB GPUs, use IQ4_XS (14.7 GB) or Q4_K_S (15.6 GB) and accept shorter context. On a 24GB card, Q4_K_M runs with full 131K context at around 24 GB total.
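Before downloading, you can sanity-check a quant + context combination: file size plus an FP16 KV cache estimate plus runtime overhead. A sketch — the ~12 KB/token cache figure and the 1 GB overhead are illustrative assumptions, not measured values:

```python
def fits(file_gb: float, ctx_tokens: int, vram_gb: float,
         kv_bytes_per_token: int = 12_288, overhead_gb: float = 1.0) -> bool:
    """True if weights + FP16 KV cache + runtime overhead fit in VRAM."""
    kv_gb = ctx_tokens * kv_bytes_per_token / 1024**3
    return file_gb + kv_gb + overhead_gb <= vram_gb

# 27B Q4_K_M (16.5 GB) on a 16GB card: no room even at 8K context
print(fits(16.5, 8_192, 16.0))   # False
# IQ4_XS (14.7 GB) at 8K context: fits, barely
print(fits(14.7, 8_192, 16.0))   # True
```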

Speed benchmarks: 27B dense

| Hardware | Quant | Context | Generation (tok/s) | Prompt (tok/s) |
|---|---|---|---|---|
| RTX 3090 24GB | Q4_K | 4K | 33.5 | 1,104 |
| RTX 3090 24GB | Q4_K | 86K | 27.5 | 599 |

About 3x slower than the 35B-A3B on the same GPU. This is the MoE vs dense tradeoff in one table: the 35B-A3B hits 111 tok/s where the 27B hits 33.5 tok/s on identical hardware. Both produce good output. The 27B produces slightly better output on average, at one-third the speed.

For coding tasks where quality matters more than throughput – debugging a tricky function, writing a complex SQL query, reviewing a PR – the 27B is worth the wait. For chat, brainstorming, and tool-calling agents where you want snappy responses, the 35B-A3B wins.


Qwen 3.5-122B-A10B: multi-GPU territory

The 122B is the model for people who have more hardware than patience. 122 billion total parameters, 10 billion active per token (8 routed + 1 shared from 256 experts). Apache 2.0. It ties GPT-5 mini on SWE-bench (72.0) and beats it on tool use by 30% (BFCL-V4: 72.2 vs 55.5).

VRAM requirements: 122B MoE

| Quantization | File size | Memory needed | Fits on |
|---|---|---|---|
| Q3_K_M | ~59 GB | ~62 GB | Mac 64GB tight, 3x 3090 with offload |
| Q4_K_M | 74.4 GB | ~76 GB | Mac 96GB+, 4x 3090, 2x 4090 |
| Q5_K_M | 87.1 GB | ~90 GB | Mac 128GB, multi-GPU |
| Q6_K | 100.8 GB | ~106 GB | Mac 128GB+ |
| Q8_0 | 129.9 GB | ~132 GB | Mac 192GB or 2x A100 80GB |
| BF16 | 244 GB | ~245 GB | Multi-GPU cluster |

Three RTX 3090s (72GB total) can run the Q3-level quant with full GPU offload, or Q4_K_M with partial system RAM spillover. A 64GB Mac runs the model at Q3 via MLX. A 96GB+ Mac runs Q4_K_M comfortably.

Benchmarks: 122B vs the competition

| Benchmark | 122B-A10B | GPT-5 mini | Claude Sonnet 4.5 |
|---|---|---|---|
| SWE-bench Verified | 72.0 | 72.0 | 62.0 |
| BFCL-V4 (tool use) | 72.2 | 55.5 | — |
| Terminal Bench 2 | 49.4 | 31.9 | — |
| GPQA Diamond | 86.6 | 85.7 | 83.4 |
| MMLU-Pro | 86.7 | — | — |
| CodeForces | 2,100 | — | — |

This is a model that competes with commercial APIs while running on hardware you own. The tool use gap (72.2 vs 55.5) is why agent builders are paying attention.

For local inference speed, the DGX Spark (128GB unified, NVFP4) benchmarked at 8-15 tok/s depending on mode. Consumer multi-GPU numbers are still sparse since the model is four days old.


Which model should you run?

| Your hardware | Model | Quant | What you get |
|---|---|---|---|
| RTX 3060 12GB | 35B-A3B | IQ3_XS | Fits, decent quality, ~25-30 tok/s |
| RTX 4060 Ti 16GB | 35B-A3B | Q4_K_M | Sweet spot. 44 tok/s at 100K context |
| RTX 4060 Ti 16GB | 27B dense | IQ4_XS | Smarter, slower, short context |
| RTX 3090 24GB | 35B-A3B | Q4_K_M | 111 tok/s. Full 262K context. |
| RTX 3090 24GB | 27B dense | Q5_K_M | Best reasoning quality on one card |
| RTX 4090/5090 | 35B-A3B | Q6_K-Q8 | Overkill speed, high quality |
| 2-3x 3090/4090 | 122B-A10B | Q3-Q4 | API-competitive output, your hardware |
| Mac 16GB | 35B-A3B | Q4 (MLX) | Tight but works. MLX route preferred. |
| Mac 32-48GB | 27B or 35B | Q4-Q8 | Both comfortable. Pick by use case. |
| Mac 64GB+ | 122B-A10B | Q4 (MLX) | Runs Q4 comfortably |
| CPU only (DDR5) | 35B-A3B | Q4 | ~5-6 tok/s. Usable for async tasks. |

If you own one GPU and want one model, get the 35B-A3B at Q4_K_M. It fits everywhere from 16GB to 32GB, runs fast, and the quality gap vs the 27B is small outside of hard reasoning tasks.


The overthinking problem

Qwen 3.5 defaults to thinking mode. Every response starts with a <think>...</think> block where the model reasons through the problem before answering. For complex math or multi-step coding, this helps. For “what time zone is Tokyo in?”, the model burns hundreds of tokens second-guessing itself before stating the obvious.

Unlike Qwen 3, Qwen 3.5 does not support /think and /nothink commands in prompts. The model cards say this explicitly. You cannot flip thinking on and off per-message the way you could with Qwen 3.

How to disable thinking

llama.cpp:

```bash
llama-server -m qwen3.5-35b-a3b-q4_k_m.gguf \
  --chat-template-kwargs '{"enable_thinking": false}' \
  -c 131072 -ngl 99
```

Ollama: Thinking is handled at the template level. As of v0.17.4, there’s no clean per-request toggle. Workaround: create a custom Modelfile with a system prompt that says “Do not use chain-of-thought reasoning. Answer directly.” This reduces (but doesn’t eliminate) thinking tokens.
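If your runtime offers no clean toggle, stripping the block in post-processing is a workable fallback — it hides the reasoning but does not save the generation time. A minimal sketch, assuming Qwen-style `<think>...</think>` markers:

```python
import re

def strip_thinking(text: str) -> str:
    """Remove a <think>...</think> block from model output."""
    # DOTALL lets the reasoning span multiple lines; the second pass
    # handles an unclosed <think> left by a truncated generation.
    cleaned = re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL)
    cleaned = re.sub(r"<think>.*\Z", "", cleaned, flags=re.DOTALL)
    return cleaned.strip()

raw = "<think>User asks about Tokyo...\nJST is UTC+9.</think>\nTokyo is in JST (UTC+9)."
print(strip_thinking(raw))   # Tokyo is in JST (UTC+9).
```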

vLLM / OpenAI-compatible API:

```python
from openai import OpenAI

# Point the client at your local vLLM server ("EMPTY" is vLLM's
# conventional placeholder API key).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3.5-35B-A3B",
    messages=[{"role": "user", "content": "What time zone is Tokyo in?"}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}}
)
```

Token budget approach: If you want thinking but capped, the thinking_budget API parameter limits how many reasoning tokens the model generates before it’s forced to answer. Set it to 200-500 tokens for most tasks. NVIDIA NIM implements this as nvext.max_thinking_tokens.


The quant situation

Q4_K_M is the sweet spot

For most setups, Q4_K_M is the right quant. It balances file size, quality, and speed. The quality gap between Q4_K_M and Q8_0 exists but is small for general use.

Unsloth GGUF bug (now fixed)

Unsloth’s initial “Dynamic 2.0” GGUF quants (UD-Q2_K_XL, UD-Q3_K_XL, UD-Q4_K_XL) had a bug: they applied MXFP4 precision to attention tensors where it caused quality degradation. Symptoms included garbled output and repetition loops, especially on the 122B model.

Fixed on February 27, 2026. Unsloth retired MXFP4 from all XL quant recipes. If you downloaded Qwen 3.5 GGUFs before that date, re-download them. The standard Q4_K_M quants from bartowski or lmstudio-community were never affected.

KV cache at Q8: free VRAM

Quantizing the KV cache from FP16 to Q8_0 halves the VRAM used for context with essentially no quality loss. Perplexity increase is 0.0043. That’s nothing.

llama.cpp:

```bash
llama-server -m model.gguf -ctk q8_0 -ctv q8_0 -c 131072 -ngl 99
```

Ollama:

```bash
OLLAMA_KV_CACHE_TYPE=q8_0 OLLAMA_FLASH_ATTENTION=1 ollama serve
```

Flash Attention must be enabled for KV cache quantization. On the 35B-A3B, this saves roughly 2-3 GB at long context – enough to squeeze in an extra 50K tokens of conversation history on a 24GB card.


llama.cpp --fit: stop guessing layer counts

llama.cpp added automatic VRAM fitting in PR #16653 (merged December 2025). The --fit flag is on by default in recent builds. It does virtual test allocations and iteratively adjusts layer offloading until the model maximizes your GPU utilization.

For MoE models specifically, --fit prioritizes keeping dense layers (attention, embedding) in VRAM and spills sparse expert weights to system RAM first. This is the right tradeoff: expert routing means only ~3% of expert weights activate per token, so the RAM latency penalty is small for those tensors.
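That priority order can be sketched as a greedy fill — a deliberate simplification, since the real implementation uses virtual test allocations rather than a static size table, and the tensor names and sizes below are made up:

```python
# Greedy sketch of MoE-aware VRAM fitting: dense tensors are placed first,
# sparse expert tensors spill to system RAM. Sizes in GB are illustrative.
tensors = [
    ("embedding", 1.2, "dense"),
    ("attention", 3.5, "dense"),
    ("experts",  16.5, "sparse"),
]

def place(tensors, vram_gb):
    placement, used = {}, 0.0
    # Dense tensors are read every token; experts are only ~3% active,
    # so they pay the smallest penalty for living in system RAM.
    for name, size, kind in sorted(tensors, key=lambda t: t[2] == "sparse"):
        if used + size <= vram_gb:
            placement[name], used = "vram", used + size
        else:
            placement[name] = "ram"
    return placement

print(place(tensors, vram_gb=8.0))
# {'embedding': 'vram', 'attention': 'vram', 'experts': 'ram'}
```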

If you’re manually setting --n-gpu-layers, you’re disabling --fit. Unless you have a specific reason, let the automation handle it.


Ollama tool calling: fixed, mostly

v0.17.3 (February 27, 2026): Fixed parsing of tool calls emitted during thinking mode. Before this, if Qwen 3.5 decided to call a tool while still inside a <think> block, the tool call was silently dropped.

v0.17.4 (February 27, 2026): Added stable indices for parallel tool calls. Also added official Qwen 3.5 model tags to the Ollama library.

Still broken: The renderer side of multi-turn tool calling has issues – prompts sent back to the model can contain unclosed <think> tags, corrupting subsequent turns. Penalty sampling (repeat_penalty, presence_penalty) is also silently ignored by the Go runner, which matters because Qwen 3.5’s official sampling parameters include presence_penalty=1.5 to prevent repetition loops.

If you’re building agents with Qwen 3.5 tool calling, test with llama.cpp or vLLM first. Ollama works for basic tool use but gets unreliable in multi-turn agentic loops.
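When you do serve via llama.cpp or vLLM, pass the penalty explicitly in each request body. A minimal sketch of an OpenAI-compatible payload — temperature and top_p here are illustrative; only presence_penalty=1.5 comes from the official sampling parameters:

```python
# Sampling config following Qwen 3.5's documented anti-repetition setting.
sampling = {
    "temperature": 0.7,        # illustrative, not an official default
    "top_p": 0.95,             # illustrative, not an official default
    "presence_penalty": 1.5,   # the documented fix for repetition loops
}

# Merge into an OpenAI-compatible chat completion request body.
request = {
    "model": "qwen3.5-35b-a3b",
    "messages": [{"role": "user", "content": "Run the tests and summarize."}],
    **sampling,
}
print(request["presence_penalty"])   # 1.5
```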


How to run it

Ollama (simplest)

```bash
# 35B-A3B (recommended for most users)
ollama pull qwen3.5:35b-a3b
ollama run qwen3.5:35b-a3b

# 27B dense
ollama pull qwen3.5:27b
ollama run qwen3.5:27b

# 122B MoE (needs ~76GB)
ollama pull qwen3.5:122b
ollama run qwen3.5:122b
```

llama.cpp (more control)

```bash
# Download from HuggingFace
huggingface-cli download bartowski/Qwen_Qwen3.5-35B-A3B-GGUF \
  --include "*Q4_K_M*" \
  --local-dir ./qwen35

# Run with KV cache quantization + auto-fit
llama-server \
  -m ./qwen35/Qwen3.5-35B-A3B-Q4_K_M.gguf \
  -c 131072 \
  -ctk q8_0 -ctv q8_0 \
  -ngl 99
```

Mac (MLX for speed, Ollama for ecosystem)

MLX runs the 35B-A3B roughly 2x faster than Ollama on Apple Silicon. See the MLX vs Ollama benchmark guide for setup instructions and speed numbers across every chip tier.

```bash
# MLX route (fastest on Mac)
pip install mlx-lm
mlx_lm.generate --model mlx-community/Qwen3.5-35B-A3B-4bit \
  --prompt "Explain MoE architecture" --max-tokens 500
```

Known issues (February 2026)

Repetition in thinking mode: Without presence_penalty=1.5, the model tends to loop during extended reasoning. Set this in your inference config. Ollama silently ignores penalty parameters as of v0.17.4 – use llama.cpp or vLLM if you need reliable penalty sampling.

Vision model assertion in llama.cpp: An assertion in the multimodal path prevents KV cache reuse, forcing full prompt reprocessing on every turn. If you're running text-only, don't pass --mmproj. A fix PR exists but hasn't merged.

27B CUDA eval bug: Issue #19860 reports CUDA errors during evaluation. Make sure you’re on the latest llama.cpp build.

Qwen 3.5 is NOT Qwen 3. The model architecture is different (Gated DeltaNet vs standard transformer). Chat templates differ. The /think and /nothink toggles from Qwen 3 do not work. Don't assume your Qwen 3 configs transfer cleanly.


Bottom line

The 35B-A3B is the model that matters for most local AI builders. It runs on a $450 GPU at speeds that feel like a cloud API, scores within 5% of the 27B dense on most benchmarks, and fits more context than you’ll probably use. The 27B is the choice when quality per token matters more than throughput. The 122B is for multi-GPU setups that want output matching commercial APIs without the monthly bill.

Q4_K_M, KV cache at Q8, and let --fit handle the layer allocation. That’s the setup. The rest is just choosing your model.