More on this topic: Gemma Models Guide | Qwen 3.5 Local Guide | VRAM Requirements | GPU Buying Guide | TurboQuant KV Cache Compression

Google just shipped Gemma 4, and two things matter more than the benchmarks: it’s Apache 2.0, and it does vision, video, and audio in a single model that fits on consumer hardware.

Gemma 3 had a restrictive license that scared off anyone building commercial products. Qwen and Llama ate its lunch. Gemma 4 fixes that with a clean Apache 2.0 license – no custom clauses, no “Harmful Use” carve-outs, no legal overhead. That alone makes this worth paying attention to.

The model quality is strong too. The 31B dense variant ranks #3 among open models on LMArena (ELO 1452), and the 26B-A4B MoE punches at nearly the same level (#6, ELO 1441) while running at 150 tok/s on an RTX 4090. Here’s what you need to know to run it locally.


The lineup

Gemma 4 ships in four sizes, all multimodal:

| Model | Total Params | Active Params | Context | Modalities | Ollama Size |
|---|---|---|---|---|---|
| E2B | 5.1B | 2.3B | 128K | Text + Image + Video + Audio | 7.2 GB |
| E4B | 8B | 4.5B | 128K | Text + Image + Video + Audio | 9.6 GB |
| 26B-A4B | 26B | 3.8B (MoE) | 256K | Text + Image + Video | 18 GB |
| 31B | 31B | 31B (Dense) | 256K | Text + Image + Video | 20 GB |

The naming is a bit confusing. E2B and E4B are “edge” models – the numbers reflect effective compute, not raw parameters. The 26B-A4B is a Mixture-of-Experts model with 128 experts where 8 are active per token, giving you big-model quality at small-model inference cost.
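That 8-of-128 routing is easy to illustrate. The sketch below is a generic top-k softmax router in Python, not Gemma 4's actual gating code – the expert count and k match the figures above, everything else is illustrative:

```python
import math
import random

def top_k_route(logits, k=8):
    """Generic MoE gating: keep the k highest-scoring experts and
    renormalize their scores with a softmax so the weights sum to 1."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    exps = [math.exp(logits[i]) for i in top]
    total = sum(exps)
    return {i: e / total for i, e in zip(top, exps)}

random.seed(0)
router_logits = [random.gauss(0, 1) for _ in range(128)]  # one score per expert
weights = top_k_route(router_logits)

assert len(weights) == 8                        # only 8 of 128 experts fire
assert abs(sum(weights.values()) - 1.0) < 1e-9  # gate weights are normalized
```

Every token still passes through the shared layers plus its 8 selected experts, which is why the active parameter count (3.8B) – not the total (26B) – sets the compute cost per token.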

All four models support image and video input out of the box. The E2B and E4B edge models also handle audio input (speech recognition and audio scene understanding), with audio for the larger models coming later.

Architecture highlights: alternating sliding-window and full-context attention layers, Per-Layer Embeddings (PLE) for better representation, and shared KV cache where later layers reuse K/V from earlier ones to save VRAM.
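The VRAM payoff of that layer mix fits in a back-of-envelope model: full-attention layers cache the whole context, sliding-window layers cap out at the window size. The layer split below matches the 10-of-60 figure cited in the llama.cpp notes later in this piece; the 1K window is an assumption for illustration, not Gemma 4's published config:

```python
def kv_tokens_cached(n_layers, n_full, ctx, window):
    """Token-slots held in KV cache for a hybrid-attention stack:
    full-attention layers cache ctx tokens, SWA layers cap at the window."""
    n_swa = n_layers - n_full
    return n_full * ctx + n_swa * min(ctx, window)

ctx = 131072  # 128K context
hybrid = kv_tokens_cached(n_layers=60, n_full=10, ctx=ctx, window=1024)
all_full = kv_tokens_cached(n_layers=60, n_full=60, ctx=ctx, window=1024)

assert hybrid / all_full < 0.2  # hybrid stack caches <20% of the token-slots
```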


Benchmarks: the actual numbers

Reasoning and knowledge

| Benchmark | E2B | E4B | 26B-A4B | 31B | Gemma 3 27B |
|---|---|---|---|---|---|
| MMLU Pro | 60.0% | 69.4% | 82.6% | 85.2% | 67.6% |
| AIME 2026 | 37.5% | 42.5% | 88.3% | 89.2% | 20.8% |
| GPQA Diamond | 43.4% | 58.6% | 82.3% | 84.3% | 42.4% |
| MMMLU | 67.4% | 76.6% | 86.3% | 88.4% | 70.7% |

The jump from Gemma 3 to Gemma 4 is massive. AIME goes from 20.8% to 89.2% on the 31B. That’s not an incremental improvement – it’s a generational leap.

Coding

| Benchmark | E4B | 26B-A4B | 31B | Gemma 3 27B |
|---|---|---|---|---|
| LiveCodeBench v6 | 52.0% | 77.1% | 80.0% | 29.1% |
| Codeforces ELO | 940 | 1718 | 2150 | 110 |

The 31B’s Codeforces ELO of 2150 puts it in competitive programmer territory. The 26B-A4B at 1718 is strong too, especially considering it only uses 3.8B active params per token.
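To put those ratings in perspective, the standard Elo expected-score formula translates a rating gap into a head-to-head win probability:

```python
def elo_win_prob(r_a, r_b):
    """Expected score of player A against player B under the Elo model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

# The 432-point gap between the MoE (1718) and the dense 31B (2150)
# implies the MoE comes out ahead only ~8% of the time head-to-head.
p = elo_win_prob(1718, 2150)
assert 0.07 < p < 0.09
```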

Vision

| Benchmark | E4B | 26B-A4B | 31B | Gemma 3 27B |
|---|---|---|---|---|
| MMMU Pro | 52.6% | 73.8% | 76.9% | 49.7% |
| MATH-Vision | 59.5% | 82.4% | 85.6% | 46.0% |

The official vision benchmarks look strong, but early community testing has been mixed. Some users report OCR failures and infinite loops on basic image reading tasks – even with the non-quantized 31B. The model supports configurable vision token budgets (70 to 1120 tokens per image), and higher budgets (560-1120) are required for OCR and small text. If vision output seems poor, increase the token budget before giving up. That said, for structured OCR and document parsing, Qwen 3.5 remains more reliable in community testing.

LMArena rankings

  • 31B: ELO 1452, #3 open model globally
  • 26B-A4B: ELO 1441, #6 open model globally

For context, the Chinese models (Qwen 3.5, GLM-5, Kimi K2.5) still hold the top spots, but not by much.


VRAM requirements

| Model | Q4 VRAM | Q8 VRAM | BF16 VRAM | Best GPU Fit |
|---|---|---|---|---|
| E2B | ~4 GB | 5-8 GB | 10 GB | Any 8GB GPU, Apple Silicon 8GB |
| E4B | 5-6 GB | 9-12 GB | 16 GB | RTX 3060 12GB, M-series 8-16GB |
| 26B-A4B | 16-18 GB | 28-30 GB | 52 GB | RTX 4090, RTX 5060 Ti 16GB (tight), M-series 32GB+ |
| 31B | 17-20 GB | 34-38 GB | 62 GB | RTX 3090/4090, M-series 32GB+ |

KV cache warning: Gemma 4’s hidden VRAM cost

The weights fit. The KV cache might not. Gemma 4’s architecture (60 layers, alternating sliding-window and full attention) produces a KV cache that’s 2-3x larger than comparable models at the same context length. The 31B uses ~0.85 MB per context token without KV quantization. In practice:

| Context Length | 31B KV Cache (FP16) | Total VRAM (Q4 weights + KV) |
|---|---|---|
| 4K | ~3.4 GB + 3.6 GB SWA | ~20 GB |
| 8K | ~6.8 GB + 3.6 GB SWA | ~21 GB |
| 64K | ~54 GB | ~25 GB (with Q4 KV) |
| 128K | ~109 GB | ~30 GB (with Q4 KV) |
| 256K | ~218 GB | ~40 GB (with Q4 KV) |

That 3.6 GB SWA (Sliding Window Attention) cache is a fixed cost regardless of context length. And by default, llama.cpp allocates 4 parallel SWA cache slots (14.4 GB total). If you’re a single user, add -np 1 to your launch command to drop this to a single 3.6 GB slot – an instant ~11 GB savings.

The good news: Q4_0 KV cache quantization drops the per-token cost from 0.85 MB to ~0.038 MB (a 22x reduction), and early testing confirms it’s nearly lossless for Gemma 4 because only 10 of 60 layers use full attention. The llama.cpp team has patched the worst of the bloat – pull the latest llama.cpp build or wait for the next Ollama update.
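Putting those numbers together, a small calculator reproduces the table and shows why -np 1 plus Q4 KV matters. The per-token and per-slot figures are this article's measurements, so treat the output as an estimate:

```python
PER_TOKEN_MB = {"f16": 0.85, "q4_0": 0.038}  # 31B per-token KV cost, from above
SWA_SLOT_GB = 3.6                             # fixed SWA cache per parallel slot

def kv_vram_gb(ctx_tokens, cache_type="f16", n_parallel=1):
    """KV-cache VRAM for the 31B: context-proportional full-attention cache
    plus fixed sliding-window slots (llama.cpp defaults to 4 slots)."""
    growing = ctx_tokens * PER_TOKEN_MB[cache_type] / 1024
    return growing + SWA_SLOT_GB * n_parallel

# FP16 KV alone at 128K matches the ~109 GB row in the table above:
assert round(131072 * PER_TOKEN_MB["f16"] / 1024) == 109
# Dropping from the default 4 SWA slots to 1 (-np 1) saves ~11 GB:
assert round(kv_vram_gb(8192, n_parallel=4) - kv_vram_gb(8192, n_parallel=1), 1) == 10.8
# With Q4 KV, 256K of context plus ~20 GB of Q4 weights stays under 40 GB:
assert kv_vram_gb(262144, "q4_0") + 20 < 40
```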

With TurboQuant KV cache compression, the full 256K context fits on a single RTX 5090 at Q4 weights. On 3x RTX 3090s, TurboQuant enables the full 262K native context window entirely in VRAM.

Bottom line for the 31B: A 24GB GPU (RTX 3090/4090) fits the 31B at Q4 up to ~64K context. For 256K, you need 40+ GB or TurboQuant. The 26B-A4B MoE is far friendlier – only ~23 GB at full 256K with Q4.

The E4B is the 8GB GPU play. At Q4, it needs about 6 GB – leaving room for context on a 12GB RTX 3060 or an 8GB M-series Mac.

GGUF quant sizes (26B-A4B, Unsloth Dynamic)

| Quant | File Size |
|---|---|
| UD-Q2_K_XL | 10.5 GB |
| UD-Q3_K_M | 12.5 GB |
| UD-Q4_K_M | 16.9 GB |
| UD-Q5_K_M | 21.2 GB |
| Q8_0 | 26.9 GB |
| BF16 | 50.5 GB |
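A quick sanity check on those sizes: dividing file size by parameter count gives effective bits per weight. The figure slightly overestimates because GGUF metadata isn't subtracted, but it confirms the labels:

```python
def bits_per_weight(file_gb, params_b):
    """Effective bits/weight implied by a GGUF file size (file_gb in GB,
    params_b in billions of parameters); ignores metadata overhead."""
    return file_gb * 8 / params_b

assert round(bits_per_weight(16.9, 26), 1) == 5.2   # UD-Q4_K_M: ~5 bits/weight
assert round(bits_per_weight(50.5, 26), 1) == 15.5  # BF16: ~16 bits, as expected
```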

How to run it

Ollama (easiest)

Day-0 support. Pull and run:

ollama run gemma4          # Default: E4B
ollama run gemma4:e2b      # Edge 2B
ollama run gemma4:26b      # MoE 26B-A4B
ollama run gemma4:31b      # Dense 31B
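Once a model is pulled, Ollama also exposes a local REST API (port 11434 on a default install). A minimal sketch – the /api/generate endpoint and its model/prompt/stream fields are standard Ollama; the prompt is just an example:

```python
import json
import urllib.request

def build_generate_request(model, prompt, host="http://localhost:11434"):
    """Build the POST request for Ollama's /api/generate endpoint
    (streaming disabled so the reply arrives as a single JSON object)."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    return urllib.request.Request(f"{host}/api/generate", data=body,
                                  headers={"Content-Type": "application/json"})

def ollama_generate(model, prompt):
    """Send the request to a running Ollama server and return the reply text."""
    with urllib.request.urlopen(build_generate_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]

req = build_generate_request("gemma4:26b", "Name one use of a 256K context.")
assert req.full_url.endswith("/api/generate")
assert json.loads(req.data)["model"] == "gemma4:26b"
```

Calling ollama_generate() requires the Ollama server to be running and the model already pulled.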

LM Studio

Available at launch. All four variants in GGUF format through the model browser. Search “gemma-4” and pick your quant level.

llama.cpp

Day-0 support with multimodal. Download the GGUF and the vision projector file:

./llama-cli -m gemma-4-26b-a4b-it-Q4_K_M.gguf \
  --mmproj gemma-4-26b-a4b-it-mmproj.gguf \
  -p "Describe this image" --image photo.jpg

Critical tip for VRAM: If you’re running single-user (not serving multiple clients), add -np 1 to your command. This cuts the SWA cache allocation from 4 slots to 1, saving ~11 GB of VRAM on the 31B. For the server, use --cache-type-k q4_0 --cache-type-v q4_0 for KV cache quantization – nearly lossless on Gemma 4 due to its hybrid attention architecture.

# Single-user optimized (saves ~11 GB on 31B)
./llama-cli -m gemma-4-31b-it-Q4_K_M.gguf -np 1 \
  --cache-type-k q4_0 --cache-type-v q4_0

Update your build: The initial release had KV cache bloat and chat template bugs (PRs #21326, #21343). These are fixed in recent builds. Pull the latest llama.cpp or wait for the next Ollama release.

Unsloth GGUFs

Already on HuggingFace with Dynamic quantization (UD-) for all variants. These use per-layer variable precision for better quality at the same file size. Look for unsloth/gemma-4-26B-A4B-it-GGUF and similar.


Gemma 4 vs Qwen 3.5: should you switch?

This is the question everyone’s asking. Qwen 3.5 has been the local AI default for weeks. Does Gemma 4 change that?

26B-A4B MoE vs Qwen 3.5 35B-A3B MoE

| | Gemma 4 26B-A4B | Qwen 3.5 35B-A3B |
|---|---|---|
| Active params | 3.8B | 3B |
| VRAM (Q4) | ~17 GB | ~17 GB |
| LMArena ELO | 1441 | Higher (ranked above) |
| Context | 256K | 131K |
| Vision | Yes | Yes |
| Audio | No (coming) | No |
| License | Apache 2.0 | Apache 2.0 |

Similar VRAM, similar speed. After a week of community testing, the verdict is forming: Qwen 3.5 still leads on coding (Codeforces ELO 2028 vs 1718 for the MoE) and multilingual tasks (201 languages, 250K token vocabulary). Gemma 4 wins on context length (256K vs 131K) and Apache 2.0 licensing. Vision is a wash – Gemma 4 benchmarks higher but Qwen 3.5 is more reliable for structured OCR in practice.

Speed note: on some hardware (notably RTX 5060 Ti 16GB), Gemma 4 26B-A4B runs at only ~11 tok/s vs 60+ tok/s for Qwen 3.5 35B-A3B. This appears to be caused by Gemma 4’s heterogeneous attention head dimensions forcing slower kernel fallbacks. On RTX 4090/5090, the gap is much smaller.

31B dense vs Qwen 3.5 27B dense

Both fit on a 24GB card at Q4 – but Gemma 4’s KV cache bloat limits usable context to ~64K on 24GB, while Qwen 3.5 27B stretches to ~190K context on a 32GB RTX 5090. Gemma 4 31B scores higher on AIME (89.2%) and Codeforces ELO (2150 vs 1899). Qwen 3.5 27B wins on SWE-bench coding (72.4%), HLE with tools (48.5% vs 26.5%), and multilingual tasks. Extended testing shows Qwen producing “more architecturally sound solutions” in complex coding scenarios.

The context window gap is the practical differentiator. If you need long context on limited VRAM, Qwen 3.5 is dramatically more efficient.

E4B vs Qwen 3.5 9B at 8GB

The E4B has fewer effective parameters (4.5B vs 9B) but adds audio and video input that Qwen doesn’t have. For pure text quality at the 8GB tier, Qwen 3.5 9B wins. For multimodal use cases on constrained hardware, Gemma 4 E4B is more capable.

Community speed benchmarks

| Hardware | Model | Generation (tok/s) |
|---|---|---|
| RTX 5090 | 26B-A4B Q4, 4K ctx | 180 |
| RTX 4090 | 26B-A4B Q4_K_XL | ~150 |
| RTX 3090 | 26B-A4B Q4_K_M, 4K ctx | 119 |
| RTX 3090 | 31B Q4, 4K ctx | 34 |
| RTX 5060 Ti | 26B-A4B | ~11 (kernel fallback issue) |
| M4 Pro 48GB | E4B | 54 |
| Raspberry Pi 5 | E4B | ~2.9 |
| DGX Spark | 26B MoE | 23.7 |
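Those throughput numbers translate directly into wall-clock time. For a 2,000-token answer:

```python
def gen_seconds(tokens, tok_per_s):
    """Wall-clock generation time, ignoring prompt processing."""
    return tokens / tok_per_s

assert round(gen_seconds(2000, 150), 1) == 13.3   # RTX 4090, 26B-A4B
assert round(gen_seconds(2000, 11), 1) == 181.8   # RTX 5060 Ti fallback path
```

That 13-second vs 3-minute gap is why the 5060 Ti kernel fallback matters in practice.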

The Apache 2.0 switch

This might matter more than the benchmarks.

Gemma 1 through 3 used Google’s custom “Gemma Terms of Use” license. It had vague “Harmful Use” restrictions, limits on redistribution, and Google could update the terms whenever they wanted. Legal teams at companies considering Gemma for products would look at that license, look at Qwen’s Apache 2.0, and pick Qwen.

Gemma 4 under Apache 2.0 removes all of that. No custom clauses. Full commercial use. Modify, redistribute, deploy however you want. VentureBeat called the license change “the most consequential commercial signal in the launch.”

For hobbyists, this doesn’t change much – you were running Gemma anyway. For anyone building a product, this puts Gemma 4 back on the table.


Who should care

If you’re happy with Qwen 3.5: Stay. After a week of community testing, Qwen 3.5 still leads on coding, multilingual, and VRAM efficiency for long context. Gemma 4 is competitive but not a clear upgrade for most text-only workflows.

If you need multimodal on consumer hardware: Gemma 4 E4B with vision and audio on 8GB VRAM is genuinely new. No other open model at this size does text + image + video + audio. But set realistic expectations on vision quality – increase the token budget to 560+ for OCR tasks.

If you avoided Gemma because of the license: Apache 2.0 changes everything. Gemma 4 is now a first-class option for commercial deployment.

If you have 16-24GB VRAM: The 26B-A4B MoE at 119-150 tok/s on RTX 3090/4090 is fast. At 17GB Q4, it fits on 24GB cards with room for context. Be aware of the KV cache bloat on the 31B dense – use -np 1 and KV cache quantization. Worth testing alongside Qwen 3.5 35B-A3B on your specific tasks.

If you’re on edge hardware: The E2B runs on a Raspberry Pi 5 at ~3 tok/s. The E4B hits 54 tok/s on an M4 Pro. These are real options for embedded and mobile AI.


Bottom line

Gemma 4 is Google finally getting serious about open models. The quality is competitive with Qwen 3.5, the multimodal story is broader (audio!), the license is fixed, and framework support was ready at launch.

After a week of community testing, the picture is clear: Qwen 3.5 wins on coding, multilingual, and VRAM efficiency for long context. Gemma 4 wins on context window size (256K), multimodal breadth, and Apache 2.0 licensing. The KV cache bloat on the 31B is a real limitation that narrows its usable context on consumer GPUs – but the 26B-A4B MoE avoids most of that pain.

The 26B-A4B MoE remains the model to try first. It fits on a 24GB card, runs fast (119 tok/s on RTX 3090), handles vision, and has 256K context. Use -np 1 and KV cache quantization for best results:

ollama run gemma4:26b