Gemma 4 Just Dropped: What Local AI Builders Need to Know
Google just shipped Gemma 4, and two things matter more than the benchmarks: it’s Apache 2.0, and it does vision, video, and audio in a single model that fits on consumer hardware.
Gemma 3 had a restrictive license that scared off anyone building commercial products. Qwen and Llama ate its lunch. Gemma 4 fixes that with a clean Apache 2.0 license – no custom clauses, no “Harmful Use” carve-outs, no legal overhead. That alone makes this worth paying attention to.
The model quality is strong too. The 31B dense variant ranks #3 among open models on LMArena (ELO 1452), and the 26B-A4B MoE punches at nearly the same level (#6, ELO 1441) while running at 150 tok/s on an RTX 4090. Here’s what you need to know to run it locally.
The lineup
Gemma 4 ships in four sizes, all multimodal:
| Model | Total Params | Active Params | Context | Modalities | Ollama Size |
|---|---|---|---|---|---|
| E2B | 5.1B | 2.3B | 128K | Text + Image + Video + Audio | 7.2 GB |
| E4B | 8B | 4.5B | 128K | Text + Image + Video + Audio | 9.6 GB |
| 26B-A4B | 26B | 3.8B (MoE) | 256K | Text + Image + Video | 18 GB |
| 31B | 31B | 31B (Dense) | 256K | Text + Image + Video | 20 GB |
The naming is a bit confusing. E2B and E4B are “edge” models – the numbers reflect effective compute, not raw parameters. The 26B-A4B is a Mixture-of-Experts model with 128 experts where 8 are active per token, giving you big-model quality at small-model inference cost.
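If MoE routing is new to you, here's a toy top-k routing sketch (generic mixture-of-experts with made-up toy dimensions, not Gemma's actual implementation): the router scores all 128 experts for each token, but only the top 8 actually run, which is why per-token compute tracks the ~3.8B active parameters rather than the 26B total.

import numpy as np

n_experts, top_k, d_model = 128, 8, 64       # toy sizes; Gemma 4's real dimensions differ
rng = np.random.default_rng(0)
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_layer(x):
    scores = x @ router                      # router logits over all 128 experts
    top = np.argsort(scores)[-top_k:]        # keep only the 8 best-scoring experts
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()                             # softmax over the winners only
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, top))

y = moe_layer(rng.standard_normal(d_model))  # only 8 of 128 expert matmuls actually ran
print(y.shape)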
All four models support image and video input out of the box. The E2B and E4B edge models also handle audio input (speech recognition and audio scene understanding), with audio for the larger models coming later.
Architecture highlights: alternating sliding-window and full-context attention layers, Per-Layer Embeddings (PLE) for better representation, and shared KV cache where later layers reuse K/V from earlier ones to save VRAM.
Benchmarks: the actual numbers
Reasoning and knowledge
| Benchmark | E2B | E4B | 26B-A4B | 31B | Gemma 3 27B |
|---|---|---|---|---|---|
| MMLU Pro | 60.0% | 69.4% | 82.6% | 85.2% | 67.6% |
| AIME 2026 | 37.5% | 42.5% | 88.3% | 89.2% | 20.8% |
| GPQA Diamond | 43.4% | 58.6% | 82.3% | 84.3% | 42.4% |
| MMMLU | 67.4% | 76.6% | 86.3% | 88.4% | 70.7% |
The jump from Gemma 3 to Gemma 4 is massive. AIME goes from 20.8% to 89.2% on the 31B. That’s not an incremental improvement – it’s a generational leap.
Coding
| Benchmark | E4B | 26B-A4B | 31B | Gemma 3 27B |
|---|---|---|---|---|
| LiveCodeBench v6 | 52.0% | 77.1% | 80.0% | 29.1% |
| Codeforces ELO | 940 | 1718 | 2150 | 110 |
The 31B’s Codeforces ELO of 2150 puts it in competitive programmer territory. The 26B-A4B at 1718 is strong too, especially considering it only uses 3.8B active params per token.
Vision
| Benchmark | E4B | 26B-A4B | 31B | Gemma 3 27B |
|---|---|---|---|---|
| MMMU Pro | 52.6% | 73.8% | 76.9% | 49.7% |
| MATH-Vision | 59.5% | 82.4% | 85.6% | 46.0% |
The official vision benchmarks look strong, but early community testing has been mixed. Some users report OCR failures and infinite loops on basic image reading tasks – even with the non-quantized 31B. The model supports configurable vision token budgets (70 to 1120 tokens per image), and higher budgets (560-1120) are required for OCR and small text. If vision output seems poor, increase the token budget before giving up. That said, for structured OCR and document parsing, Qwen 3.5 remains more reliable in community testing.
LMArena rankings
- 31B: ELO 1452, #3 open model globally
- 26B-A4B: ELO 1441, #6 open model globally
For context, the Chinese models (Qwen 3.5, GLM-5, Kimi K2.5) still hold the top spots, but not by much.
VRAM requirements
| Model | Q4 VRAM | Q8 VRAM | BF16 VRAM | Best GPU Fit |
|---|---|---|---|---|
| E2B | ~4 GB | 5-8 GB | 10 GB | Any 8GB GPU, Apple Silicon 8GB |
| E4B | 5-6 GB | 9-12 GB | 16 GB | RTX 3060 12GB, M-series 8-16GB |
| 26B-A4B | 16-18 GB | 28-30 GB | 52 GB | RTX 4090, RTX 5060 Ti 16GB (tight), M-series 32GB+ |
| 31B | 17-20 GB | 34-38 GB | 62 GB | RTX 3090/4090, M-series 32GB+ |
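If your card isn't listed, a rough rule of thumb (a back-of-the-envelope sketch, not a substitute for testing – real GGUFs mix precisions per tensor and runtimes add their own buffers): weight VRAM is roughly parameter count times bits per weight divided by eight, plus a GB or two of overhead. Assuming ~4.5 effective bits for Q4-class quants:

def weight_vram_gb(params_billion, bits_per_weight=4.5, overhead_gb=1.5):
    # 1B params at 8 bits is about 1 GB; overhead covers compute buffers and the vision projector
    return params_billion * bits_per_weight / 8 + overhead_gb

print(round(weight_vram_gb(31), 1))   # 31B dense -> ~18.9 GB (table: 17-20 GB at Q4)
print(round(weight_vram_gb(8), 1))    # E4B       -> ~6.0 GB  (table: 5-6 GB at Q4)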
KV cache warning: Gemma 4’s hidden VRAM cost
The weights fit. The KV cache might not. Gemma 4’s architecture (60 layers, alternating sliding-window and full attention) produces a KV cache that’s 2-3x larger than comparable models at the same context length. The 31B uses ~0.85 MB per context token without KV quantization. In practice:
| Context Length | 31B KV Cache (FP16) | Total VRAM (Q4 weights + KV) |
|---|---|---|
| 4K | ~3.4 GB + 3.6 GB SWA | ~20 GB |
| 8K | ~6.8 GB + 3.6 GB SWA | ~21 GB |
| 64K | ~54 GB | ~25 GB (with Q4 KV) |
| 128K | ~109 GB | ~30 GB (with Q4 KV) |
| 256K | ~218 GB | ~40 GB (with Q4 KV) |
That 3.6 GB SWA (Sliding Window Attention) cache is a fixed cost regardless of context length. And by default, llama.cpp allocates 4 parallel SWA cache slots (14.4 GB total). If you’re a single user, add -np 1 to your launch command to drop this to a single 3.6 GB slot – an instant ~11 GB savings.
The good news: Q4_0 KV cache quantization drops the per-token cost from 0.85 MB to ~0.038 MB (a 22x reduction), and early testing confirms it’s nearly lossless for Gemma 4 because only 10 of 60 layers use full attention. The llama.cpp team has patched the worst of the bloat – pull the latest llama.cpp build or wait for the next Ollama update.
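To see how those numbers combine, here's a quick budget calculator using the per-token figures above (a sketch – exact allocations vary by runtime and build):

def kv_cache_gb(ctx_tokens, mb_per_token, swa_slots=1, swa_gb_per_slot=3.6):
    growing = ctx_tokens * mb_per_token / 1024   # full-attention layers: grows with context
    fixed = swa_slots * swa_gb_per_slot          # SWA layers: fixed-size window per parallel slot
    return growing + fixed

print(round(kv_cache_gb(65_536, 0.85), 1))     # 64K ctx, FP16 KV, -np 1  -> ~58 GB (won't fit)
print(round(kv_cache_gb(65_536, 0.038), 1))    # 64K ctx, Q4_0 KV, -np 1  -> ~6 GB
print(round(kv_cache_gb(262_144, 0.038), 1))   # 256K ctx, Q4_0 KV, -np 1 -> ~13 GB

Add the Q4 weight footprint from the table further up (roughly 17-20 GB for the 31B) on top of these figures to estimate the total.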
With TurboQuant KV cache compression, the full 256K context fits on a single RTX 5090 at Q4 weights. On 3x RTX 3090s, TurboQuant fits the full 256K (262,144-token) native context window entirely in VRAM.
Bottom line for the 31B: A 24GB GPU (RTX 3090/4090) fits the 31B at Q4 up to ~64K context. For 256K, you need 40+ GB or TurboQuant. The 26B-A4B MoE is far friendlier – only ~23 GB at full 256K with Q4.
The E4B is the 8GB GPU play. At Q4, it needs about 6 GB – leaving room for context on a 12GB RTX 3060 or an 8GB M-series Mac.
GGUF quant sizes (26B-A4B, Unsloth Dynamic)
| Quant | File Size |
|---|---|
| UD-Q2_K_XL | 10.5 GB |
| UD-Q3_K_M | 12.5 GB |
| UD-Q4_K_M | 16.9 GB |
| UD-Q5_K_M | 21.2 GB |
| Q8_0 | 26.9 GB |
| BF16 | 50.5 GB |
How to run it
Ollama (easiest)
Day-0 support. Pull and run:
ollama run gemma4 # Default: E4B
ollama run gemma4:e2b # Edge 2B
ollama run gemma4:26b # MoE 26B-A4B
ollama run gemma4:31b # Dense 31B
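Once a model is pulled, Ollama's local REST API accepts base64-encoded images on /api/generate, which is the easiest way to script the multimodal side. A minimal sketch – the gemma4 tags follow the names above; adjust to whatever tag your Ollama version actually registers:

import base64, json, urllib.request

with open("photo.jpg", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "gemma4",                   # or gemma4:26b / gemma4:31b
    "prompt": "Describe this image.",
    "images": [img_b64],                 # Ollama expects base64-encoded image data
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
print(json.loads(urllib.request.urlopen(req).read())["response"])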
LM Studio
Available at launch. All four variants in GGUF format through the model browser. Search “gemma-4” and pick your quant level.
llama.cpp
Day-0 support with multimodal. Download the GGUF and the vision projector file:
./llama-cli -m gemma-4-26b-a4b-it-Q4_K_M.gguf \
--mmproj gemma-4-26b-a4b-it-mmproj.gguf \
-p "Describe this image" --image photo.jpg
Critical tip for VRAM: If you’re running single-user (not serving multiple clients), add -np 1 to your command. This cuts the SWA cache allocation from 4 slots to 1, saving ~11 GB of VRAM on the 31B. For the server, use --cache-type-k q4_0 --cache-type-v q4_0 for KV cache quantization – nearly lossless on Gemma 4 due to its hybrid attention architecture.
# Single-user optimized (saves ~11 GB on 31B)
./llama-cli -m gemma-4-31b-it-Q4_K_M.gguf -np 1 \
--cache-type-k q4_0 --cache-type-v q4_0
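If you'd rather serve than chat in a terminal, llama-server accepts the same model and cache-type flags and exposes an OpenAI-compatible /v1/chat/completions endpoint (port 8080 by default). A minimal client sketch, assuming a llama-server instance is already running locally – the model name is just a label here, since the server answers with whichever model it loaded:

import json, urllib.request

payload = {
    "model": "gemma-4-26b-a4b-it",       # label only; llama-server serves the loaded model
    "messages": [{"role": "user", "content": "Summarize Gemma 4's lineup in one sentence."}],
    "max_tokens": 128,
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
print(json.loads(urllib.request.urlopen(req).read())["choices"][0]["message"]["content"])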
Update your build: The initial release had KV cache bloat and chat template bugs (PRs #21326, #21343). These are fixed in recent builds. Pull the latest llama.cpp or wait for the next Ollama release.
Unsloth GGUFs
Already on HuggingFace with Dynamic quantization (UD-) for all variants. These use per-layer variable precision for better quality at the same file size. Look for unsloth/gemma-4-26B-A4B-it-GGUF and similar.
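A quick way to fetch a single quant without cloning the whole repo is huggingface_hub. The repo name follows the naming above; the exact filename is a guess – check the repo's file listing before running this:

from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="unsloth/gemma-4-26B-A4B-it-GGUF",
    filename="gemma-4-26b-a4b-it-UD-Q4_K_M.gguf",   # hypothetical filename; verify on the repo page
)
print(path)   # local cache path you can point llama.cpp or LM Studio at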
Gemma 4 vs Qwen 3.5: should you switch?
This is the question everyone’s asking. Qwen 3.5 has been the local AI default for weeks. Does Gemma 4 change that?
26B-A4B MoE vs Qwen 3.5 35B-A3B MoE
| | Gemma 4 26B-A4B | Qwen 3.5 35B-A3B |
|---|---|---|
| Active params | 3.8B | 3B |
| VRAM (Q4) | ~17 GB | ~17 GB |
| LMArena ELO | 1441 | Higher (ranked above) |
| Context | 256K | 131K |
| Vision | Yes | Yes |
| Audio | No (coming) | No |
| License | Apache 2.0 | Apache 2.0 |
Similar VRAM, similar speed. After a week of community testing, the verdict is forming: Qwen 3.5 still leads on coding (Codeforces ELO 2028 vs 1718 for the MoE) and multilingual tasks (201 languages, 250K token vocabulary). Gemma 4 wins on context length (256K vs 131K) and Apache 2.0 licensing. Vision is a wash – Gemma 4 benchmarks higher but Qwen 3.5 is more reliable for structured OCR in practice.
Speed note: on some hardware (notably RTX 5060 Ti 16GB), Gemma 4 26B-A4B runs at only ~11 tok/s vs 60+ tok/s for Qwen 3.5 35B-A3B. This appears to be caused by Gemma 4’s heterogeneous attention head dimensions forcing slower kernel fallbacks. On RTX 4090/5090, the gap is much smaller.
31B dense vs Qwen 3.5 27B dense
Both fit on a 24GB card at Q4 – but Gemma 4’s KV cache bloat limits usable context to ~64K on 24GB, while Qwen 3.5 27B fits ~190K context on a 32GB RTX 5090. Gemma 4 31B scores higher on AIME (89.2%) and Codeforces ELO (2150 vs 1899). Qwen 3.5 27B wins on SWE-bench coding (72.4%), HLE with tools (48.5% vs 26.5%), and multilingual. Extended testing shows Qwen producing “more architecturally sound solutions” in complex coding scenarios.
The context window gap is the practical differentiator. If you need long context on limited VRAM, Qwen 3.5 is dramatically more efficient.
E4B vs Qwen 3.5 9B at 8GB
The E4B has fewer effective parameters (4.5B vs 9B) but adds audio and video input that Qwen doesn’t have. For pure text quality at the 8GB tier, Qwen 3.5 9B wins. For multimodal use cases on constrained hardware, Gemma 4 E4B is more capable.
Community speed benchmarks
| Hardware | Model | Generation (tok/s) |
|---|---|---|
| RTX 5090 | 26B-A4B Q4, 4K ctx | 180 |
| RTX 4090 | 26B-A4B Q4_K_XL | ~150 |
| RTX 3090 | 26B-A4B Q4_K_M, 4K ctx | 119 |
| RTX 3090 | 31B Q4, 4K ctx | 34 |
| RTX 5060 Ti | 26B-A4B | ~11 (kernel fallback issue) |
| M4 Pro 48GB | E4B | 54 |
| Raspberry Pi 5 | E4B | ~2.9 |
| DGX Spark | 26B-A4B | 23.7 |
The Apache 2.0 switch
This might matter more than the benchmarks.
Gemma 1 through 3 used Google’s custom “Gemma Terms of Use” license. It had vague “Harmful Use” restrictions, limits on redistribution, and Google could update the terms whenever they wanted. Legal teams at companies considering Gemma for products would look at that license, look at Qwen’s Apache 2.0, and pick Qwen.
Gemma 4 under Apache 2.0 removes all of that. No custom clauses. Full commercial use. Modify, redistribute, deploy however you want. VentureBeat called the license change “the most consequential commercial signal in the launch.”
For hobbyists, this doesn’t change much – you were running Gemma anyway. For anyone building a product, this puts Gemma 4 back on the table.
Who should care
If you’re happy with Qwen 3.5: Stay. After a week of community testing, Qwen 3.5 still leads on coding, multilingual, and VRAM efficiency for long context. Gemma 4 is competitive but not a clear upgrade for most text-only workflows.
If you need multimodal on consumer hardware: Gemma 4 E4B with vision and audio on 8GB VRAM is genuinely new. No other open model at this size does text + image + video + audio. But set realistic expectations on vision quality – increase the token budget to 560+ for OCR tasks.
If you avoided Gemma because of the license: Apache 2.0 changes everything. Gemma 4 is now a first-class option for commercial deployment.
If you have 16-24GB VRAM: The 26B-A4B MoE at 119-150 tok/s on RTX 3090/4090 is fast. At 17GB Q4, it fits on 24GB cards with room for context. Be aware of the KV cache bloat on the 31B dense – use -np 1 and KV cache quantization. Worth testing alongside Qwen 3.5 35B-A3B on your specific tasks.
If you’re on edge hardware: The E2B runs on a Raspberry Pi 5 at ~3 tok/s. The E4B hits 54 tok/s on an M4 Pro. These are real options for embedded and mobile AI.
Bottom line
Gemma 4 is Google finally getting serious about open models. The quality is competitive with Qwen 3.5, the multimodal story is broader (audio!), the license is fixed, and framework support was ready at launch.
After a week of community testing, the picture is clear: Qwen 3.5 wins on coding, multilingual, and VRAM efficiency for long context. Gemma 4 wins on context window size (256K), multimodal breadth, and Apache 2.0 licensing. The KV cache bloat on the 31B is a real limitation that narrows its usable context on consumer GPUs – but the 26B-A4B MoE avoids most of that pain.
The 26B-A4B MoE remains the model to try first. It fits on a 24GB card, runs fast (119 tok/s on RTX 3090), handles vision, and has 256K context. On llama.cpp, add -np 1 and KV cache quantization for best results; on Ollama, just pull it:
ollama run gemma4:26b
Related Guides
- TurboQuant: 6x KV Cache Compression – essential for Gemma 4 31B at long context
- Gemma Models Guide (Gemma 1-3)
- Qwen 3.5 Local AI Guide
- How Much VRAM Do You Need?
- GPU Buying Guide
- Quantization Explained