Llama 4 Guide: Running Scout and Maverick Locally (2026)
๐ More on this topic: Llama 3 Guide ยท VRAM Requirements ยท GPU Buying Guide ยท What Can You Run on 24GB
What’s New (May 2026)
Three months after publish, three updates change what this article should tell you.
Llama 4 Scout has an AAI score: 14. Artificial Analysis classifies Scout as non-reasoning (class median: 13). Qwen 3.6 27B Reasoning sits at 46 in a different class (median: 15). The scores aren’t directly comparable โ Scout’s class doesn’t include chain-of-thought time, Qwen’s does โ but the gap is informative. If your workload is reasoning-heavy (code, math, multi-step analysis), Qwen 3.6 27B is the stronger pick at a quarter of the VRAM. Scout’s unique features remain native multimodal input and the 10M context window, which no Qwen 3.6 variant matches.
Behemoth is on pause. Meta paused the ~2T-parameter teacher model in May 2026 amid internal capability concerns. The model isn’t cancelled โ it’s still serving Scout and Maverick via codistillation โ but there’s no public release date. For local AI buyers, treat Behemoth as effectively unavailable.
The MoE peer landscape filled in. Gemma 4 26B-A4B shipped in early April: 3.8B active across 128 experts (vs Scout’s 17B across 16). AAI 31 in the reasoning class. DeepSeek V4 also arrived. The original article framed Scout against dense Llama 3 and Qwen 3; the more honest May 2026 comparison is Scout vs Gemma 4 26B-A4B vs Qwen 3.6 35B-A3B โ three different MoE strategies at three different VRAM tiers. The comparison table is in the MoE Tradeoff section below.
For the broader Qwen ecosystem context, see our Qwen 3.7 Preview coverage โ Qwen 3.7 Max scored 57 AAI in May, with open 27B/35B weights announced for release.
Llama 4 Scout has 109 billion parameters but only uses 17 billion per token. Sounds amazing โ until you check what it actually needs to run locally.
Meta’s first MoE (Mixture of Experts) models are a genuine architecture shift from the dense Llama 3 family. Scout and Maverick activate only a fraction of their total parameters on each forward pass, which means faster inference than you’d expect from a 109B or 400B model. But MoE has a catch: you still need to load all the weights into memory, even though most of them sit idle on any given token.
This guide covers what you can actually run, what hardware you need, and when you should stick with Llama 3 instead.
The Llama 4 Family
| Model | Total Params | Active Params | Experts | Context | Ollama |
|---|---|---|---|---|---|
| Scout | 109B | 17B | 16 | 10M tokens | ollama run llama4:scout |
| Maverick | 400B | 17B | 128 | 1M tokens | ollama run llama4:maverick |
| Behemoth | ~2T+ | TBD | TBD | TBD | Not released |
All three use Mixture of Experts. Only Scout and Maverick are available. Behemoth was paused in May 2026 amid Meta’s internal capability concerns. The model isn’t cancelled โ it’s still serving as teacher for Scout and Maverick distillation โ but there’s no public release date. For local AI buyers, treat Behemoth as effectively unavailable.
What Changed from Llama 3
| Llama 3 | Llama 4 | |
|---|---|---|
| Architecture | Dense (all params active) | MoE (17B active per token) |
| Vision | Text-only (needed LLaVA adapter) | Native multimodal โ early fusion |
| Max context | 128K tokens | 10M tokens (Scout) |
| Languages | 8 | 12 (added Arabic, Hindi, Thai, Vietnamese) |
| Training data | 15T tokens | 40T tokens (Scout) |
| Smallest useful | 8B | 109B total (17B active) |
The jump from dense to MoE is the headline. Llama 3.3 70B activates all 70B parameters on every token. Scout activates 17B out of 109B โ faster inference per token, but you still carry the full 109B in memory.
VRAM Requirements
Here’s the reality check. MoE models need VRAM proportional to total parameters, not active parameters.
Scout (109B total, 17B active)
| Precision | VRAM Needed | Hardware | Practical? |
|---|---|---|---|
| BF16 (full) | ~216GB | 4x H100 | Datacenter only |
| Q8 | ~109GB | 2x H100 or Mac Studio 192GB | Prosumer |
| Q4 | ~55GB | 2x RTX 3090, Mac M4 Max 128GB | Enthusiast |
| 1.78-bit (Unsloth) | ~24GB | Single RTX 3090/4090 | Tight but works |
Maverick (400B total, 17B active)
| Precision | VRAM Needed | Hardware | Practical? |
|---|---|---|---|
| BF16 (full) | ~800GB | 7x H200 | Datacenter |
| Q4 | ~200GB | 3x H100 or exotic multi-GPU | Not for hobbyists |
| 1.78-bit (Unsloth) | ~100GB | 2x 48GB GPUs | Extreme enthusiast |
The bottom line: Scout at aggressive quantization is the only Llama 4 model that fits on consumer hardware. Maverick is a datacenter model for local purposes.
Hardware Reality Check
This is where it gets honest. If you’re reading InsiderLLM, you probably have a single GPU, maybe two. Here’s what actually works:
| Your Hardware | Can You Run Scout? | Notes |
|---|---|---|
| RTX 3060 12GB | No | Not even close. Stick with 7-8B models |
| RTX 4060 Ti 16GB | No | Still too small. 14B-class is your sweet spot |
| RTX 5070 Ti 16GB | No | 16GB GDDR7 doesn’t change the math โ Scout still needs 1.78-bit on a 16GB card. Supply constrained in May 2026. |
| RTX 5080 16GB | No | Same 16GB constraint as 5070 Ti. RTX 5080 SUPER (24GB) rumored October 2026 โ wait if you can. |
| RTX 3090 24GB | 1.78-bit only | ~20 tok/s. Usable for chat, degraded quality |
| RTX 4090 24GB | 1.78-bit only | ~20-25 tok/s. Same constraint, slightly faster |
| RTX 5090 32GB | Q4 viable | 32GB GDDR7 at 1792 GB/s unlocks Scout Q4 with context headroom. No firsthand tok/s yet. Supply constrained in May 2026. |
| 2x RTX 3090 (48GB) | Q4 works | Good experience. ~15-20 tok/s with tensor parallel |
| Mac M4 Pro 48GB | 1.78-bit only | Similar to single 24GB GPU |
| Mac M4 Max 128GB | Q4 works great | Unified memory shines here at 546 GB/s. MLX support. |
| Mac M5 Pro 64GB | 1.78-bit or Q4 tight | March 2026 release. 307 GB/s โ bandwidth-limited vs Max for MoE inference. |
| Mac M5 Max 128GB | Q4 works great | March 2026 release. 614 GB/s. Best portable Scout setup. |
| CPU only / mini PC | No | These models need real hardware |
For most budget builders: Llama 3.3 70B or Qwen 3.6 27B still deliver better results per dollar of VRAM. Scout only makes sense if you have 24GB+ AND specifically want native multimodal or the massive context window.
โ Check what fits your hardware with our Planning Tool.
Running Scout (Ollama and vLLM)
If you have the hardware:
# Pull and run Scout (default quantization)
ollama run llama4:scout
# Pull a specific quantization
ollama pull llama4:scout:q4_K_M # ~55GB, needs 2+ GPUs
For the 1.78-bit quantization that fits on 24GB, check Unsloth’s Llama 4 page โ they provide GGUF files optimized for single-GPU inference.
Multimodal (Images)
Scout handles images natively โ no adapter needed:
# In Ollama chat, reference an image
ollama run llama4:scout
>>> Describe this image: /path/to/photo.jpg
This is a genuine upgrade over Llama 3.2 Vision 11B, which was a bolted-on vision adapter. Scout’s early fusion architecture means the model reasons about text and images together from the ground up.
Production Serving with vLLM
For production serving rather than local single-user use, vLLM has supported Llama 4 since v0.8.3 (Day-0 release in collaboration with Meta). On 8xH100, vLLM serves Scout at full 1M context and Maverick at ~430K. The --kv-cache-dtype fp8 flag roughly doubles usable context with negligible accuracy cost. vLLM is overkill for most local AI buyers โ if you’re running Scout on a single workstation, stick with Ollama. vLLM matters when you’re serving multiple concurrent requests off rented or owned datacenter hardware.
Benchmarks: Scout vs the Field
Scout’s active parameter count (17B) puts it in an interesting spot โ competing against much smaller models on efficiency while targeting larger models on quality.
Quality Benchmarks
| Model | Active Params | MMLU | MATH | IFEval | GPQA |
|---|---|---|---|---|---|
| Llama 4 Scout | 17B | 79.6 | 50.3 | โ | โ |
| Llama 3.3 70B | 70B | 79.3 | 41.6 | 92.1 | โ |
| Llama 3.1 8B | 8B | 69.4 | 30.6 | 80.4 | โ |
| Qwen 3.6 27B | 27B | โ | โ | โ | โ |
Scout matches Llama 3.3 70B on MMLU (79.6 vs 79.3) while activating only 17B parameters per token โ a 4x efficiency gain. It crushes Llama 3.3 on math (50.3 vs 41.6 MATH).
But Qwen 3 has overtaken both on raw benchmarks. If you’re choosing purely on text quality and don’t need vision, Qwen 3.6 27B fits in 24GB at Q4 and benchmarks higher.
AAI as of May 2026. The composite Artificial Analysis Intelligence Index tells a different story than MMLU snapshots. Llama 4 Scout: 14 (non-reasoning class, median: 13). Qwen 3.6 27B Reasoning: 46. Qwen 3.6 35B-A3B Reasoning: 43. Gemma 4 26B-A4B Reasoning: 31. Reasoning-class scores aren’t directly comparable to Scout’s non-reasoning score โ different methodology, different scoring windows โ but the spread shows the reasoning-class models perform consistently above Scout on the AAI composite. See the MoE comparison table below for the full breakdown.
Inference Speed (Estimated, Consumer Hardware)
| Setup | Model | Quant | Speed |
|---|---|---|---|
| RTX 3090 24GB | Scout | 1.78-bit | ~20 tok/s |
| RTX 4090 24GB | Scout | 1.78-bit | ~20-25 tok/s |
| 2x RTX 3090 48GB | Scout | Q4_K_M | ~15-20 tok/s |
| RTX 3090 24GB | Llama 3.1 8B | Q4_K_M | ~67 tok/s |
| RTX 3090 24GB | Llama 3.3 70B | Q4 (offload) | ~8 tok/s |
The MoE architecture helps here: despite being a 109B model, Scout generates tokens at speeds closer to a 17B dense model. You still pay the VRAM tax for all 109B weights, but inference is fast once they’re loaded.
When to Use Llama 4 vs Stay on Llama 3
Use Scout if:
- You need vision + text in one model (not a bolted-on adapter)
- You have 24GB+ VRAM and accept 1.78-bit quality, OR 48GB+ for Q4
- You want the 10M token context window for massive document analysis
- You’re on a Mac with 64GB+ unified memory
Stay on Llama 3.1 8B if:
- You have less than 16GB VRAM
- You don’t need vision
- You want the fastest possible tok/s on modest hardware
Stay on Llama 3.3 70B if:
- You need the best Llama text quality and have the VRAM
- You’re already running it and it works for your use case
- You need the massive fine-tune ecosystem (thousands of community models)
Consider Qwen 3.6 27B instead if:
- You want top-tier text quality on 24GB VRAM
- You don’t need multimodal
- You care more about reasoning-class AAI scores than the Llama ecosystem
- You want to pair with MTP speculative decoding for 60+ tok/s on RTX 3090
The MoE Tradeoff, Explained
Mixture of Experts is not free magic. Here’s the deal:
The win: Only 17B parameters activate per token, so inference (token generation) is as fast as a 17B dense model. You get 109B-level quality at 17B-level speed.
The cost: All 109B parameters must be loaded into VRAM. The model is huge on disk and in memory, even though most weights sit idle on any given token. This is why Scout needs ~55GB at Q4 despite “only” using 17B params.
The implication for local users: MoE models are bandwidth-bound, not compute-bound. The bottleneck is loading all those expert weights, not doing math on them. This means VRAM capacity matters more than GPU compute speed โ which is why a Mac with 128GB unified memory (lower bandwidth but massive capacity) can outperform a single RTX 4090 (fast but only 24GB) for these models.
How Scout’s MoE Compares to Other Local MoE Models
| Model | Total | Active | Experts | Context | AAI | Class |
|---|---|---|---|---|---|---|
| Llama 4 Scout | 109B | 17B | 16 | 10M | 14 | Non-reasoning |
| Gemma 4 26B-A4B | 26B | 3.8B | 128 | 256K | 31 | Reasoning |
| Qwen 3.6 35B-A3B | 36B | 3B | โ | 256K | 43 | Reasoning |
Three different MoE strategies at three VRAM tiers. Scout uses the fewest, largest experts (16 experts, 17B active per token) โ closer to “dense-shaped MoE.” Gemma 4 26B-A4B sits at the opposite extreme with 128 small experts and only 3.8B active. Qwen 3.6 35B-A3B routes 3B active across an unspecified expert count (Alibaba hasn’t published this detail), prioritizing reasoning-class quality at a small VRAM footprint.
Reasoning-class scores (Gemma 4, Qwen 3.6) aren’t directly comparable to Scout’s non-reasoning score, but the spread is informative: Gemma 4 26B-A4B and Qwen 3.6 35B-A3B both deliver substantially higher AAI in their class while requiring far less VRAM. Scout’s headline advantage stays the 10M context window and native multimodal โ features neither Gemma 4 nor Qwen 3.6 35B-A3B offers.
What About Maverick?
Maverick uses 128 experts instead of Scout’s 16, with the same 17B active parameters. At Q4, it needs ~200GB VRAM. At 1.78-bit, it’s ~100GB.
ollama run llama4:maverick # Only if you have serious hardware
For local users, Maverick is realistically out of reach unless you have a multi-GPU server or a maxed-out Mac Studio. If you’re in that territory, it exists and it’s good โ but that’s not most people reading this guide.
Getting Started
If you have a 24GB GPU and want to try Scout:
- Update Ollama:
ollama --version(needs a recent version for Llama 4 support) - Pull Scout:
ollama run llama4:scout - For 1.78-bit on 24GB: grab the Unsloth GGUF from HuggingFace and load manually
- Test with an image to see native multimodal in action
If you’re on a single GPU under 24GB, skip Llama 4 entirely. Llama 3.1 8B, Qwen 3.5 9B or 14B class, or Gemma 4 26B-A4B will give you a better experience on your hardware. Don’t chase parameter counts โ chase what actually fits.
Get notified when we publish new guides.
Subscribe โ free, no spam