Mixtral 8x7B & 8x22B VRAM Requirements
More on this topic: VRAM Requirements Guide · Mistral & Mixtral Guide · Quantization Explained · What Can You Run on 24GB VRAM
Mixtral models have some of the most confusing VRAM requirements in local AI. “8x7B” sounds like it should need 7B worth of memory. It doesn’t. It needs closer to 47B worth. And “8x22B” isn’t a 22B model: it’s 141B parameters that all need to live in VRAM simultaneously.
This guide gives you exact numbers at every quantization level so you can figure out whether your GPU can actually run these models, or whether you’re better off with a dense alternative.
Why MoE VRAM Is Confusing
How Mixture of Experts Works
Mixtral uses a Mixture of Experts (MoE) architecture. Instead of one monolithic feedforward network, each transformer layer has 8 separate “expert” feedforward networks. For each token, a router selects 2 experts to process it; the other 6 sit idle.
This means Mixtral 8x7B has 46.7 billion total parameters but only activates 12.9 billion per token. Inference speed scales with active parameters, so it runs roughly as fast as a 13B dense model.
Here’s the catch that trips everyone up: all 46.7B parameters must be loaded into VRAM. The router can’t predict which experts it’ll need for the next token, so every expert stays resident in memory. You’re paying 47B worth of VRAM for 13B worth of compute.
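To make the routing step concrete, here’s a toy sketch of top-2 selection for a single token (NumPy, random placeholder weights, illustrative only; the real model repeats this inside every transformer layer, and this is not Mixtral’s actual implementation):

```python
import numpy as np

def moe_layer(x, router_w, experts, top_k=2):
    """Toy top-k MoE feedforward for one token: run only the best-scoring experts."""
    logits = x @ router_w                      # one router score per expert
    top = np.argsort(logits)[-top_k:]          # indices of the top_k experts
    gate = np.exp(logits[top])
    gate /= gate.sum()                         # softmax over the selected experts only
    # Only the chosen experts do any compute, but all of them must stay in memory.
    return sum(w * experts[i](x) for w, i in zip(gate, top))

rng = np.random.default_rng(0)
hidden, n_experts = 16, 8
experts = [lambda x, W=rng.normal(size=(hidden, hidden)): np.tanh(W @ x)
           for _ in range(n_experts)]          # stand-ins for the 8 expert FFNs
router_w = rng.normal(size=(hidden, n_experts))
print(moe_layer(rng.normal(size=hidden), router_w, experts))
```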
The Numbers That Matter
| | Mixtral 8x7B | Mixtral 8x22B |
|---|---|---|
| Total parameters | 46.7B | 141B |
| Active per token | 12.9B (2 of 8 experts) | 39B (2 of 8 experts) |
| Context window | 32K | 64K |
| VRAM behavior | Loads like a ~47B dense model | Loads like a ~141B dense model |
| Speed behavior | Runs like a ~13B dense model | Runs like a ~39B dense model |
This is the fundamental tradeoff of MoE: you get the speed of a smaller model and the quality of a larger one, but the VRAM cost of the larger one.
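If you want to see where 46.7B total and 12.9B active come from, here’s a back-of-envelope parameter count in Python using the published Mixtral 8x7B shapes (hidden size 4096, FFN size 14336, 32 layers, 32 query / 8 KV heads, 32K vocabulary); router and norm weights are tiny and ignored:

```python
hidden, ffn, layers, vocab = 4096, 14336, 32, 32000
q_heads, kv_heads, head_dim = 32, 8, 128

attn_per_layer = hidden * (q_heads * head_dim)        # Q projection
attn_per_layer += 2 * hidden * (kv_heads * head_dim)  # K and V projections (GQA)
attn_per_layer += (q_heads * head_dim) * hidden       # output projection
expert_ffn = 3 * hidden * ffn                         # gate, up, and down projections
embeddings = 2 * vocab * hidden                       # input embeddings + LM head

total = layers * (attn_per_layer + 8 * expert_ffn) + embeddings
active = layers * (attn_per_layer + 2 * expert_ffn) + embeddings

print(f"total:  {total / 1e9:.1f}B")   # ~46.7B -- all of it must sit in VRAM
print(f"active: {active / 1e9:.1f}B")  # ~12.9B -- what each token actually computes with
```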
Mixtral 8x7B VRAM Requirements
Every quantization level, with real file sizes from GGUF builds:
| Quantization | Bits per Weight | Model Size | VRAM Needed* | Quality Impact |
|---|---|---|---|---|
| FP16 | 16 | ~93 GB | ~95 GB | Baseline (reference) |
| Q8_0 | 8 | 49.6 GB | ~52 GB | Negligible loss |
| Q6_K | 6 | 38.4 GB | ~41 GB | Minimal |
| Q5_K_M | 5 | 32.2 GB | ~35 GB | Minor |
| Q4_K_M | 4 | 26.4 GB | ~29 GB | Noticeable on complex tasks |
| Q3_K_M | 3 | 20.4 GB | ~23 GB | Significant degradation |
| Q2_K | 2 | 15.6 GB | ~18 GB | Severe; not recommended |
*VRAM needed = model size + ~2-3GB for KV cache (4K context) + framework overhead. Longer context adds more.
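That footnote converts directly into a quick “will it fit” check. A minimal sketch: the 2.5GB overhead figure is a rough assumption, and the model and KV cache sizes come from the tables in this guide:

```python
def fits_in_vram(model_gb: float, kv_cache_gb: float, vram_gb: float,
                 overhead_gb: float = 2.5) -> bool:
    """Rough check: model weights + KV cache + framework overhead vs. your VRAM."""
    needed = model_gb + kv_cache_gb + overhead_gb
    print(f"Estimated VRAM needed: {needed:.1f} GB (budget: {vram_gb} GB)")
    return needed <= vram_gb

# Mixtral 8x7B at Q4_K_M (26.4 GB file) with a 4K-context KV cache (~0.5 GB)
fits_in_vram(26.4, 0.5, vram_gb=24)   # ~29.4 GB needed -> does not fit on 24 GB
fits_in_vram(26.4, 0.5, vram_gb=32)   # fits on a 32 GB card
```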
What GPU Can Run 8x7B?
| GPU Tier | Best Quantization | Context Limit | Verdict |
|---|---|---|---|
| 8GB (RTX 4060) | None | N/A | Won’t fit. Not even Q2_K. |
| 12GB (RTX 3060) | None | N/A | Won’t fit at any useful quantization. |
| 16GB (RTX 4060 Ti 16GB) | Q2_K (needs partial CPU offload) | ~2K tokens | Technically possible, practically useless. Severe quality loss. |
| 24GB (RTX 3090/4090) | Q3_K_M | ~4K tokens | Tight fit with degraded quality. Workable for short conversations. |
| 32GB (RTX 5090) | Q4_K_M | ~8K tokens | The real sweet spot. Good quality with decent context. |
| 48GB (dual 24GB / A6000) | Q6_K | ~16K tokens | Comfortable. Full quality with room for context. |
The honest assessment: Mixtral 8x7B is awkward on consumer hardware. It’s too big for 24GB at good quality, and if you’re buying 32GB+ hardware, dense models like Qwen 3 32B give you better quality at Q4_K_M (~20GB) with VRAM to spare.
Mixtral 8x22B VRAM Requirements
The big one. Here’s what you’re looking at:
| Quantization | Bits per Weight | Model Size | VRAM Needed* | Quality Impact |
|---|---|---|---|---|
| FP16 | 16 | ~282 GB | ~285 GB | Baseline |
| Q8_0 | 8 | 149 GB | ~152 GB | Negligible loss |
| Q6_K | 6 | 116 GB | ~119 GB | Minimal |
| Q5_K_M | 5 | 100 GB | ~103 GB | Minor |
| Q4_K_M | 4 | 85.6 GB | ~88 GB | Noticeable on complex tasks |
| Q3_K_M | 3 | 67.8 GB | ~71 GB | Significant degradation |
| Q2_K | 2 | 52.1 GB | ~55 GB | Severe; not recommended |
*Assumes 4K context. 8x22B’s 64K context window at full length adds substantially more.
What GPU Can Run 8x22B?
| GPU Setup | Best Quantization | Context Limit | Verdict |
|---|---|---|---|
| 24GB single | None | N/A | Not happening. |
| 2x 24GB (48GB total) | Q2_K (partial CPU offload) | ~4K tokens | Barely, and only by spilling layers to system RAM. Quality is bad. |
| 48GB (A6000 / L40) | Q2_K-Q3_K_M (partial CPU offload) | ~4K tokens | Tight even with offload. Degraded quality. |
| 2x 48GB (96GB total) | Q4_K_M | ~8K tokens | The minimum serious setup. |
| 80GB (A100 / H100) | Q3_K_M | ~16K tokens | Best single-GPU option; Q4_K_M needs ~88GB and spills past 80GB. |
| 128GB+ | Q6_K+ | ~32K tokens | Full quality with good context. |
Mixtral 8x22B is a datacenter model that people try to run on consumer hardware. Unless you have 96GB+ of GPU memory across multiple cards, look at Llama 3.1 70B instead: it fits on 2x 24GB GPUs at Q4_K_M (~40GB) and delivers comparable quality.
How Context Length Eats Your VRAM
The tables above assume short context (~4K tokens). But Mixtral 8x7B supports 32K and 8x22B supports 64K. The KV cache grows linearly with context length and eats into your available VRAM.
Here’s the key insight: the attention layers that produce the KV cache are shared across all experts, so the cache is sized by the model’s active footprint, not its total parameter count. For 8x7B the KV cache grows like a ~13B model’s; for 8x22B, like a ~39B model’s. This is one area where MoE actually helps.
KV Cache VRAM by Context Length
| Context Length | Mixtral 8x7B KV Cache | Mixtral 8x22B KV Cache |
|---|---|---|
| 2K tokens | ~0.3 GB | ~0.8 GB |
| 4K tokens | ~0.5 GB | ~1.5 GB |
| 8K tokens | ~1.0 GB | ~3.0 GB |
| 16K tokens | ~2.0 GB | ~6.0 GB |
| 32K tokens | ~4.0 GB | ~12.0 GB |
| 64K tokens | N/A | ~24.0 GB |
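You can reproduce the 8x7B column yourself: an FP16 KV cache stores a key and a value vector for every layer at every position, so it costs 2 x layers x KV heads x head dimension x 2 bytes per token. A minimal sketch using the published Mixtral 8x7B attention shapes (32 layers, 8 KV heads, head dimension 128):

```python
def kv_cache_gb(n_tokens: int, n_layers: int = 32, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """FP16 KV cache size in GB: keys + values for every layer at every position."""
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return n_tokens * bytes_per_token / 1024**3

print(kv_cache_gb(4_096))    # ~0.5 GB at 4K context
print(kv_cache_gb(32_768))   # ~4.0 GB at 32K context
```

Some runtimes (llama.cpp, for example) can also quantize the KV cache to 8-bit, roughly halving these figures at a small quality cost.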
For a deeper dive into how context length affects VRAM, see our context length explainer.
At 32K context on Mixtral 8x7B with Q4_K_M, you need ~26GB (model) + ~4GB (KV cache) + ~2GB (overhead) = ~32GB total. That’s the full capacity of an RTX 5090 with nothing to spare. On a 24GB card, you’re limited to roughly 4K-8K tokens before running out of memory.
Mixtral vs Dense Models: The Honest Comparison
This is where people should pay close attention. MoE models made sense when there weren’t good dense alternatives at the same quality tier. In 2026, the landscape has changed.
Mixtral 8x7B vs Dense Alternatives
Mixtral 8x7B performs roughly like a dense 30B model on benchmarks. Here’s how the VRAM compares:
| Model | Quality Tier | VRAM at Q4_K_M | Fits 24GB? |
|---|---|---|---|
| Mixtral 8x7B | ~30B dense | ~29 GB | No |
| Qwen 3 32B | 32B dense | ~20 GB | Yes, with room |
| DeepSeek-R1-Distill-32B | 32B dense | ~20 GB | Yes, with room |
| Llama 3.3 70B | 70B dense | ~40 GB | No (needs 2x 24GB) |
Qwen 3 32B and DeepSeek-R1-Distill-32B both fit on a single 24GB GPU at Q4_K_M with VRAM left for context. They match or beat Mixtral 8x7B on most benchmarks. Unless you specifically need Mixtral’s architecture for a fine-tune or have legacy infrastructure, dense 32B models are the better choice on the same hardware.
For more on these alternatives, see our Qwen Models Guide and DeepSeek Models Guide.
Mixtral 8x22B vs Dense Alternatives
Mixtral 8x22B competes with Llama 3.1 70B. Its main advantage is MoE inference speed rather than context: its 64K native window is actually half of what Llama 3.1 offers.
| Model | Quality Tier | VRAM at Q4_K_M | Context |
|---|---|---|---|
| Mixtral 8x22B | ~70B dense | ~88 GB | 64K |
| Llama 3.1 70B | 70B dense | ~40 GB | 128K |
| Qwen 2.5 72B | 72B dense | ~42 GB | 128K |
Llama 3.1 70B needs less than half the VRAM and has double the context window. On raw benchmarks, it’s competitive or better. The only scenario where 8x22B wins is if you’ve invested in the infrastructure and need its specific MoE speed characteristics: inference speed roughly matches a 39B dense model, which is faster than a 70B dense model on the same hardware.
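To see why, treat single-stream decoding as memory-bandwidth-bound: a crude ceiling on tokens per second is memory bandwidth divided by the bytes of weights read per token, and an MoE reads only its active experts. A rough sketch; the 2,000 GB/s bandwidth and ~4.8 effective bits per weight are assumptions, and the numbers ignore KV cache traffic, multi-GPU communication, and kernel efficiency, so real throughput is lower:

```python
def decode_ceiling_tok_s(active_params_b: float, bits_per_weight: float,
                         bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/sec if every active weight is read once per token."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

print(decode_ceiling_tok_s(39, 4.8, 2000))   # Mixtral 8x22B: ~85 tok/s ceiling
print(decode_ceiling_tok_s(70, 4.8, 2000))   # Llama 3.1 70B:  ~48 tok/s ceiling
```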
Best Quantization Sweet Spots
For Mixtral 8x7B
| Your Hardware | Recommendation |
|---|---|
| 24GB GPU (RTX 3090/4090) | Q3_K_M with 4K context. Functional but not great. Honestly, run Qwen 3 32B at Q4_K_M instead: better quality, better fit. |
| 32GB GPU (RTX 5090) | Q4_K_M with 8K context. The proper sweet spot if you want MoE. |
| 48GB (A6000 or dual 24GB) | Q5_K_M or Q6_K. Full quality with room for long context. |
For Mixtral 8x22B
| Your Hardware | Recommendation |
|---|---|
| 2x 24GB (48GB total) | Don’t. Run Llama 3.1 70B at Q4_K_M instead. |
| 48GB single card | Q3_K_M with very limited context. Marginal experience. |
| 2x 48GB (96GB total) | Q4_K_M with 8K context. First viable setup. |
| 80GB single card (A100) | Q4_K_M with 16K context. Comfortable. |
When Mixtral Still Makes Sense
After all that, you might wonder why anyone runs Mixtral in 2026. There are a few valid reasons:
Inference speed. MoE’s core advantage is speed at quality. Mixtral 8x7B gives you 30B-class quality at 13B-class speed. If you have the VRAM and need fast responses, that’s a real benefit. On a 48GB card, 8x7B at Q6_K generates tokens noticeably faster than a dense 30B at the same quantization.
Fine-tuned variants. There’s a large ecosystem of Mixtral fine-tunes for specific tasks: roleplay, coding, instruction following. If a fine-tune exists for your exact use case and nothing comparable exists for dense models, that’s a reason.
8x22B’s 64K native context. If you need long context on self-hosted infrastructure and have the hardware, 8x22B at Q4 handles it well. But Llama 3.1 70B at 128K context on less VRAM is usually the better choice.
Existing deployments. If your setup already runs Mixtral and works, there’s no urgent reason to migrate. The model hasn’t gotten worse โ the alternatives have just gotten better.
The Bottom Line
Mixtral 8x7B needs ~26-29GB at Q4_K_M. It won’t fit on a 24GB GPU with usable context. If you have 32GB+ VRAM and want MoE speed benefits, it’s a solid choice. If you have 24GB and want 30B-class quality, skip Mixtral and run Qwen 3 32B.
Mixtral 8x22B needs ~86-88GB at Q4_K_M. It’s a datacenter model. If you don’t have 96GB+ of GPU memory, run Llama 3.1 70B instead: half the VRAM, comparable quality, more context.
The MoE architecture is genuinely clever, but in 2026, dense models have closed the quality gap while being far more VRAM-efficient. Choose Mixtral for speed-at-quality when you have the memory budget. Choose dense models when VRAM is your constraint; for most people reading this, it is.
For GPU buying recommendations at every price point, see our GPU Buying Guide and Used RTX 3090 Buying Guide.