
Mixtral 8x7B has 46.7 billion parameters but runs at the speed of a 13B model. That single sentence has confused more people than any other fact in local AI.

The confusion is understandable. “Runs like 13B” sounds like it needs 13B-level resources. It doesn’t. You need enough VRAM to hold all 46.7 billion parameters — the same as any dense 46B model. The speed benefit is real. The memory savings are not.

This is the Mixture of Experts architecture, and understanding it properly will save you from the most common hardware mistake in local AI: downloading an MoE model that doesn’t fit on your GPU.


What mixture of experts actually is

A standard “dense” model, like Llama 3.3 70B or Qwen 3 32B, activates every single parameter on every single token. If it has 70 billion parameters, all 70 billion do work each time the model processes a word.

MoE takes a different approach. Instead of one massive feed-forward network (FFN) layer, an MoE model has multiple smaller FFN layers called experts. A lightweight gating network (also called a router) looks at each incoming token and decides which experts should handle it. Only the selected experts activate. The rest sit idle.

Think of it like a hospital. A dense model is a single doctor who handles everything (cardiology, neurology, orthopedics) for every patient. An MoE model is a hospital with eight specialist doctors. Each patient only sees two of them, but the hospital still needs to pay all eight salaries and maintain all eight offices.

The core components

Every MoE model has three parts:

  1. Shared layers: attention layers, embeddings, and layer norms that process every token (same as a dense model)
  2. Expert FFN blocks: multiple parallel feed-forward networks, only some activated per token
  3. Gating network: a small learned router that scores each expert for each token and picks the top-k

The gating network is tiny, typically a single linear layer per MoE block. It takes the hidden state of the current token, produces a score for each expert, applies a softmax, and routes the token to the top-k experts. The outputs from the selected experts are combined using the gating scores as weights.

Token → Attention (shared) → Gating Network → [Expert 2, Expert 5] → Weighted Sum → Output
                                               (6 experts idle)

This routing happens independently for every token. One word might use experts 1 and 4. The next might use experts 3 and 7. The gating network learns during training which experts specialize in what, though the specialization is emergent, not manually designed.


Mixtral 8x7B: the concrete example

Mixtral 8x7B from Mistral AI is the model that brought MoE to consumer-accessible local AI. It’s the clearest example to study.

The numbers

Component | Value
Total parameters | 46.7B
Number of experts | 8 (per MoE layer)
Experts active per token | 2
Active parameters per token | ~12.9B
Shared parameters (attention, embeddings, norms) | ~1.6B
Expert FFN parameters | ~45.1B (8 × ~5.6B each)
Context window | 32K tokens

Each of the 32 transformer layers has 8 expert FFN blocks. On every token, the gating network picks the 2 best-matching experts. So you compute the shared stack (~1.6B params: attention, embeddings, norms) plus 2 of the 8 expert blocks (~5.6B each, so ~11.3B), totaling roughly 12.9B active parameters per forward pass.
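You can back out this split yourself from just the two headline numbers. A quick back-of-envelope derivation (it treats all non-expert weights as shared, so the figures are approximate):

```python
# Back out Mixtral's parameter split from its published headline numbers.
TOTAL_B = 46.7   # total parameters (billions)
ACTIVE_B = 12.9  # active parameters per token (billions)
NUM_EXPERTS = 8
TOP_K = 2

# Active = shared + TOP_K experts; Total = shared + NUM_EXPERTS experts.
# Subtracting: Total - Active = (NUM_EXPERTS - TOP_K) * per_expert.
per_expert_b = (TOTAL_B - ACTIVE_B) / (NUM_EXPERTS - TOP_K)
shared_b = TOTAL_B - NUM_EXPERTS * per_expert_b

print(f"per expert: ~{per_expert_b:.1f}B, shared: ~{shared_b:.1f}B")
# → per expert: ~5.6B, shared: ~1.6B
```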

Why it’s fast

Inference speed is dominated by how many parameters get computed per token. Mixtral computes ~13B parameters per token. A dense 13B model computes ~13B parameters per token. So they run at roughly the same speed, about 30-40 tokens/second on an RTX 3090 at Q4 quantization.

This is the MoE win: you get the quality of a model trained with 46.7B parameters, at the speed of a 13B model. During training, all those expert parameters learned from the data. The model has the knowledge capacity of 46.7B params. It just accesses that knowledge selectively.


The VRAM trap

MoE models need VRAM proportional to TOTAL parameters, not active parameters.

When you load Mixtral 8x7B, all 46.7 billion parameters go into memory. All 8 experts in every layer must be resident in VRAM, even though only 2 are used per token. Why? Because the gating network decides at runtime which 2 experts each token needs. The model can’t know in advance which experts will be called, so all of them must be ready.

The math that catches people

Here’s the scenario that plays out daily in r/LocalLLaMA:

  1. Someone reads “Mixtral 8x7B runs at 13B speed”
  2. They check their 16GB GPU and think “I can run 13B models, so Mixtral should fit”
  3. They download the Q4 GGUF (about 26GB)
  4. They get an out-of-memory error
  5. They’re confused because “it’s supposed to be like a 13B model”

It is not like a 13B model for memory. It is like a 46.7B model for memory. The Q4 quantized version needs approximately 26GB of VRAM, the same as any dense 46B model at Q4. At FP16, it needs about 93GB.
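The arithmetic behind those figures is just total parameters times bits per weight. A quick sanity check (the effective bit-widths here are rough assumptions, not exact GGUF rates):

```python
# Rough weight-memory estimate: total params × effective bits per weight.
# Bit-widths below are approximations for GGUF quants, not exact figures.
def weight_gb(total_params_b: float, bits_per_weight: float) -> float:
    return total_params_b * bits_per_weight / 8  # billions of params → GB

MIXTRAL_TOTAL_B = 46.7
print(round(weight_gb(MIXTRAL_TOTAL_B, 4.5), 1))   # ~Q4-class → 26.3 GB
print(round(weight_gb(MIXTRAL_TOTAL_B, 16.0), 1))  # FP16 → 93.4 GB
```

Note the function takes total parameters, never active parameters; plugging in 12.9B here is exactly the mistake described above.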

To put it bluntly:

What “Runs Like 13B” Means | What It Does NOT Mean
Inference speed of ~13B | VRAM usage of ~13B
Tokens per second of ~13B | Download size of ~13B
Compute per token of ~13B | Disk space of ~13B

A concrete comparison

Let’s say you have an RTX 4090 with 24GB VRAM:

  • Qwen 3 32B at Q4_K_M: ~20GB VRAM. Fits. All 32B parameters active per token. Excellent quality.
  • Mixtral 8x7B at Q4_K_M: ~26GB VRAM. Does NOT fit in 24GB. Only ~13B active per token.

The dense 32B model that uses every parameter fits on your GPU. The MoE model that only uses 13B parameters per token does not fit. This is the VRAM trap. The model that computes fewer parameters per token actually needs more memory.

Our Planning Tool correctly uses total parameters for MoE VRAM estimates, not active parameters. If your planner or calculator uses active params, throw it out.


MoE models available today

Here’s every MoE model you might consider running locally:

Model | Total Params | Active Params | Experts (Total/Active) | Context | License | Reality Check
Mixtral 8x7B | 46.7B | ~13B | 8/2 | 32K | Apache 2.0 | The OG consumer MoE. Needs ~26GB at Q4.
Mixtral 8x22B | 141B | ~39B | 8/2 | 64K | Apache 2.0 | Serious hardware. ~66GB at Q4.
DeepSeek V3 | 671B | ~37B | 256/8 | 128K | MIT | Multi-node territory. ~350GB at Q4.
DeepSeek R1 | 671B | ~37B | 256/8 | 128K | MIT | Reasoning variant of V3. Same hardware needs.
DBRX | 132B | ~36B | 16/4 | 32K | Databricks Open | 16 experts, 4 active. ~66GB at Q4.
Llama 4 Scout | 109B | ~17B | 16/1 | 10M | Llama 4 | ~55GB at Q4.
Llama 4 Maverick | 400B | ~17B | 128/1 | 1M | Llama 4 | ~200GB at Q4. Datacenter-class.
Qwen 1.5 MoE A2.7B | 14.3B | ~2.7B | 60/4 | 32K | Qwen | Small MoE experiment. ~8GB at Q4. Niche.

Notice the pattern: total parameters drive VRAM, not active parameters. DeepSeek V3 activates only 37B params per token (fewer than a dense 70B model) but needs 350GB at Q4. That’s more VRAM than most servers have.

For most of these models, the DeepSeek Models Guide and Mistral & Mixtral Guide have detailed setup instructions.


VRAM requirements: Mixtral at every quantization

Since Mixtral models are the most commonly run MoE models locally, here are precise VRAM estimates. These include model weights plus a reasonable KV cache overhead for 4K context.
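The KV-cache share of that overhead can be estimated directly from the architecture. Assuming Mixtral 8x7B's published config (32 layers, 8 KV heads via grouped-query attention, head dim 128), the cache is small at 4K context. Note that MoE does not inflate the KV cache: attention layers are shared, so it is the same size as a dense model with the same config.

```python
# KV cache bytes = 2 (K and V) × layers × kv_heads × head_dim × bytes × tokens.
# MoE does not multiply this: attention layers are shared across all experts.
def kv_cache_gb(layers, kv_heads, head_dim, context_tokens, bytes_per=2):
    return 2 * layers * kv_heads * head_dim * bytes_per * context_tokens / 1e9

# Assumed Mixtral 8x7B config: 32 layers, 8 KV heads (GQA), head dim 128
print(round(kv_cache_gb(32, 8, 128, 4096), 2))  # ~0.54 GB at 4K, FP16 cache
```

The rest of the overhead in the tables below covers activations and runtime buffers.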

Mixtral 8x7B (46.7B total)

Quantization | Model Size | VRAM Needed | Fits On
Q2_K | ~17 GB | ~19 GB | RTX 3090 24GB
Q3_K_M | ~21 GB | ~23 GB | RTX 3090 24GB (tight)
Q4_K_M | ~26 GB | ~28 GB | 2x 16GB GPUs, Mac 32GB+
Q5_K_M | ~32 GB | ~34 GB | Mac 36GB+, 2x 24GB GPUs
Q6_K | ~38 GB | ~40 GB | Mac 48GB+, 2x 24GB GPUs
Q8_0 | ~49 GB | ~52 GB | Mac 64GB+, A6000 48GB (offload)
FP16 | ~93 GB | ~96 GB | 2x A100, Mac 128GB

Mixtral 8x22B (141B total)

Quantization | Model Size | VRAM Needed | Fits On
Q2_K | ~52 GB | ~56 GB | 2x RTX 3090, Mac 64GB+
Q3_K_M | ~63 GB | ~67 GB | Mac 96GB+, A100 80GB
Q4_K_M | ~80 GB | ~85 GB | Mac 96GB+, A100 80GB (tight)
Q5_K_M | ~96 GB | ~101 GB | Mac 128GB, 2x A100
Q6_K | ~112 GB | ~118 GB | Mac 128GB+, 2x A100
Q8_0 | ~150 GB | ~156 GB | Mac 192GB, multi-A100
FP16 | ~282 GB | ~290 GB | Multi-H100 cluster

For full breakdowns including offloading strategies, see the dedicated Mixtral VRAM Requirements guide.


MoE vs. dense: what to run at each VRAM budget

This is the practical table. For each VRAM tier, here’s what actually makes sense.

VRAM | Best Dense Option | Best MoE Option | Winner | Why
8 GB | Llama 3.1 8B Q4 or Qwen 3 8B Q4 | Qwen 1.5 MoE A2.7B Q4 (~8GB) | Dense | The MoE is tiny and underperforms 8B dense models
12 GB | Qwen 3 14B Q4 (~9GB) | None that fit well | Dense | No competitive MoE fits in 12GB
16 GB | Qwen 3 32B Q3 (~16GB) | None that fit well | Dense | Still no good MoE options at this tier
24 GB | Qwen 3 32B Q4 (~20GB) | Mixtral 8x7B Q2 (~19GB) | Dense | Qwen 3 32B outperforms Mixtral on every benchmark. Not close.
48 GB | Llama 3.3 70B Q4 (~42GB) | Mixtral 8x7B Q8 (~52GB, needs offload) | Dense | Mixtral 8x22B Q3 (~67GB) doesn't fit at all, and 8x7B at Q8 still loses to 70B on quality
80 GB | Llama 3.3 70B Q8 (~74GB) | Mixtral 8x22B Q4 (~85GB, tight) | Dense / Tie | You can run 70B dense at near-lossless quant or 8x22B at Q4. Similar quality; 70B is the safer pick
2x 48 GB | Llama 3.3 70B Q8 (~74GB, easy fit) | Mixtral 8x22B Q5 (~101GB, slight offload) | Depends | 8x22B at high quant is compelling here, but consider DeepSeek distilled models too

The pattern is clear: below 48GB, dense models win. MoE only becomes attractive when you have enough VRAM to load the full model at decent quantization AND the MoE model’s total knowledge capacity outweighs what a dense model can offer at the same memory footprint.
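The tiers above collapse into a rough rule of thumb. Here is a hypothetical helper encoding it (the thresholds come from the table; `recommend` is a sketch, not a benchmark):

```python
def recommend(vram_gb: float) -> str:
    """Dense-vs-MoE rule of thumb for local inference (rough sketch)."""
    if vram_gb < 48:
        return "dense"          # no competitive MoE fits at a usable quant
    if vram_gb < 80:
        return "dense"          # 70B Q4 beats any MoE that (barely) fits
    if vram_gb < 96:
        return "dense or MoE"   # 70B Q8 vs Mixtral 8x22B Q4 is close
    return "MoE worth it"       # 8x22B at Q5+ becomes compelling

print(recommend(24))   # → dense
print(recommend(128))  # → MoE worth it
```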

For consumer hardware (the single 24GB GPU that most of our readers have), MoE models are almost never the right choice.


How routing actually works

The gating network is simple but important. Here’s what happens at each MoE layer:

# Simplified MoE routing (runnable NumPy sketch)
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_forward(hidden_state, experts, gate_weights, k=2):
    # Gate produces a score for each expert: shape [num_experts]
    scores = gate_weights @ hidden_state

    # Softmax to normalize scores into routing probabilities
    probs = softmax(scores)

    # Select the top-k experts (k=2 for Mixtral)
    top_k_indices = np.argsort(probs)[-k:]
    top_k_weights = probs[top_k_indices]

    # Renormalize the selected weights so they sum to 1
    top_k_weights = top_k_weights / top_k_weights.sum()

    # Run the token through the selected experts and combine
    return sum(w * experts[i](hidden_state)
               for i, w in zip(top_k_indices, top_k_weights))

The gate itself is typically just a linear projection: gate_scores = W_gate @ hidden_state, where W_gate has shape [num_experts, hidden_dim]. For Mixtral with 8 experts and a hidden dimension of 4096, that's an 8 × 4096 = 32,768-parameter matrix per MoE block, negligible compared to the expert weights.

Load balancing

A naive gating network might route most tokens to the same 2 “favorite” experts, leaving 6 experts undertrained. To prevent this, MoE training includes an auxiliary load balancing loss that penalizes uneven expert utilization. The original Mixtral paper and the Switch Transformer paper cover this in detail.
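The Switch Transformer's version of this loss is easy to state: multiply the fraction of tokens each expert actually receives by the mean routing probability it was assigned, sum, and scale. A rough sketch (Mixtral's exact formulation may differ in detail):

```python
# Switch Transformer-style auxiliary load-balancing loss (sketch).
import numpy as np

def load_balancing_loss(router_probs, expert_assignments, num_experts):
    # f_i: fraction of tokens actually routed to expert i
    f = np.bincount(expert_assignments, minlength=num_experts) / len(expert_assignments)
    # P_i: mean router probability assigned to expert i across tokens
    P = router_probs.mean(axis=0)
    # Minimized (value 1.0) when both distributions are uniform
    return num_experts * np.sum(f * P)
```

Because the loss grows when routing concentrates on a few experts, gradient descent pushes the gate toward spreading tokens evenly.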

In practice, well-trained MoE models distribute tokens roughly evenly across experts, though some experts do develop soft specializations. Research on Mixtral shows that certain experts handle particular languages or domains more often, but no expert is exclusively specialized. Every expert can handle general text.

Why experts specialize

The experts aren’t pre-assigned roles. They start as identical random initializations and differentiate during training through the combination of the gating network’s routing decisions and the load balancing loss. It’s similar to how neurons in a neural network develop feature selectivity: emergent specialization, not designed specialization.

This means you can’t just grab one expert and use it as a standalone model. Each expert is trained in the context of the routing mechanism and the shared attention layers. They’re collaborative specialists, not independent models.


When MoE actually wins

MoE is a good architecture stuck on the wrong hardware. Consumer GPUs don’t have the VRAM to make it worthwhile yet. In a datacenter context, it’s genuinely superior:

MoE advantages

MoE models learn more per training FLOP because the full parameter count sees data during training, but each training step only computes a fraction. DeepSeek V3 was reportedly trained for $5.5M, which is remarkably cheap for a 671B parameter model.

The per-token compute savings also translate directly to lower cost per query in datacenter deployments serving thousands of users. This is why DeepSeek, Google (Gemini uses MoE), and others chose this architecture for production APIs.

On top of the cost savings, MoE gives you a higher quality ceiling. Given the same inference compute budget, an MoE model can be smarter than a dense model because it has more total parameters to store knowledge in. A 46.7B-parameter MoE that runs at 13B speed will generally outperform a 13B dense model trained on the same data.

When dense wins for local AI

If you have 24GB VRAM, you can fit a dense 32B model at Q4 or an MoE 46.7B model at Q2. The dense model at higher quantization will almost certainly perform better because aggressive quantization destroys MoE routing precision.

Dense models are also easier to fine-tune, quantize predictably, and don’t have routing overhead. GGUF quantization for MoE models requires expert-aware handling that’s well-supported in llama.cpp now but was buggy for months after Mixtral’s release.

And dense models degrade more gracefully under quantization. MoE models have the gating network and expert weights both getting quantized, and errors in gating decisions cascade. Sending a token to the wrong expert entirely is worse than slightly imprecise math in the right expert.


The practical calculus

Here’s the decision framework.

24GB VRAM or less

Run dense models. Period. Qwen 3 32B at Q4 is better than any MoE model you can squeeze into 24GB. If you’re on 16GB or 12GB, see our guides for 16GB and 12GB recommendations.

The only exception: if you specifically need the Mixtral fine-tune ecosystem (there are specialized Mixtral fine-tunes for certain tasks), and you’re willing to run Q2_K quantization with degraded quality.

48GB VRAM

Mixtral 8x22B at Q3 needs ~67GB, so on 48GB it only runs with substantial CPU offloading, which costs speed. Llama 3.3 70B at Q4 (~42GB) fits entirely on a single 48GB GPU and will outperform it in most use cases with a simpler setup. MoE becomes interesting here only for specific tasks where 8x22B's training diversity helps.

80GB+ VRAM

Now MoE models make real sense. Mixtral 8x22B at Q4-Q5 quantization on an A100 80GB gives you excellent quality. That said, the DeepSeek R1 distilled variants (dense) are also compelling at this tier.

Multi-GPU / 128GB+ Mac

This is MoE’s sweet spot for local AI. Mixtral 8x22B at Q6+ or even Q8 quantization. If you have the memory, MoE delivers: fast inference with high knowledge capacity. The full DeepSeek V3 at Q4 needs ~350GB, but if you have the hardware, nothing else touches its quality.


What to remember

MoE gives you the speed of a small model but the memory footprint of a big one. Never confuse inference speed with VRAM requirements. Mixtral 8x7B (46.7B total) needs the same memory as a dense 46B model, no exceptions.

Below 24GB VRAM, skip MoE entirely. Dense models at the same memory budget will outperform them. Above 80GB, MoE starts to pay off with faster inference than a dense model of equivalent quality.

DeepSeek, Google Gemini, and Llama 4 all chose MoE. It’s the dominant architecture for frontier models now, and as consumer VRAM grows, MoE will eventually become the default local choice too.

If you have a single 24GB GPU, run Qwen 3 32B and forget about MoE. When 48GB GPUs hit consumer prices, or when you build a multi-GPU setup, come back and revisit.