
You loaded a model. It crashed. The error says something like:

CUDA error: out of memory

Or in Ollama:

llama runner exited, you may not have enough memory to run the model

Your model doesn’t fit in your GPU’s VRAM. Here’s how to fix it — fastest fixes first.


Fix It (Ranked by Speed)

1. Reduce Context Length (30 Seconds)

This is the fix most people miss. The KV cache, where the model stores your conversation context, scales linearly with context length. It doesn’t show up in the model’s listed size, so people don’t budget for it.

Model | Weights (Q4) | KV Cache at 4K | KV Cache at 16K | KV Cache at 32K
8B    | ~5GB         | ~0.6GB         | ~2.4GB          | ~5GB
14B   | ~8GB         | ~1.0GB         | ~4.0GB          | ~8GB
32B   | ~20GB        | ~2.0GB         | ~8.0GB          | ~16GB

An 8B model fits on 8GB VRAM at 4K context. At 32K, it needs 11GB. That’s how context kills you.
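To see where the table's numbers come from, here's a rough back-of-envelope sketch. The architecture values (32 layers, 8 KV heads via GQA, head dimension 128) are assumptions matching a Llama-3-8B-style model; the f16 results land close to the 8B row above.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """Rough KV cache size: keys + values, for every layer, at every position."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * ctx_len

# Assumed Llama-3-8B-style architecture: 32 layers, 8 KV heads, head_dim 128
for ctx in (4096, 16384, 32768):
    gib = kv_cache_bytes(32, 8, 128, ctx) / 2**30
    print(f"{ctx:>6} tokens -> ~{gib:.1f} GiB")
```

The per-token cost is fixed by the architecture, so the cache really does scale linearly: double the context, double the cache.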

Ollama:

# During a session: start the model, then set the parameter at the prompt
ollama run yourmodel
>>> /set parameter num_ctx 2048

# Or globally before starting the server
export OLLAMA_CONTEXT_LENGTH=4096
ollama serve

As of Ollama v0.17.0, context length auto-scales based on available VRAM. But if you’ve set an explicit num_ctx in a Modelfile or API call, that override wins and can blow your budget.

llama.cpp:

./llama-cli -m model.gguf -c 4096 -ngl 99 -p "Your prompt"

LM Studio: Find Context Length in the model settings panel and drop it. Start at 4096.

How low can you go? 2048 handles single-turn Q&A and short code generation. 4096 covers most conversations. Only go higher if you actually need it for RAG or long documents.

Expected savings: 1-8GB depending on what you’re dropping from.

2. Quantize the KV Cache (2 Minutes)

Most people don’t know this exists. KV cache quantization cuts cache memory roughly in half with negligible quality loss. Real numbers: a 1,792MB cache drops to 952MB with q8_0. That’s 47% gone.

Ollama:

export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
ollama serve

Options: f16 (default, full size), q8_0 (half, negligible loss), q4_0 (one-third, slight loss on long contexts). Start with q8_0.

Flash Attention (OLLAMA_FLASH_ATTENTION=1) should always be on. Less memory, faster inference, no quality tradeoff.

llama.cpp:

./llama-server -m model.gguf -fa -ctk q8_0 -ctv q8_0 -ngl 99

-fa enables flash attention. -ctk and -ctv set cache quantization for keys and values.

LM Studio: Enable Flash Attention in model settings. KV cache quant support depends on version.

Expected savings: 0.5-4GB depending on context length.
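As a sanity check on that 47% figure, here's a sketch of the scaling. The per-element byte counts are assumptions based on llama.cpp's block layouts (q8_0 stores 32 values in 34 bytes, q4_0 stores 32 values in 18 bytes), which is why q8_0 is slightly more than half of f16:

```python
# Approximate bytes per cached element for each cache type (assumed block
# layouts: q8_0 = 34 bytes per 32 values, q4_0 = 18 bytes per 32 values)
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}

def cache_mb(f16_mb, cache_type):
    """Scale an f16-sized cache to the chosen quantized cache type."""
    return f16_mb * BYTES_PER_ELEM[cache_type] / BYTES_PER_ELEM["f16"]

print(round(cache_mb(1792, "q8_0")))  # the article's 1,792MB example -> 952
```

That reproduces the measured drop from 1,792MB to 952MB exactly.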

3. Close Other GPU Apps

Check what’s already eating VRAM:

nvidia-smi

Chrome with hardware acceleration, Discord, a game in the background — each can claim hundreds of megabytes. Together, 0.5-3GB gone before you load anything. Close them. On Linux, nvtop gives a real-time view.

4. Use a Smaller Quantization

Quantization compresses model weights. Dropping from Q6_K to Q4_K_M saves ~30% VRAM with a small quality hit.

Quantization | 7B Model | 14B Model | 32B Model | Quality
Q8_0         | ~8 GB    | ~15 GB    | ~34 GB    | Baseline
Q6_K         | ~6 GB    | ~12 GB    | ~26 GB    | Negligible loss
Q4_K_M       | ~5 GB    | ~9 GB     | ~20 GB    | Small loss
Q3_K_M       | ~4 GB    | ~7 GB     | ~17 GB    | Noticeable loss

Q4_K_M is the sweet spot. Ollama uses it by default — if you’re already on Q4_K_M, this fix won’t help. In llama.cpp or LM Studio, download a different GGUF from HuggingFace.

Expected savings: 2-15GB depending on what you drop to.
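The table's figures follow from weights ≈ parameters × bits-per-weight / 8. A sketch, where the bits-per-weight values are approximations for typical GGUF layouts (real files also keep a few tensors at higher precision, which is why the table runs slightly higher):

```python
# Approximate effective bits per weight for common GGUF quants (assumed values
# based on typical llama.cpp block formats, not exact for every model)
BITS_PER_WEIGHT = {"Q8_0": 8.5, "Q6_K": 6.56, "Q4_K_M": 4.85, "Q3_K_M": 3.91}

def weights_gb(params_billions, quant):
    """Raw weight size in GB: parameters x bits-per-weight / 8 bits-per-byte."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8

for quant in BITS_PER_WEIGHT:
    print(f"32B at {quant}: ~{weights_gb(32, quant):.0f} GB")
```

Useful when sizing a download: multiply the parameter count by the quant's bits-per-weight, divide by 8, and leave a gigabyte or two of headroom.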

5. Unload Other Models

Ollama keeps models in VRAM after you stop chatting (5 minutes by default). If you tested three models, they’re all competing for space.

# See what's loaded
ollama ps

# Unload a specific model
ollama stop llama3.2

# Or limit to one model at a time
export OLLAMA_MAX_LOADED_MODELS=1

6. Partial CPU Offload

If you’re almost there — model needs 10GB and you have 8GB — offload some layers to system RAM.

llama.cpp:

# Offload only 20 layers to GPU, rest goes to CPU
./llama-cli -m model.gguf -ngl 20 -c 4096
# Increase -ngl until OOM, then back off by 2-3

Ollama handles this automatically when a model exceeds VRAM. Check with ollama ps — the Processor column shows the split. If it says 48%/52% CPU/GPU, that’s why it’s slow.

LM Studio: Set “GPU Offload” layer count in model settings.

The tradeoff is speed. Full GPU: 50 tok/s. Half on CPU: 10-15 tok/s. A few layers offloaded: minor hit. Half the model offloaded: painful. But 10 tok/s beats a crash.
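Those throughput numbers fall out of a crude mental model: per-token time is the sum of time spent in GPU layers and CPU layers, so throughput collapses toward the slower side quickly. A sketch, with the 50 tok/s GPU and ~6 tok/s CPU-only speeds as assumed inputs:

```python
def offload_tok_per_s(gpu_frac, gpu_tps=50.0, cpu_tps=6.0):
    """Crude latency model: per-token time is a weighted sum of per-layer GPU
    and CPU time, so overall speed is a weighted harmonic mean of the two."""
    per_token_s = gpu_frac / gpu_tps + (1 - gpu_frac) / cpu_tps
    return 1 / per_token_s

print(f"{offload_tok_per_s(1.0):.0f} tok/s")  # all layers on GPU
print(f"{offload_tok_per_s(0.5):.0f} tok/s")  # half the layers on CPU
```

Note the asymmetry: offloading the last 10% of layers barely hurts, but once half the model is on CPU, the CPU term dominates.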

7. Accept You Need a Smaller Model

Sometimes the math doesn’t work. Here’s what actually fits at Q4_K_M with room for a reasonable context window:

Your VRAM | Comfortable Max   | Tight But Works  | Don't Bother
8GB       | 8B at 4K ctx      | 14B Q3 at 2K ctx | 32B anything
12GB      | 14B at 8K ctx     | 32B Q3 at 2K ctx | 70B anything
16GB      | 14B Q6 at 8K ctx  | 32B Q4 at 4K ctx | 70B anything
24GB      | 32B Q4 at 16K ctx | 70B Q3 at 4K ctx | 70B at long ctx

If your model is in the “Don’t Bother” column, no configuration trick will save you. A 14B Q4 that runs at full speed on GPU beats a 32B that’s half on CPU every time. For a full breakdown, see VRAM requirements by model size.

If you keep hitting the wall, a used RTX 3090 at 24GB ($700-900) is the most cost-effective upgrade in local AI.


The OLLAMA_NUM_PARALLEL Trap

This catches people. If you set OLLAMA_NUM_PARALLEL=4 and num_ctx=4096, Ollama allocates KV cache for 4 x 4096 = 16,384 tokens. That can double your VRAM usage with no warning.

If you’re running into OOM and you’ve set OLLAMA_NUM_PARALLEL above 1 — especially with Open WebUI where parallel requests are on by default — try:

export OLLAMA_NUM_PARALLEL=1
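The multiplication is easy to see in numbers. Using a rough ~128 KiB-per-token f16 KV cost for an 8B-class model (an assumption, consistent with the table earlier):

```python
KIB_PER_TOKEN = 128  # assumed f16 KV cost per token for an 8B-class model

def kv_cache_gib(num_parallel, num_ctx):
    """Ollama pre-allocates KV cache for num_parallel * num_ctx tokens."""
    return num_parallel * num_ctx * KIB_PER_TOKEN / 2**20

print(f"{kv_cache_gib(1, 4096):.1f} GiB vs {kv_cache_gib(4, 4096):.1f} GiB")
```

Same model, same num_ctx, four times the cache.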

Why This Happens: The Math

Three things compete for VRAM:

Total VRAM = Model weights + KV cache + Overhead (~500MB-1GB)

  • Weights: (Parameters x Bits per weight) / 8. A 7B at Q4 = ~3.5GB raw.
  • KV cache: Grows linearly with context length. At 8K on a 7B model, ~1.5GB. At 32K, ~5GB. This is the part people forget.
  • Overhead: CUDA runtime, activation memory, scratch buffers.

This is why you can load a model fine but crash when the conversation gets long. The KV cache grows as you talk. And it’s why OLLAMA_NUM_PARALLEL is dangerous — it multiplies the cache.
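The budget check is one line of arithmetic. A sketch using the formula above, with ~0.75GB as an assumed midpoint for overhead:

```python
def fits(vram_gb, weights_gb, kv_gb, overhead_gb=0.75):
    """Total VRAM = weights + KV cache + overhead; returns (fits, needed_gb)."""
    needed = weights_gb + kv_gb + overhead_gb
    return needed <= vram_gb, needed

# An 8B Q4 (~5GB weights) on an 8GB card: fine at 4K context, OOM at 32K
print(fits(8, 5.0, 0.6))  # 4K context  -> (True, 6.35)
print(fits(8, 5.0, 5.0))  # 32K context -> (False, 10.75)
```

Run the check with your own numbers before blaming the runtime: if `needed` exceeds your VRAM, no flag will fix it.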


Common Traps

Ollama keeps models loaded. After you stop chatting, the model stays in VRAM for 5 minutes. Switching models means both compete for space. Use ollama ps and ollama stop.

VRAM fragmentation. Loading and unloading multiple models can fragment VRAM. If OOM hits on a model that should fit, restart the Ollama service to clear the fragmented allocations.

Desktop environment eats VRAM. Linux compositors and Windows Aero use 200-500MB. If you’re on the edge, that matters. Headless server setups reclaim it.

Qwen 3.5 GPU/CPU split crash. Qwen 3.5 models had a bug in Ollama before v0.17.5 where splitting across GPU and CPU would crash (not just slow down — crash). If you’re hitting this, update Ollama.


Quick Diagnostic Checklist

Run through this when you hit OOM:

  1. What’s your context length? (ollama show model or check -c flag) — if above 4096, try 2048
  2. Is KV cache quantized? Set OLLAMA_KV_CACHE_TYPE=q8_0 if not
  3. Is Flash Attention on? OLLAMA_FLASH_ATTENTION=1 — always should be
  4. Are other models loaded? ollama ps — unload them
  5. Is something else using VRAM? nvidia-smi — close it
  6. What quantization are you running? Drop to Q4_K_M if higher
  7. Is OLLAMA_NUM_PARALLEL above 1? Try setting to 1
  8. Is partial offload happening? ollama ps, Processor column — reduce model size or context to get back to 100% GPU

If you’ve tried all of these and it still doesn’t fit, you need a smaller model or a bigger GPU. That’s not a configuration problem. That’s arithmetic.


Bottom Line

CUDA OOM is almost always one of two things: context length too high, or model too big for your GPU. Fix order:

  1. Drop num_ctx to 2048-4096
  2. Enable Flash Attention + KV cache q8_0
  3. Close apps eating VRAM
  4. Drop quantization to Q4_K_M
  5. Unload other models
  6. Partially offload to CPU
  7. Use a smaller model

Most of the time, step 1 fixes it. People set 32K context because the model supports it, not because they need it. That alone can double your VRAM usage.

For what fits where, see our VRAM requirements guide. For models that balance quality and VRAM, check our best local coding models guide.