CUDA Out of Memory: What It Means and How to Fix It
You loaded a model. It crashed. The error says something like:
CUDA error: out of memory
Or in Ollama:
llama runner exited, you may not have enough memory to run the model
Your model doesn’t fit in your GPU’s VRAM. Here’s how to fix it — fastest fixes first.
Fix It (Ranked by Speed)
1. Reduce Context Length (30 Seconds)
This is the fix most people miss. The KV cache, where the model stores your conversation context, scales linearly with context length. It doesn’t show up in the model’s listed size, so people don’t budget for it.
| Model | Weights (Q4) | KV Cache at 4K | KV Cache at 16K | KV Cache at 32K |
|---|---|---|---|---|
| 8B | ~5GB | ~0.6GB | ~2.4GB | ~5GB |
| 14B | ~8GB | ~1.0GB | ~4.0GB | ~8GB |
| 32B | ~20GB | ~2.0GB | ~8.0GB | ~16GB |
An 8B model fits on 8GB VRAM at 4K context. At 32K, it needs 11GB. That’s how context kills you.
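You can sanity-check the table with shell arithmetic. The sketch below assumes an 8B-class architecture (32 layers, 8 KV heads via grouped-query attention, head dimension 128) with an f16 cache; your model's exact shape will differ:

```shell
# KV cache bytes = 2 (K and V) x layers x kv_heads x head_dim x ctx x bytes per element
# Assumed 8B-class shape: 32 layers, 8 KV heads (GQA), head_dim 128, f16 (2 bytes)
layers=32; kv_heads=8; head_dim=128; ctx=32768
echo "$(( 2 * layers * kv_heads * head_dim * ctx * 2 / 1024 / 1024 )) MB"
# prints "4096 MB"
```

That's ~4GB of cache alone at 32K context under these assumptions, in the same ballpark as the table's ~5GB; exact figures depend on the model's layer count and attention layout.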
Ollama:
# During a session
ollama run yourmodel
>>> /set parameter num_ctx 2048
# Or globally before starting the server
export OLLAMA_CONTEXT_LENGTH=4096
ollama serve
As of Ollama v0.17.0, context length auto-scales based on available VRAM. But if you’ve set an explicit num_ctx in a Modelfile or API call, that override wins and can blow your budget.
llama.cpp:
./llama-cli -m model.gguf -c 4096 -ngl 99 -p "Your prompt"
LM Studio: Find Context Length in the model settings panel and drop it. Start at 4096.
How low can you go? 2048 handles single-turn Q&A and short code generation. 4096 covers most conversations. Only go higher if you actually need it for RAG or long documents.
Expected savings: 1-8GB depending on what you’re dropping from.
2. Quantize the KV Cache (2 Minutes)
Most people don’t know this exists. KV cache quantization cuts cache memory roughly in half with negligible quality loss. Real numbers: a 1,792MB cache drops to 952MB with q8_0. That’s 47% gone.
Ollama:
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
ollama serve
Options: f16 (default, full size), q8_0 (half, negligible loss), q4_0 (one-third, slight loss on long contexts). Start with q8_0.
Flash Attention (OLLAMA_FLASH_ATTENTION=1) should always be on. Less memory, faster inference, no quality tradeoff.
llama.cpp:
./llama-server -m model.gguf -fa -ctk q8_0 -ctv q8_0 -ngl 99
-fa enables flash attention. -ctk and -ctv set cache quantization for keys and values.
LM Studio: Enable Flash Attention in model settings. KV cache quant support depends on version.
Expected savings: 0.5-4GB depending on context length.
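The halving follows directly from element size: f16 stores each cache element in 2 bytes, q8_0 in roughly 1 (plus a small overhead for quantization scales, which is why the real-world example above saves 47% rather than a clean 50%). A rough comparison, again assuming the 8B-class shape of 32 layers, 8 KV heads, head_dim 128:

```shell
ctx=8192
per_tok_f16=$(( 2 * 32 * 8 * 128 * 2 ))   # ~2 bytes per element at f16
per_tok_q8=$((  2 * 32 * 8 * 128 * 1 ))   # ~1 byte per element at q8_0
echo "f16: $(( per_tok_f16 * ctx / 1048576 )) MB, q8_0: $(( per_tok_q8 * ctx / 1048576 )) MB"
# prints "f16: 1024 MB, q8_0: 512 MB"
```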
3. Close Other GPU Apps
Check what’s already eating VRAM:
nvidia-smi
Chrome with hardware acceleration, Discord, a game in the background — each can claim hundreds of megabytes. Together, 0.5-3GB gone before you load anything. Close them. On Linux, nvtop gives a real-time view.
4. Use a Smaller Quantization
Quantization compresses model weights. Dropping from Q6_K to Q4_K_M saves ~30% VRAM with a small quality hit.
| Quantization | 7B Model | 14B Model | 32B Model | Quality |
|---|---|---|---|---|
| Q8_0 | ~8 GB | ~15 GB | ~34 GB | Baseline |
| Q6_K | ~6 GB | ~12 GB | ~26 GB | Negligible loss |
| Q4_K_M | ~5 GB | ~9 GB | ~20 GB | Small loss |
| Q3_K_M | ~4 GB | ~7 GB | ~17 GB | Noticeable loss |
Q4_K_M is the sweet spot. Ollama uses it by default — if you’re already on Q4_K_M, this fix won’t help. In llama.cpp or LM Studio, download a different GGUF from HuggingFace.
Expected savings: 2-15GB depending on what you drop to.
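The table values follow from bits per weight. K-quants average slightly more than their nominal size; assuming roughly 4.8 bits/weight for Q4_K_M, a quick check:

```shell
# Weights GB ~= params x bits_per_weight / 8; ~4.8 bits/weight assumed for Q4_K_M
awk 'BEGIN { printf "14B at Q4_K_M: ~%.1f GB weights\n", 14e9 * 4.8 / 8 / 1e9 }'
# prints "14B at Q4_K_M: ~8.4 GB weights"
```

Add a little format overhead and you land near the table's ~9GB for a 14B model.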
5. Unload Other Models
Ollama keeps models in VRAM after you stop chatting (5 minutes by default). If you tested three models, they’re all competing for space.
# See what's loaded
ollama ps
# Unload a specific model
ollama stop llama3.2
# Or limit to one model at a time
export OLLAMA_MAX_LOADED_MODELS=1
6. Partial CPU Offload
If you’re almost there — model needs 10GB and you have 8GB — offload some layers to system RAM.
llama.cpp:
# Offload only 20 layers to GPU, rest goes to CPU
./llama-cli -m model.gguf -ngl 20 -c 4096
# Increase -ngl until OOM, then back off by 2-3
Ollama handles this automatically when a model exceeds VRAM. Check with ollama ps — the Processor column shows the split. If it says 48%/52% CPU/GPU, that’s why it’s slow.
LM Studio: Set “GPU Offload” layer count in model settings.
The tradeoff is speed. Full GPU: 50 tok/s. Half on CPU: 10-15 tok/s. A few layers offloaded: minor hit. Half the model offloaded: painful. But 10 tok/s beats a crash.
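Rather than guessing a starting -ngl value, you can budget layers from VRAM. The per-layer figure below is an assumption (a 32B Q4 model at ~20GB spread over ~64 layers works out to ~320MB/layer); substitute your model's numbers:

```shell
# Assumed figures: 8GB card, ~1.5GB reserved for KV cache + overhead, ~320MB per layer
vram_mb=8192; overhead_mb=1500; layer_mb=320
echo "try -ngl $(( (vram_mb - overhead_mb) / layer_mb ))"
# prints "try -ngl 20"
```

Then nudge upward until OOM and back off, as above.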
7. Accept You Need a Smaller Model
Sometimes the math doesn’t work. Here’s what actually fits at Q4_K_M with room for a reasonable context window:
| Your VRAM | Comfortable Max | Tight But Works | Don’t Bother |
|---|---|---|---|
| 8GB | 8B at 4K ctx | 14B Q3 at 2K ctx | 32B anything |
| 12GB | 14B at 8K ctx | 32B Q3 at 2K ctx | 70B anything |
| 16GB | 14B Q6 at 8K ctx | 32B Q4 at 4K ctx | 70B anything |
| 24GB | 32B Q4 at 16K ctx | 70B Q3 at 4K ctx | 70B at long ctx |
If your model is in the “Don’t Bother” column, no configuration trick will save you. A 14B Q4 that runs at full speed on GPU beats a 32B that’s half on CPU every time. For a full breakdown, see VRAM requirements by model size.
If you keep hitting the wall, a used RTX 3090 at 24GB ($700-900) is the most cost-effective upgrade in local AI.
The OLLAMA_NUM_PARALLEL Trap
This catches people. If you set OLLAMA_NUM_PARALLEL=4 and num_ctx=4096, Ollama allocates KV cache for 4 x 4096 = 16,384 tokens. That can double your VRAM usage with no warning.
If you’re running into OOM and you’ve set OLLAMA_NUM_PARALLEL above 1 — especially with Open WebUI where parallel requests are on by default — try:
export OLLAMA_NUM_PARALLEL=1
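The multiplication is easy to quantify. Using the assumed 8B-class per-token cache cost (2 x 32 layers x 8 KV heads x 128 dims x 2 bytes = 128KB/token at f16), four parallel slots at 4K context cost as much as one 16K session:

```shell
parallel=4; ctx=4096
per_tok_kb=128   # assumed per-token KV cost for an 8B-class model at f16
echo "$(( parallel * ctx )) tokens -> $(( parallel * ctx * per_tok_kb / 1048576 )) GB of KV cache"
# prints "16384 tokens -> 2 GB of KV cache"
```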
Why This Happens: The Math
Three things compete for VRAM:
Total VRAM = Model weights + KV cache + Overhead (~500MB-1GB)
- Weights: (Parameters x Bits per weight) / 8. A 7B at Q4 = ~3.5GB raw.
- KV cache: Grows linearly with context length. At 8K on a 7B model, ~1.5GB. At 32K, ~5GB. This is the part people forget.
- Overhead: CUDA runtime, activation memory, scratch buffers.
This is why you can load a model fine but crash when the conversation gets long. The KV cache grows as you talk. And it’s why OLLAMA_NUM_PARALLEL is dangerous — it multiplies the cache.
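Putting the formula together as a budget check, using the approximate example numbers from the bullets above (7B at Q4, 8K context):

```shell
# All figures approximate, from the formula above
weights_mb=3500; kv_mb=1500; overhead_mb=1000
echo "need ~$(( (weights_mb + kv_mb + overhead_mb) / 1000 )) GB VRAM"
# prints "need ~6 GB VRAM"
```

If that total exceeds what nvidia-smi reports as free, you'll OOM, sooner or later.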
Common Traps
Ollama keeps models loaded. After you stop chatting, the model stays in VRAM for 5 minutes. Switching models means both compete for space. Use ollama ps and ollama stop.
VRAM fragmentation. Loading and unloading multiple models can fragment VRAM. If OOM hits on a model that should fit, restart the Ollama service to clear fragmented allocations.
Desktop environment eats VRAM. Linux compositors and Windows Aero use 200-500MB. If you’re on the edge, that matters. Headless server setups reclaim it.
Qwen 3.5 GPU/CPU split crash. Qwen 3.5 models had a bug in Ollama before v0.17.5 where splitting across GPU and CPU would crash (not just slow down — crash). If you’re hitting this, update Ollama.
Quick Diagnostic Checklist
Run through this when you hit OOM:
- What's your context length? (ollama show model, or check the -c flag) — if above 4096, try 2048
- Is KV cache quantized? Set OLLAMA_KV_CACHE_TYPE=q8_0 if not
- Is Flash Attention on? OLLAMA_FLASH_ATTENTION=1 — always should be
- Are other models loaded? ollama ps — unload them
- Is something else using VRAM? nvidia-smi — close it
- What quantization are you running? Drop to Q4_K_M if higher
- Is OLLAMA_NUM_PARALLEL above 1? Try setting it to 1
- Is partial offload happening? ollama ps, Processor column — reduce model size or context to get back to 100% GPU
If you’ve tried all of these and it still doesn’t fit, you need a smaller model or a bigger GPU. That’s not a configuration problem. That’s arithmetic.
Bottom Line
CUDA OOM is almost always one of two things: context length too high, or model too big for your GPU. Fix order:
1. Drop num_ctx to 2048-4096
2. Enable Flash Attention + KV cache q8_0
3. Close apps eating VRAM
4. Drop quantization to Q4_K_M
5. Unload other models
6. Partially offload to CPU
7. Use a smaller model
Most of the time, step 1 fixes it. People set 32K context because the model supports it, not because they need it. That alone can double your VRAM usage.
For what fits where, see our VRAM requirements guide. For models that balance quality and VRAM, check our best local coding models guide.