What Can You Actually Run on 4GB VRAM?
More on this topic: Best Models Under 3B Parameters · CPU-Only LLMs · 8GB VRAM Guide
Let’s be direct: 4GB of VRAM is not a lot. It was entry-level five years ago, and it’s the absolute floor for local AI today. But “floor” doesn’t mean “useless.” If you’ve got a GTX 1050 Ti sitting in an old PC or a GTX 1650 in a gaming laptop, you can do more than you’d expect, as long as you pick the right models and don’t try to punch above your weight class.
This guide covers exactly what fits, what runs well, what barely works, and when you’re better off going CPU-only or upgrading. No sugarcoating.
Which GPUs Have 4GB?
| GPU | CUDA/Stream Cores | Memory Bandwidth | Used Price | Notes |
|---|---|---|---|---|
| GTX 1050 Ti | 768 CUDA | 112 GB/s | $65-80 | Most common 4GB card. Pascal architecture. |
| GTX 1650 | 896 CUDA | 128 GB/s | $65-75 | Slightly faster. Turing, but no tensor cores. |
| RX 570 (4GB) | 2048 stream | 224 GB/s | $45-65 | Higher bandwidth, but AMD ROCm dropped Polaris; must use Vulkan. |
| RX 580 (4GB) | 2304 stream | 256 GB/s | $50-55 | Same ROCm problem. Use llama.cpp with Vulkan backend. |
The AMD cards have 2x the memory bandwidth of the NVIDIA cards, which theoretically means faster token generation. But Ollama doesn’t support Vulkan, so you’ll need to use llama.cpp directly. If that’s not your thing, stick with NVIDIA: Ollama just works with CUDA.
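If you do take the llama.cpp-with-Vulkan route for a Polaris card, the build is only a few commands. A minimal sketch, assuming a Linux machine with the Vulkan drivers and SDK installed; recent llama.cpp checkouts use the GGML_VULKAN CMake flag (older ones called it LLAMA_VULKAN), and the model filename is just an example:
# Build llama.cpp with the Vulkan backend (works on cards ROCm no longer supports)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j
# Run a small model with every layer offloaded to the RX 570/580
./build/bin/llama-cli -m qwen2.5-3b-instruct-q4_k_m.gguf -ngl 99 -c 2048 -p "Hello"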
What Actually Fits in 4GB
Here’s the math. Your 4GB needs to hold model weights + KV cache + ~500MB of runtime overhead. That leaves roughly 3.5GB for the model and its context.
| Model | Quant | Weight Size | Fits in 4GB? | Room for Context? |
|---|---|---|---|---|
| Qwen 2.5 0.5B | Q4_K_M | 491 MB | Easily | Plenty (32K+ tokens) |
| Qwen 2.5 0.5B | Q8_0 | 676 MB | Easily | Plenty |
| Llama 3.2 1B | Q4_K_M | ~700 MB | Easily | Plenty |
| Qwen 2.5 1.5B | Q4_K_M | 1.12 GB | Yes | Good (16K+ tokens) |
| Qwen 2.5 1.5B | Q8_0 | 1.89 GB | Yes | Moderate (4-8K tokens) |
| Gemma 2 2B | Q4_K_M | 1.71 GB | Yes | Moderate |
| Qwen 2.5 3B | Q4_K_M | 2.10 GB | Yes | Limited (2-4K tokens) |
| Llama 3.2 3B | Q4_K_M | ~1.8 GB | Yes | Limited (2-4K tokens) |
| Phi-3.5 Mini 3.8B | Q4_K_M | ~2 GB | Tight | Very limited |
| Qwen 2.5 3B | Q8_0 | 3.62 GB | Barely | Almost none |
| Any 7B | Q2_K | ~3.0 GB | Technically | Unusable context + terrible quality |
| Any 7B | Q4_K_M | ~4.0 GB | No | N/A |
The sweet spot for 4GB: Qwen 2.5 3B at Q4_K_M. It’s 2.1GB, which leaves ~1.4GB for KV cache once you subtract runtime overhead, and the model is genuinely capable.
The trap: 7B at Q2. Yes, a Q2_K 7B model fits in ~3GB. But Q2 quality is severely degraded; outputs are incoherent more often than not. Multiple benchmarks show that a 3B model at Q4 outperforms a 7B at Q2. Don’t bother.
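An easy way to sanity-check this table on your own card: load a model and ask Ollama where it actually landed. A quick check, assuming Ollama is installed and using the qwen2.5:3b tag (a Q4-class quant by default):
# Pull and load the 3B model, then check where it ended up
ollama run qwen2.5:3b "Say hi"
ollama ps   # SIZE shows the loaded footprint; PROCESSOR should read "100% GPU"
If PROCESSOR shows a CPU/GPU split, the model plus its context didn’t fit, and you’re already paying the offloading penalty described later.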
Real-World Performance
These speeds come from benchmarks on the Quadro P1000 (4GB, comparable to GTX 1050 Ti class) and bandwidth-based estimates for the GTX 1650:
| Model | Quant | GTX 1050 Ti (est.) | GTX 1650 (est.) |
|---|---|---|---|
| Qwen 2.5 0.5B | Q4 | ~50-55 tok/s | ~55-65 tok/s |
| TinyLlama 1.1B | Q4 | ~55-62 tok/s | ~60-70 tok/s |
| Llama 3.2 1B | Q4 | ~25-30 tok/s | ~30-40 tok/s |
| Qwen 2.5 1.5B | Q4 | ~30-35 tok/s | ~35-45 tok/s |
| Gemma 2 2B | Q4 | ~18-20 tok/s | ~22-28 tok/s |
| Qwen 2.5 3B | Q4 | ~17-20 tok/s | ~22-30 tok/s |
| Llama 3.2 3B | Q4 | ~18-20 tok/s | ~25-32 tok/s |
| Phi-3.5 Mini 3.8B | Q4 | ~17-19 tok/s | ~20-25 tok/s |
Everything above 15 tok/s feels responsive for chat. You’re fine with any model up to the 3-4B range on these cards.
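Your numbers will vary with drivers, clocks, and context length, so measure rather than trust estimates. Ollama prints throughput with --verbose, and llama.cpp ships a dedicated benchmark tool; the GGUF filename below is just an example:
# Ollama: prints prompt eval and generation rates after the response
ollama run qwen2.5:3b --verbose "Explain what a KV cache is in two sentences."
# llama.cpp: benchmark prompt processing and token generation for a GGUF
./llama-bench -m qwen2.5-3b-instruct-q4_k_m.gguf -ngl 99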
What Works Well on 4GB
These use cases are genuinely practical on a 4GB GPU:
Chat and Q&A (3B models): Qwen 2.5 3B and Llama 3.2 3B handle general conversation, question answering, and brainstorming well. You won’t mistake it for GPT-4, but for quick local tasks it’s solid.
Simple coding assistance: Qwen 2.5 3B Instruct handles code completion, explaining functions, and writing short scripts. Not full-project coding, but useful for autocomplete and quick snippets.
Text classification and extraction: Small models excel at structured tasks like sentiment analysis, entity extraction, and categorization. A 1.5B model at Q8 runs at 30+ tok/s and handles these well.
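For these structured tasks it helps to force the model into a machine-readable reply. A sketch against Ollama’s REST API; the model tag and prompt are illustrative, and "format": "json" constrains the output to valid JSON:
curl -s http://localhost:11434/api/generate -d '{
  "model": "qwen2.5:1.5b",
  "prompt": "Classify the sentiment of this review as positive, negative, or neutral. Reply as {\"sentiment\": \"...\"}. Review: The battery died after two days.",
  "format": "json",
  "stream": false
}'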
Summarization (short docs): Feed in a few paragraphs and get a summary. Keep your context window short; 2-4K tokens is the practical limit on 4GB with a 3B model.
Embeddings for RAG: Embedding models like nomic-embed-text (270MB) or all-minilm (80MB) fit easily. You can run embeddings on GPU while keeping the LLM on CPU if needed. See our RAG guide for setup.
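A minimal embeddings call, assuming Ollama is running locally and you’ve pulled nomic-embed-text:
ollama pull nomic-embed-text
curl -s http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "4GB of VRAM is the floor for local AI."
}'
The response is a JSON object with an "embedding" array you can store in whatever vector store your RAG setup uses.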
What Doesn’t Work on 4GB
Be honest with yourself about these limitations:
7B+ models at usable quality: You can’t run Llama 3.1 8B, Mistral 7B, or Qwen 2.5 7B at Q4 or above; they simply don’t fit. Q2/Q3 quants fit, but the quality is so degraded it’s not worth using. Go CPU-only instead.
Image generation (mostly): SDXL needs 6GB+ and won’t load. Flux is out of the question. SD 1.5 technically works at 512x512 with --medvram flags, but you’re looking at 30-80 seconds per image on a GTX 1050 Ti. It’s possible, but painful. See our Stable Diffusion guide for details.
Long context: With a 3B model at Q4 taking ~2GB of VRAM, you have about 1-1.5GB left for KV cache. That’s roughly 2-4K tokens of context with FP16 KV cache. You can stretch to 6-8K tokens with Q8 KV cache quantization, but forget about 16K+ conversations.
Running a model + other GPU tasks: If your GPU is also driving your display, doing video decode, or running a game, subtract 200-500MB from your available VRAM. On a 4GB card, that can be the difference between a model loading or not.
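Before loading anything, check how much of the 4GB is already spoken for. On the NVIDIA cards, nvidia-smi ships with the driver:
# Total vs. used vs. free VRAM
nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv
nvidia-smi   # the full view lists per-process VRAM usage at the bottom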
GPU Offloading: The Partial Solution
If you want to try a 7B model, you can split it between GPU and CPU. Put some layers in VRAM (fast) and the rest in system RAM (slow). This is called partial offloading.
In Ollama
# Set number of layers on GPU (lower = more on CPU)
# A 7B model has ~32-36 layers
ollama run llama3.1:8b
>>> /set parameter num_gpu 10
Or in a Modelfile:
FROM llama3.1:8b
PARAMETER num_gpu 10
PARAMETER num_ctx 2048
In llama.cpp
./llama-cli -m model-Q4_K_M.gguf -ngl 10 -c 2048 -p "Hello"
The -ngl flag controls how many layers go to the GPU. Start low and increase until you’re near the VRAM limit.
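One way to find that limit without guesswork is to let llama-bench try several offload values in one run while you watch VRAM headroom in another terminal. A sketch; the GGUF filename is illustrative:
# Benchmark 4, 8, 12, and 16 GPU layers in a single run
./llama-bench -m mistral-7b-instruct-q4_k_m.gguf -ngl 4,8,12,16
# In a second terminal, watch how close you are to the 4GB ceiling
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.free --format=csv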
Is Partial Offloading Worth It?
Honest answer: usually not on 4GB.
| Config | 7B Q4 Speed |
|---|---|
| Full GPU (needs 6-8GB VRAM) | 30-40 tok/s |
| Partial offload (~10 layers on 4GB GPU) | 4-10 tok/s |
| CPU-only (modern CPU, 16GB+ RAM) | 7-15 tok/s |
Partial offloading on a 4GB card is often slower than just running on a decent CPU. The PCIe bus becomes the bottleneck: data constantly shuttles between VRAM and system RAM at ~16 GB/s (PCIe 3.0 x16), while your VRAM bandwidth is 112-256 GB/s.
When partial offload helps: If your CPU is old or slow (pre-Ryzen, low RAM bandwidth), even 10 layers on GPU can give a meaningful speedup. But if you have a modern Ryzen or Intel with DDR5, CPU-only is often faster for 7B models.
Should You Upgrade or Go CPU-Only?
This is the real question. Here are your options:
Option 1: Go CPU-Only (Free)
If you have 16-32GB of RAM, CPU-only inference is a legitimate option:
- 7B models at Q4: 7-15 tok/s on a modern CPU
- 13B models at Q4: 3-7 tok/s with 32GB RAM
- No VRAM limitations; context length is only limited by RAM
- DDR5 systems get noticeably better speeds than DDR4
Choose CPU-only if: You want to run 7B+ models and don’t want to spend money. Your 4GB GPU is still faster for 1-3B models, so use GPU for small models and CPU for larger ones.
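Forcing a fully-CPU run is straightforward in llama.cpp: offload zero layers. A sketch with an illustrative GGUF name; -t sets the thread count (nproc reports how many cores your system has):
# 7B on CPU only: no layers on the GPU, threads matched to the CPU
./llama-cli -m mistral-7b-instruct-q4_k_m.gguf -ngl 0 -t $(nproc) -c 4096 -p "Hello"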
Option 2: Upgrade the GPU ($120-200)
The used GPU market has great options that make 4GB look like a different world:
| GPU | Used Price | VRAM | What It Unlocks |
|---|---|---|---|
| RTX 2060 | ~$120 | 6GB | 7B models at Q4 fully on GPU |
| RX 6600 | ~$150 | 8GB | 8B models comfortably, some 13B |
| RTX 3060 12GB | ~$200 | 12GB | 13-14B models, serious local AI |
The RTX 3060 12GB at ~$200 is the most recommended budget LLM card in the community. It’s $16.67 per GB of VRAM and opens up models that are genuinely competitive with cloud AI for everyday tasks.
Choose upgrading if: You want 7B+ models at full speed, plan to do image generation, or want headroom to grow.
The Honest Take
4GB VRAM is the floor, not a foundation. It’s enough to experiment with small models, learn the tools, and figure out if local AI is for you. But if you get hooked, and you probably will, you’ll want more VRAM within a week. Budget $150-200 for a used GPU upgrade and consider it an investment. The jump from 4GB to 12GB is the biggest quality-of-life improvement in local AI.
Getting the Most Out of 4GB
If you’re sticking with 4GB for now, these tips help:
Set context length explicitly. Ollama and llama.cpp default to large context windows that will OOM on 4GB. Always set num_ctx to 2048 or 4096 max.
# Ollama (set it inside the interactive session)
ollama run qwen2.5:3b
>>> /set parameter num_ctx 2048
# llama.cpp
./llama-cli -m model.gguf -ngl 99 -c 2048
Enable KV cache quantization. This halves the memory used by context:
# Ollama (K/V cache quantization requires flash attention to be enabled)
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
ollama serve
# llama.cpp (the quantized V cache needs flash attention, hence -fa)
./llama-cli -m model.gguf -ngl 99 -fa -c 4096 -ctk q8_0 -ctv q8_0
With Q8 KV cache, a 3B model at Q4 can handle 4-6K tokens of context instead of 2-3K. No meaningful quality loss.
Try Flash Attention. On the GTX 1650 (Turing), Flash Attention gives a modest ~10% speedup on prompt processing and reduces VRAM usage. On the GTX 1050 Ti (Pascal), it saves VRAM but may not improve speed. Worth enabling either way:
# llama.cpp
./llama-cli -m model.gguf -ngl 99 --flash-attn -c 2048
Pick models with small KV caches. Qwen 2.5 uses only 2 KV heads, giving it dramatically smaller KV cache than Llama 3.2 or Phi-3.5 at the same parameter count. At 4K context:
| Model | KV Cache (FP16) | KV Cache (Q8) |
|---|---|---|
| Qwen 2.5 3B | 144 MB | ~72 MB |
| Llama 3.2 3B | 448 MB | ~224 MB |
| Phi-3.5 Mini 3.8B | 384 MB | ~192 MB |
Qwen 2.5 3B uses roughly a third of the KV cache of Llama 3.2 3B. On a 4GB card, that’s the difference between 4K and 2K of context. Qwen 2.5 is the best model family for 4GB VRAM.
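If you want to check these numbers for another model, the FP16 KV cache is just 2 (K and V) × layers × KV heads × head dim × 2 bytes × context tokens. A back-of-the-envelope check in the shell, using the layer and head counts I believe these models ship with:
# Qwen 2.5 3B: 36 layers, 2 KV heads, head dim 128, 4096-token context
echo $(( 2 * 36 * 2 * 128 * 2 * 4096 / 1024 / 1024 )) MB   # -> 144 MB
# Llama 3.2 3B: 28 layers, 8 KV heads, head dim 128, 4096-token context
echo $(( 2 * 28 * 8 * 128 * 2 * 4096 / 1024 / 1024 )) MB   # -> 448 MB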
Close everything else. Your browser, Discord, even your desktop compositor uses VRAM. On a 4GB card, 200MB matters. Close what you can before loading a model.
The Bottom Line
4GB VRAM is enough to get started with local AI. Run Qwen 2.5 3B at Q4, learn Ollama, try RAG with small models, experiment with what’s possible. You’ll get responsive chat, decent summarization, and basic coding help.
But don’t fight the hardware. If you need 7B+ models, go CPU-only or buy a used GPU. The RTX 3060 12GB at $200 transforms the experience from “making it work” to “this is actually good.” The used RTX 3090 at ~$750 opens up 24GB of possibilities that make 4GB feel like a different era.
Start small, learn the tools, upgrade when you’re ready.