
You have 8GB of VRAM. Maybe it’s an RTX 4060, a 3060 Ti, a 3070, or even an older 2080. You’ve seen people running AI chatbots locally and you’re wondering: can I actually do that with my card?

The short answer is yes — with limits. 8GB is the floor for local AI, not the sweet spot. But “the floor” doesn’t mean useless. It means you need to know exactly what fits, what doesn’t, and how to squeeze every megabyte. That’s what this guide covers.


Who This Is For

If you own any of these cards, this guide is for you:

| GPU | VRAM | Architecture |
| --- | --- | --- |
| RTX 4060 | 8GB | Ada Lovelace |
| RTX 3070 | 8GB | Ampere |
| RTX 3060 Ti | 8GB | Ampere |
| RTX 2080 | 8GB | Turing |
| RTX 2070 | 8GB | Turing |
| RTX 2060 Super | 8GB | Turing |

This is one of the largest GPU audiences out there. The RTX 4060 alone is the best-selling current-gen card. If you already own one, you don’t need to spend another dime to start running AI locally. You just need to know what’s realistic.


The Honest Truth About 8GB

Here’s the deal: 8GB is limiting, but it’s limiting the way a studio apartment is limiting. You can absolutely live there — you just can’t spread out.

A 9B model at Q4_K_M quantization uses roughly 5.5-6GB for weights, leaving 2-2.5GB for the KV cache (working memory) and system overhead. That’s a tight fit, and it means you’ll be making tradeoffs — shorter context windows, no room for multiple models, and nothing bigger than 9B running comfortably.
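That budget can be sketched with back-of-envelope arithmetic. The numbers below are assumptions, not measured values: Q4_K_M files average roughly 5 bits per weight once embeddings and metadata are counted, and the layer and head counts are Llama-3.1-8B-like (32 layers, 8 KV heads from grouped-query attention, head dimension 128).

```python
# Rough VRAM budget for a quantized model on an 8GB card.
# Assumed values: ~5 bits/weight for a Q4_K_M file including overhead;
# layer/head counts are Llama-3.1-8B-like, not exact for any one model.

def weights_gb(params_b, bits_per_weight=5.0):
    """Weight memory in GB for a model with params_b billion parameters."""
    return params_b * bits_per_weight / 8

def kv_cache_gb(context, layers=32, kv_heads=8, head_dim=128):
    """K and V each store layers * kv_heads * head_dim FP16 values per token."""
    return 2 * layers * kv_heads * head_dim * context * 2 / 1e9

w = weights_gb(9)       # ~5.6 GB for a 9B model at Q4
kv = kv_cache_gb(8192)  # ~1.1 GB at 8K context for a GQA model
print(f"weights {w:.1f} GB + KV {kv:.1f} GB + ~1 GB overhead = {w + kv + 1:.1f} GB")
```

Older full-attention models (32 KV heads instead of 8) multiply the KV cache figure by four, which is why context-related memory varies so much between models.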

But here’s what matters: Qwen 3.5 9B at Q4 on an 8GB GPU produces genuinely useful output at ~38 tokens per second, handles images natively, and scores 82.5 on MMLU-Pro — numbers that were 70B territory a year ago. That’s fast enough for interactive chat, coding assistance, and real work. You’re not watching paint dry.


What Runs Well on 8GB

7B-9B Models at Q4: The Sweet Spot

This is where 8GB cards shine. A quantized 7B-9B model at Q4 fits with room to breathe, runs fast, and retains ~90-95% of the original model’s quality.

| Model | VRAM Used (Q4_K_M) | Speed (RTX 4060) | Best For |
| --- | --- | --- | --- |
| Qwen 3.5 9B | ~7 GB | ~38 tok/s | Everything — new default pick |
| Llama 3.1 8B | ~5.5 GB | ~42 tok/s | General assistant, writing |
| Mistral 7B | ~4.8 GB | ~45 tok/s | Fast chat, summarization |
| DeepSeek R1 8B | ~5.5 GB | ~38 tok/s | Reasoning, math |

Qwen 3.5 9B changed the math here. It’s a dense 9B model (all parameters active), the 5.68GB Q4_K_M file needs ~7GB VRAM with 8K context, and it scores 82.5 on MMLU-Pro and 81.7 on GPQA Diamond — numbers that beat models three times its size. It also handles images and video natively from the same weights. No separate vision model needed.

The tradeoff: it’s a tighter fit than the older 7B models. Llama 3.1 8B and Mistral 7B leave more headroom for context and background apps. If you need longer conversations or can’t close Chrome, those are still good picks. But for raw capability per VRAM byte, Qwen 3.5 9B is the new king.

These speeds are real-world numbers with full GPU offload. At 35+ tokens per second, responses feel like a fast typist. You’ll be reading slower than the model generates.
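To put those speeds in reading terms, a quick conversion. The ~0.75 words-per-token ratio is a common rule of thumb for English text, not a measured property of these models:

```python
# Convert token generation speed to words per minute.
# Assumes ~0.75 English words per token (a rough rule of thumb).

def words_per_minute(tok_per_s, words_per_token=0.75):
    return tok_per_s * words_per_token * 60

wpm = words_per_minute(38)  # Qwen 3.5 9B on an RTX 4060
print(f"~{wpm:.0f} words/min, vs ~250 words/min typical reading speed")
```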

Smaller Models at Higher Quality

If you want more VRAM headroom — for longer conversations or running other apps alongside — smaller models punch above their weight:

| Model | VRAM Used (Q4_K_M) | Speed (RTX 4060) | Best For |
| --- | --- | --- | --- |
| Qwen 3.5 3B | ~2.5 GB | ~65 tok/s | Fast chat, coding, multilingual |
| Qwen 3.5 4B | ~3.0 GB | ~58 tok/s | Step up from 3B, still light |
| Llama 3.2 3B | ~2.5 GB | ~65 tok/s | Quick Q&A, simple tasks |
| Phi-3 Mini (3.8B) | ~2.8 GB | ~60 tok/s | Reasoning, coding (compact) |

Qwen 3.5 3B is part of the same March 2026 small model drop as the 9B. It punches well above its weight for a 3B model — noticeably better at instruction following and coding than Llama 3.2 3B. If you need headroom for long context or want to keep other apps running, start here.

These leave 5GB+ free, which means longer context windows, no VRAM pressure, and snappy responses. They’re less capable than the 9B models, but for quick questions and simple coding tasks, they’re surprisingly good.


What’s Possible But Painful

13B Models at Low Quantization

You can technically squeeze a 13B model into 8GB at Q2 or Q3 quantization. Should you? Usually not.

The numbers tell the story: GPU utilization drops to 25-42% because the model barely fits, inference crawls, and Q2-Q3 quantization degrades quality noticeably. You’re running a larger model badly instead of a smaller model well.

When it’s worth trying: If you need a specific 13B model for a task where the 7B version doesn’t cut it — and you can tolerate slow, lower-quality output. Think of it as a proof-of-concept, not a daily driver.

When to skip it: For everything else. A Llama 3.1 8B at Q4 will outperform a Llama 2 13B at Q2 in most practical tasks, and it’ll do it 5x faster.

Context Length Limits

Here’s the part nobody warns you about: context windows eat VRAM. With a 7B model at Q4_K_M, the KV cache for an 8K context window needs 2-3GB. On 8GB total, that leaves almost nothing.

In practice, expect:

  • 2048-4096 tokens: Comfortable, fast, no issues
  • 8192 tokens: Tight. May work but monitor for slowdowns
  • 16K+: Forget it. You’ll hit OOM (out of memory) errors

For most conversations, 2-4K context is enough. But if you need to process long documents or maintain very long chat histories, 8GB will frustrate you.


What Won’t Work

Let’s save you the troubleshooting time:

  • 30B+ models: Doesn’t fit. Period. Not even at Q2. A 30B model at Q2 still needs ~12GB.
  • 70B models: Even with CPU offloading, inference drops to 1-3 tokens per second. That’s not usable — it’s a screensaver.
  • Fine-tuning: Needs 16GB minimum for LoRA on 7B models. Training is out of scope for 8GB.
  • Multiple models simultaneously: One model at a time. Loading a second will OOM your first.
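A quick sanity check on why offloaded 70B speeds are unusable, assuming a 400-token reply (roughly a few paragraphs):

```python
# Time to generate a 400-token reply at different speeds.

def reply_seconds(tokens, tok_per_s):
    return tokens / tok_per_s

print(f"GPU-resident 8B at 38 tok/s: {reply_seconds(400, 38):.0f} s")        # ~11 s
print(f"CPU-offloaded 70B at 2 tok/s: {reply_seconds(400, 2) / 60:.1f} min")  # ~3.3 min
```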

If any of these are dealbreakers, skip ahead to the upgrade section.

The “runs on anything” option: 1.58-bit models

Two models worth knowing about if you’re exploring the extreme low end:

BitNet b1.58 2B4T — Microsoft’s ternary model uses weights of just {-1, 0, +1}. The non-embedding memory footprint is 0.4GB. It runs on CPU with no GPU required, 2-6x faster than comparable models on x86 processors. The catch: it requires Microsoft’s bitnet.cpp runtime. No Ollama, no LM Studio, no GGUF. Setup is manual. Quality is comparable to other 2B models — useful for simple tasks, not a replacement for 7B+.

Falcon-Edge 1.58-bit — TII’s ternary models come in 1B (665MB) and 3B (999MB). Same idea as BitNet: extreme compression for edge deployment. Benchmarks are lower than standard quantized models at the same parameter count, and they also need a custom runtime (no Ollama). Experimental, but the direction matters — ternary architectures may eventually make 8GB VRAM concerns irrelevant.

Neither model is ready for daily use today. But if you’re curious about where inference efficiency is headed, both are worth a look.
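The appeal of ternary weights is easy to demonstrate. With every weight restricted to {-1, 0, +1}, a matrix multiply collapses into selective additions and subtractions, with no multiplications at all. A toy NumPy sketch (illustrative only, not how bitnet.cpp is actually implemented):

```python
import numpy as np

# Toy demo of ternary (1.58-bit) weights: every weight is -1, 0, or +1,
# so matmul reduces to selectively adding and subtracting inputs.
rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8))  # ternary weight matrix, values in {-1, 0, 1}
x = rng.standard_normal(8)

y_matmul = W @ x  # ordinary matrix multiply
y_addsub = np.array([x[row == 1].sum() - x[row == -1].sum() for row in W])

assert np.allclose(y_matmul, y_addsub)
print("ternary matmul = add/subtract only:", y_addsub.round(3))
```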


Image Generation on 8GB

Good news: 8GB handles image generation better than you might expect.

Stable Diffusion 1.5: Runs Great

SD 1.5 is the sweet spot for 8GB cards. The model uses ~4GB VRAM, leaving plenty of headroom. Generation is fast and the community ecosystem (LoRAs, checkpoints, ControlNet) is massive.

| Resolution | Time (RTX 4060) | Notes |
| --- | --- | --- |
| 512x512 | ~5 seconds | Fast, plenty of headroom |
| 768x768 | ~10 seconds | Still comfortable |
| 1024x1024 | ~18 seconds | Pushing it, but works |

SDXL: Tight But Doable

SDXL produces noticeably better images but uses ~7-8GB VRAM for the base model alone. On 8GB, it works — with some conditions:

  • Use ComfyUI (more memory-efficient than AUTOMATIC1111)
  • Enable an FP16 VAE (drops VAE VRAM from ~6GB to under 1GB)
  • Enable xformers (25-30% speedup, slight memory savings)
  • Expect ~30-35 seconds per 1024x1024 image

The refiner model won’t fit alongside the base model. Run them sequentially, not simultaneously. For most people, the base model alone produces great results.

Flux: Limited

Flux’s schnell variant can technically run on 8GB with aggressive optimization, but it’s not a good experience. If Flux is your priority, you need 12GB+.


Best Models for 8GB GPUs (Ranked)

Here’s what to install first, in order:

1. Qwen 3.5 9B — The new default. Native vision, 262K context support, beats models 3x its size on reasoning. Tight fit at ~7GB but worth it.

ollama run qwen3.5:9b

2. Llama 3.1 8B Instruct — More VRAM headroom than Qwen 3.5 9B (~5.5GB). Still a strong all-rounder, especially if you need longer context on 8GB.

ollama pull llama3.1:8b

3. DeepSeek R1 8B — Best for reasoning and math. Uses a “thinking” approach that improves complex answers.

ollama pull deepseek-r1:8b

4. Mistral 7B — Fastest of the bunch at ~45 tok/s. Great for quick chat when speed matters more than benchmarks.

ollama pull mistral

5. Qwen 3.5 3B — When you need VRAM headroom. Leaves 5GB+ free for context, background apps, or running alongside image generation.

ollama pull qwen3.5:3b

New to Ollama? It’s one command to install, one command to run. If you prefer a visual interface, LM Studio works just as well.

→ Check what fits your hardware with our Planning Tool.


Tips to Squeeze Every Megabyte

1. Close Chrome (Seriously)

Chrome with a few tabs open can eat 500MB-1GB of VRAM for hardware acceleration. Close it before running models, or disable hardware acceleration in Chrome settings (chrome://settings/system).

2. Use Q4_K_S Instead of Q4_K_M

Q4_K_S is slightly smaller than Q4_K_M (about 200MB less for a 7B model) with minimal quality loss. On 8GB, that 200MB matters. It can be the difference between fitting with headroom and running out of memory.

For the difference between quantization formats, see our quantization guide.
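The size gap follows directly from the formats' average bits per weight. The ~4.6 and ~4.85 bits-per-weight figures below are approximate llama.cpp averages, used here only to show the arithmetic:

```python
# Approximate file-size difference between Q4_K_S and Q4_K_M for a 7B model.
# Bits-per-weight values are rough llama.cpp averages, not exact.

def file_gb(params_b, bits_per_weight):
    return params_b * bits_per_weight / 8  # params in billions -> GB

q4_k_s = file_gb(7, 4.6)
q4_k_m = file_gb(7, 4.85)
print(f"Q4_K_S ~{q4_k_s:.2f} GB, Q4_K_M ~{q4_k_m:.2f} GB, "
      f"difference ~{(q4_k_m - q4_k_s) * 1000:.0f} MB")
```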

3. Reduce Context Length

If you’re hitting memory limits, explicitly set a shorter context window:

# Ollama: set context to 2048 tokens inside a chat session
ollama run llama3.1:8b
>>> /set parameter num_ctx 2048

2048 tokens is plenty for most single-turn conversations. Only increase it if you actually need longer context.

4. Monitor VRAM Usage

Keep an eye on what’s happening:

# Check VRAM usage in real time
nvidia-smi -l 1

If you see VRAM hovering near 7.5-8GB, you’re on the edge. Consider a smaller model or lower quantization.

5. Kill Background GPU Processes

Video players, game launchers (Steam, Epic), and even some desktop compositors use VRAM. Check nvidia-smi for surprise consumers and close what you don’t need.


When to Upgrade (And What To)

You’ve outgrown 8GB when:

  • You constantly adjust context length to avoid OOM errors
  • You need 13B+ models for your work
  • You want to run image generation and LLMs without swapping
  • You’re waiting on CPU offloading and it’s painfully slow

Here’s the upgrade path, ranked by value:

| GPU | VRAM | Street Price (Jan 2026) | What It Unlocks | Best For |
| --- | --- | --- | --- | --- |
| Used RTX 3060 12GB | 12GB | ~$200 | 13B at Q4, comfortable 7B at Q6+ | Cheapest meaningful upgrade |
| RTX 5060 Ti 16GB | 16GB | ~$429-500 | 30B quantized, 7B-13B at high quants | New card sweet spot (if you can find one) |
| Used RTX 3090 | 24GB | ~$700-850 | 70B quantized, fine-tuning small models | VRAM king on a budget |
| RTX 4060 Ti 16GB | 16GB | ~$450-500 (used) | Same as 5060 Ti 16GB | Poor value vs. new 5060 Ti |

The RTX 3060 12GB at ~$200 is the cheapest step up — 50% more VRAM for the price of a nice dinner. But if you’re serious about local AI and can stretch the budget, the used RTX 3090 at $700-850 remains the cheapest path to 24GB. Twenty-four gigabytes opens up an entirely different tier of models.


The Bottom Line

8GB of VRAM is real. Not a demo. Not a toy. You can run capable 7B-8B language models at interactive speeds, generate images with Stable Diffusion, and do genuine local AI work — all without spending another dollar.

The practical advice:

  1. Install Ollama and run ollama run qwen3.5:9b. You’ll be chatting with a vision-capable model in under five minutes.
  2. Use Q4_K_M quantization and keep context to 4-8K tokens.
  3. Close Chrome and other VRAM hogs before running models.
  4. When 8GB isn’t enough — and you’ll know when — a used RTX 3090 is the move.

Your card is more capable than you think. Start using it.



Sources: DatabaseMart RTX 4060 Ollama Benchmark, LocalLLM.in 8GB VRAM Guide, XDA Developers Used RTX 3090, SDXL System Requirements, BestValueGPU Price Tracker