Running 70B Models Locally — Exact VRAM by Quantization
📚 More on this topic: VRAM Requirements Guide · Quantization Explained · Multi-GPU Guide · Mac vs PC for Local AI · Used RTX 3090 Buying Guide
Running a 70B model locally is the line between “hobby” and “serious local AI.” On the other side of that line is reasoning that competes with GPT-4 and the ability to process complex problems without sending your data to the cloud.
The barrier is VRAM. A 70B model at full precision needs 141GB of memory. No consumer GPU comes close to that. Quantization brings it down to 43GB at Q4, which still won’t fit on a single RTX 4090 or 3090. You need either two GPUs, a Mac with enough unified memory, or a workstation-class card.
This guide gives you exact VRAM numbers at every quantization level, which hardware setups actually work, realistic speed expectations, and an honest assessment of when 70B is worth the investment versus running a 32B model instead.
The 70B Math
The formula is simple:
VRAM (GB) = Parameters (billions) × Bytes per parameter
At FP16 (2 bytes per parameter): 70B × 2 = 140GB. That’s the model weights alone. Context and framework overhead are extra.
Quantization compresses those weights:
| Precision | Bytes per Param | Weight Size (70B) | With Overhead* |
|---|---|---|---|
| FP16 | 2.0 | 140 GB | ~142 GB |
| Q8_0 | 1.0 | 70 GB | ~75 GB |
| Q6_K | 0.75 | 52.5 GB | ~58 GB |
| Q5_K_M | 0.625 | 43.75 GB | ~50 GB |
| Q4_K_M | 0.5 | 35 GB | ~43 GB |
| Q3_K_M | 0.375 | 26.25 GB | ~35 GB |
| Q2_K | 0.25 | 17.5 GB | ~27 GB |
*Overhead includes KV cache at 4K context, framework memory, and CUDA/Metal context. Real GGUF files are slightly larger than the theoretical minimum due to metadata and mixed-precision layers.
The theoretical calculation gets you in the ballpark. Real file sizes are what matter. See the next section.
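If you want to run the same arithmetic for a different parameter count or quant level, it scripts in a few lines. This is a minimal sketch of the formula above, not a measurement tool; the bytes-per-parameter values come from the table, and the flat overhead allowance is an assumption:

```python
# Back-of-the-envelope VRAM estimate: weights plus a rough overhead allowance.
# Bytes-per-parameter values mirror the table above. Real GGUF files run a
# little larger because of metadata and mixed-precision layers.

BYTES_PER_PARAM = {
    "FP16": 2.0,
    "Q8_0": 1.0,
    "Q6_K": 0.75,
    "Q5_K_M": 0.625,
    "Q4_K_M": 0.5,
    "Q3_K_M": 0.375,
    "Q2_K": 0.25,
}

def estimate_vram_gb(params_billions: float, quant: str, overhead_gb: float = 5.0) -> float:
    """Weights plus a flat allowance (assumed ~5 GB) for KV cache at short
    context, framework buffers, and the CUDA/Metal context."""
    return params_billions * BYTES_PER_PARAM[quant] + overhead_gb

if __name__ == "__main__":
    for quant in BYTES_PER_PARAM:
        print(f"70B @ {quant}: ~{estimate_vram_gb(70, quant):.0f} GB")
```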
Exact VRAM: Llama 3.3 70B and Qwen 2.5 72B
These are the two 70B-class models most people run locally. Numbers from actual GGUF builds on HuggingFace:
Llama 3.3 70B Instruct
| Quantization | File Size | VRAM Needed (4K ctx) | VRAM Needed (8K ctx) |
|---|---|---|---|
| FP16 | 141.1 GB | ~143 GB | ~148 GB |
| Q8_0 | 75.0 GB | ~77 GB | ~82 GB |
| Q6_K | 57.9 GB | ~60 GB | ~65 GB |
| Q5_K_M | 50.0 GB | ~52 GB | ~57 GB |
| Q4_K_M | 42.5 GB | ~45 GB | ~50 GB |
| Q3_K_M | 34.3 GB | ~37 GB | ~42 GB |
| Q2_K | 26.4 GB | ~29 GB | ~34 GB |
Qwen 2.5 72B Instruct
| Quantization | File Size | VRAM Needed (4K ctx) | VRAM Needed (8K ctx) |
|---|---|---|---|
| Q8_0 | 77.3 GB | ~79 GB | ~84 GB |
| Q6_K | 64.4 GB | ~66 GB | ~71 GB |
| Q5_K_M | 54.5 GB | ~57 GB | ~62 GB |
| Q4_K_M | 47.4 GB | ~50 GB | ~55 GB |
| Q3_K_M | 37.7 GB | ~40 GB | ~45 GB |
| Q2_K | 29.8 GB | ~32 GB | ~37 GB |
Qwen 2.5 72B is roughly 10% larger than Llama 3.3 70B at most quantization levels because it has 72.7 billion parameters versus 70.6 billion, plus slightly different architectural choices. Both produce similar quality at the same quant level.
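Rather than trusting the theoretical math, you can list the actual GGUF files in a repo before downloading anything. A hedged sketch using the huggingface_hub library; the repo ID is just an example, so substitute whichever GGUF build you actually use:

```python
# List GGUF files and their sizes in a HuggingFace repo before downloading.
# Requires: pip install huggingface_hub
from huggingface_hub import HfApi

REPO_ID = "bartowski/Llama-3.3-70B-Instruct-GGUF"  # example repo; swap in your own

api = HfApi()
for entry in api.list_repo_tree(REPO_ID, recursive=True):
    size = getattr(entry, "size", None)  # folders have no size attribute
    if size and entry.path.endswith(".gguf"):
        print(f"{entry.path}: {size / 1e9:.1f} GB")

# Quants above ~50 GB are usually split into multiple parts
# (e.g. -00001-of-00002.gguf); add the parts together for the real total.
```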
For a deeper understanding of what these quantization levels mean and how they affect output quality, see our quantization explainer.
Context Length Eats Your VRAM
The tables above assume 4K or 8K context, but both Llama 3.3 70B and Qwen 2.5 72B advertise 128K-token context windows. The KV cache (where the model stores attention state for the conversation) grows linearly with context length.
KV Cache VRAM at 70B Scale
| Context Length | KV Cache (FP16) | KV Cache (Q8) | KV Cache (Q4) |
|---|---|---|---|
| 4K tokens | ~1.2 GB | ~0.6 GB | ~0.3 GB |
| 8K tokens | ~2.5 GB | ~1.2 GB | ~0.6 GB |
| 16K tokens | ~5 GB | ~2.5 GB | ~1.2 GB |
| 32K tokens | ~10 GB | ~5 GB | ~2.5 GB |
| 64K tokens | ~20 GB | ~10 GB | ~5 GB |
| 128K tokens | ~40 GB | ~20 GB | ~10 GB |
These figures assume the Llama 3 70B attention layout: 80 layers, 8 KV heads (grouped-query attention), and a head dimension of 128. Qwen 2.5 72B uses the same KV layout, so its numbers are nearly identical.
At 32K context with an FP16 KV cache, you’re adding roughly 10GB on top of the model weights. On a dual RTX 3090 setup (48GB total) running Llama 3.3 70B Q4_K_M (a 42.5GB file), that leaves about 5.5GB for everything else. A 10GB KV cache alone puts you over.
This is why most 70B setups run with 4K-8K context and why the 128K advertised context length is mostly theoretical for consumer hardware. You can extend it with quantized KV cache (Ollama and llama.cpp both support this), but even then, 32K+ context on 48GB total VRAM is tight.
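If you want to check these numbers for a different context length or architecture, the KV cache is simple arithmetic: keys plus values, per layer, per KV head, per token. A minimal sketch with the Llama 3 70B layout as the assumed defaults:

```python
# KV cache bytes = 2 (keys + values) x layers x kv_heads x head_dim x bytes/element x tokens.
# Defaults assume the Llama 3 70B layout: 80 layers, 8 KV heads (GQA), head_dim 128.

def kv_cache_gb(tokens: int, n_layers: int = 80, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: float = 2.0) -> float:
    """FP16 KV cache uses 2 bytes per element; Q8 roughly 1, Q4 roughly 0.5."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * tokens
    return total_bytes / 1024**3

if __name__ == "__main__":
    for ctx in (4_096, 8_192, 16_384, 32_768, 65_536, 131_072):
        print(f"{ctx:>7} tokens: FP16 {kv_cache_gb(ctx):5.1f} GB | "
              f"Q8 {kv_cache_gb(ctx, bytes_per_elem=1.0):5.1f} GB | "
              f"Q4 {kv_cache_gb(ctx, bytes_per_elem=0.5):5.1f} GB")
```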
Hardware That Can Actually Run 70B
Single Consumer GPUs: Mostly Can’t
| GPU | VRAM | Best 70B Quant | Context | Verdict |
|---|---|---|---|---|
| RTX 4060 (8GB) | 8 GB | None | — | Not happening |
| RTX 3060 12GB | 12 GB | None | — | Not happening |
| RTX 4060 Ti 16GB | 16 GB | None | — | Not happening |
| RTX 3090 / 4090 | 24 GB | None (Q2_K is 26.4GB) | — | Doesn’t fit even at Q2 |
| RTX 5090 | 32 GB | Q2_K or Q3_K_M | ~4K tokens | Technically works. Quality is poor at Q2, marginal at Q3. |
The RTX 5090 is the only consumer GPU that can load a 70B model at all. Q3_K_M (34.3GB file) fits with about 4K context, but quality degrades noticeably at Q3 and you have zero headroom. It’s a proof-of-concept, not a daily driver.
Dual GPU Setups
This is where 70B becomes practical on consumer hardware. Two GPUs pool their VRAM.
| Setup | Total VRAM | Best Quant | Context | Speed | Cost (Feb 2026) |
|---|---|---|---|---|---|
| 2× RTX 3090 | 48 GB | Q4_K_M | ~4-8K | 16-21 tok/s | ~$1,700 |
| 2× RTX 4090 | 48 GB | Q4_K_M | ~4-8K | 20-25 tok/s | ~$3,200+ |
| 2× RTX 5090 | 64 GB | Q4_K_M | ~16-32K | 25-30 tok/s | ~$4,000+ |
A dual RTX 3090 setup ($1,700 total) is the budget path. 48GB runs Llama 3.3 70B at Q4_K_M with 4-8K context. You get 16-21 tokens per second, which is readable but noticeably slower than the 40+ tok/s you’d get from a 32B model on a single card. See our multi-GPU guide for setup instructions.
Dual RTX 5090s ($4,000+) with 64GB total opens up longer context. Q4_K_M with 16-32K tokens is comfortable, and Q5_K_M becomes viable for better quality.
Both setups require a motherboard with two PCIe x16 slots (or at least x16 + x8), a 1000W+ power supply, and good airflow. Two 3090s at full inference draw 700+ watts combined.
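To make the pooling concrete, here is a minimal llama-cpp-python sketch that splits a 70B GGUF across two cards. The model path, split ratio, and context size are illustrative assumptions, not a recommended configuration; Ollama handles the equivalent split automatically when it sees two GPUs.

```python
# Minimal sketch: load a 70B GGUF across two GPUs with llama-cpp-python.
# Requires a CUDA-enabled build of llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="Llama-3.3-70B-Instruct-Q4_K_M.gguf",  # illustrative local path
    n_gpu_layers=-1,           # offload every layer; no CPU spillover
    tensor_split=[0.5, 0.5],   # split weights evenly across two 24GB cards
    n_ctx=8192,                # keep context modest so the KV cache still fits
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize the tradeoffs of Q4_K_M at 70B."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```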
Workstation / Datacenter GPUs
| GPU | VRAM | Best Quant | Context | Speed | Price |
|---|---|---|---|---|---|
| A6000 | 48 GB | Q4_K_M | ~4-8K | 12-16 tok/s | ~$2,200 used |
| A100 80GB | 80 GB | Q5_K_M | ~16K+ | 19-22 tok/s | ~$8,000+ used |
The A6000 at $2,200 used gives you the same 48GB as dual 3090s in a single card, with no multi-GPU hassle. But it’s slower for inference (lower memory bandwidth) and costs about $500 more.
Mac (Unified Memory)
This is where Macs win. Unified memory lets the GPU use the system RAM pool as model memory, though macOS reserves a slice for itself, so the usable figure sits somewhat below the nameplate capacity.
| Config | Unified Memory | Best Quant | Context | Speed | Price |
|---|---|---|---|---|---|
| Mac Mini M4 Pro 48GB | 48 GB | Q4_K_M | ~4-8K | 6-8 tok/s | ~$1,900 |
| Mac Studio M4 Max 64GB | 64 GB | Q4_K_M | ~16K | 8-10 tok/s | ~$2,500 |
| Mac Studio M4 Max 128GB | 128 GB | Q6_K | ~32K+ | 10-12 tok/s | ~$4,100 |
Mac speeds are slower than dual NVIDIA GPUs because unified memory bandwidth (273-546 GB/s depending on chip) is lower than GDDR6X (936 GB/s per 3090). But the Mac can load the model in the first place, which no single 24GB GPU can. And it does so silently, at around 15 watts idle.
The M4 Max 128GB at $4,100 is the most comfortable 70B experience. Q6_K with long context, no fan noise, no multi-GPU setup. The tradeoff is speed. See our Mac vs PC comparison for the full breakdown.
Speed Expectations
70B models are slow. Set your expectations accordingly.
| Hardware | Llama 3.3 70B Q4_K_M | Context |
|---|---|---|
| 2× RTX 3090 (48GB) | 16-21 tok/s | 4-8K |
| 2× RTX 4090 (48GB) | 20-25 tok/s | 4-8K |
| 2× RTX 5090 (64GB) | 25-30 tok/s | 16K |
| Mac M4 Max 128GB | 10-12 tok/s | 32K |
| Mac M4 Max 64GB | 8-10 tok/s | 8K |
| A100 80GB | 19-22 tok/s | 16K |
| Single 24GB GPU + CPU offload | 1-5 tok/s | 4K |
For comparison, Qwen 3 32B at Q4_K_M on a single RTX 3090 runs at 35-45 tok/s. A 70B model on dual 3090s runs at about half that speed while requiring twice the hardware.
CPU offloading (splitting the model between GPU and system RAM) technically works but is painfully slow. The PCIe bus becomes the bottleneck, dropping generation to 1-5 tok/s. At that speed, you’re waiting 10-20 seconds for a single sentence. It’s fine for testing. It’s not usable for daily work.
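To turn tokens per second into actual wait time, divide the response length by the generation speed (prompt processing adds more on long inputs). A quick sketch with illustrative response lengths and the midpoint speeds from the table above:

```python
# Rough wait time for a full response: tokens to generate / generation speed.
# Ignores prompt processing (prefill), which adds extra time on long inputs.
setups = {
    "2x RTX 3090, 70B Q4_K_M": 18,      # tok/s, midpoints of the table above
    "Mac M4 Max 128GB, 70B Q6_K": 11,
    "Single RTX 3090, 32B Q4_K_M": 40,
    "24GB GPU + CPU offload, 70B": 3,
}

for name, tok_per_s in setups.items():
    paragraph = 60 / tok_per_s      # ~60-token paragraph
    long_answer = 500 / tok_per_s   # ~500-token detailed answer
    print(f"{name}: ~{paragraph:.0f}s for a paragraph, ~{long_answer:.0f}s for a long answer")
```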
Quality vs Quantization at 70B
Good news: 70B models tolerate quantization better than smaller models. Quantization studies consistently find that models above 30B parameters retain roughly 99% of FP16 accuracy at 4-bit, while 7B models can lose 2-5%.
Quality by Quant Level
| Quantization | Quality Retention | Best For |
|---|---|---|
| Q8_0 | ~99.5% of FP16 | Maximum quality when VRAM allows |
| Q6_K | ~99% of FP16 | Excellent. Hard to distinguish from Q8 in practice |
| Q5_K_M | ~97-99% of FP16 | Great balance. Most users won’t notice the difference |
| Q4_K_M | ~95-97% of FP16 | The sweet spot. Minor degradation on complex reasoning |
| Q3_K_M | ~90-93% of FP16 | Noticeable. Reasoning and math tasks suffer first |
| Q2_K | ~80-85% of FP16 | Severe. Unpredictable behavior on hard problems. Skip this. |
Q4_K_M is the recommendation for almost everyone running 70B locally. The 3-5% quality loss versus FP16 is barely perceptible in normal use. You’d need benchmark suites to measure the difference reliably. The VRAM savings (142GB down to 43GB) make it the only practical option on consumer hardware.
Q3_K_M is where you start noticing. Math problems that Q4 handles cleanly will occasionally fail at Q3. Multi-step reasoning chains break more often. If you’re running on an RTX 5090 and Q3 is your only option, it works. Just know you’re leaving quality on the table.
Q2_K is not worth running. At 70B, even the higher quantization tolerance can’t save Q2 from significant output degradation. If Q2 is your only option, run a 32B model at Q4 instead. You’ll get better results.
When 70B Is Worth It
Run 70B For:
Complex reasoning. Multi-step logic problems, mathematical proofs, scientific analysis. The gap between 32B and 70B is widest here. A 70B model at Q4 catches errors and follows chains of reasoning that a 32B model misses.
Deep research and analysis. Summarizing long documents, comparing multiple sources, identifying inconsistencies. 70B models have broader knowledge and make fewer factual errors.
Nuanced writing. When you need precise tone control, subtle arguments, or professional-grade output. 70B models handle ambiguity and subtext better.
Skip 70B For:
Quick chat and Q&A. A 32B model answers “what’s the capital of France” just as correctly, 3-4x faster.
Simple code generation. For boilerplate, function scaffolding, and straightforward coding tasks, 32B coding models (Qwen 2.5 Coder 32B, DeepSeek-Coder-V2) are more than sufficient and much faster.
Anything speed-sensitive. If you need responses in under 2 seconds, 70B won’t deliver. A 32B model at 40 tok/s starts generating immediately. A 70B model at 15 tok/s has noticeable latency.
The 32B Alternative
This is the honest question: in 2026, do you need 70B?
| | Qwen 3 32B (Q4_K_M) | Llama 3.3 70B (Q4_K_M) |
|---|---|---|
| VRAM needed | ~20 GB | ~43 GB |
| Hardware | Single RTX 3090 ($850) | Dual RTX 3090 ($1,700) |
| Speed | 35-45 tok/s | 16-21 tok/s |
| Benchmark quality | ~85-90% of 70B | Baseline |
| Complex reasoning | Good | Better |
| Creative writing | Competitive (85% human preference) | Good |
| Coding | Strong (DeepSeek R1 Distill 32B leads some benchmarks) | Strong |
The gap has narrowed. Qwen 3 32B and DeepSeek-R1-Distill-Qwen-32B compete with 70B models on many benchmarks while using less than half the VRAM. On creative writing, Qwen 3 32B actually gets 85% human preference over larger models. On coding, DeepSeek R1 Distill 32B leads Llama 3.3 70B on several benchmarks.
70B still wins on complex multi-step reasoning and factual depth. If that’s your primary use case, the hardware investment is justified. For everything else, a 32B model on a single GPU is faster and cheaper, with nearly the same quality.
Bottom Line
Running 70B locally requires either dual GPUs (2× RTX 3090 at $1,700), a Mac with 64GB+ unified memory ($2,000+), or a datacenter card. Q4_K_M is the quantization sweet spot: 43GB for Llama 3.3 70B, excellent quality retention. Below Q4, quality drops noticeably. Below Q3, don’t bother.
The practical setup for most people: dual RTX 3090s with Llama 3.3 70B at Q4_K_M. You get 16-21 tok/s with 4-8K context. It’s slower than a 32B model and costs twice as much in hardware. But for complex reasoning and research, the quality difference is real.
If you’re not sure whether you need 70B, start with Qwen 3 32B on a single 24GB GPU. It handles 80-90% of tasks just as well. Upgrade to 70B when you consistently hit the quality ceiling on reasoning-heavy work.
Related Guides
- How Much VRAM Do You Need? — full VRAM chart for every model size
- What Quantization Actually Means — how Q4, Q6, Q8 affect quality
- Multi-GPU Local AI — tensor vs pipeline parallelism, setup guides
- Used RTX 3090 Buying Guide — the best 24GB GPU for budget AI
- Mac vs PC for Local AI — unified memory vs discrete VRAM
- Llama 3 Guide: Every Size — which Llama to pick
- Building a Distributed AI Swarm — multi-node alternative
- What Can You Run on 24GB VRAM? — the 32B sweet spot