Local AI Planning Tool

Figure out what you need to run AI locally, whether you're starting with hardware, a model, or a problem to solve.

🔧 I Have Hardware
🧠 I Want a Model
🎯 I Need to Solve a Problem

Select Your Hardware

👆 Pick your device to see what models fit and what you can build

What do you need AI to do?

👆 Pick a use case to see recommended models, hardware requirements, and setup steps

Why This Tool Exists

Most VRAM calculators ask you to pick a model and show you a number. But most people don't start with a model; they start with hardware they already own, or a problem they need to solve. This tool works all three ways.

"I have hardware" shows every model that fits your device, grouped by what you can build with it. "I want a model" gives precise VRAM calculations using real architecture specs. "I need to solve a problem" recommends the right model and stack for your use case, with exact hardware requirements.

The formula is empirically validated against real llama.cpp measurements: VRAM (GB) = (P × bw) + (0.55 + 0.08 × P) + KV cache, where P is the parameter count in billions, bw is bytes per weight at your chosen quantization, and the KV cache term uses each model's actual layer count, KV head count, and head dimension.
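As a minimal sketch, the formula above translates directly into code (assuming P is in billions of parameters and the overhead term is in GB; the example numbers are illustrative, not measurements):

```python
def estimate_vram_gb(params_b: float, bytes_per_weight: float, kv_cache_gb: float) -> float:
    """Estimate total VRAM in GB for serving a model.

    params_b: parameter count in billions (e.g. 8 for an 8B model)
    bytes_per_weight: ~2.0 for FP16, ~0.6 for a 4-bit quant like Q4_K_M (assumed)
    kv_cache_gb: KV cache size at the target context length
    """
    weights_gb = params_b * bytes_per_weight   # (P x bw)
    overhead_gb = 0.55 + 0.08 * params_b       # runtime buffers, scales with model size
    return weights_gb + overhead_gb + kv_cache_gb

# Hypothetical example: an 8B model at a ~4-bit quant with a 2 GB KV cache
print(round(estimate_vram_gb(8, 0.6, 2.0), 2))  # prints 7.99
```

Note that the weights term dominates at short context, while the KV cache term takes over as context grows.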

Common Questions

Can I run LLMs on a Raspberry Pi? Yes: Gemma 3 1B and Qwen 0.5B run on a Pi 5 with 8GB RAM. Expect 5-15 tokens/sec on CPU. Best for simple tasks, classification, or edge voice assistants.

What's the best GPU for local AI on a budget? A used RTX 3090 (24GB, ~$600-700) is the sweet spot. Runs 32B models at Q4 and 70B at Q2-Q3.

Do MoE models use less VRAM? No. All expert parameters must reside in VRAM, even though only a few experts fire per token. MoE models run faster per token but use the same memory as a dense model with the same total parameter count.
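A quick back-of-the-envelope check, using Mixtral 8x7B's published sizes (~46.7B total parameters, ~12.9B active per token) as the example; the ~0.6 bytes-per-weight figure for a 4-bit quant is an assumption:

```python
def moe_weights_gb(total_params_b: float, bytes_per_weight: float) -> float:
    """VRAM for MoE weights is driven by TOTAL params, not active params."""
    return total_params_b * bytes_per_weight

total_b, active_b = 46.7, 12.9   # Mixtral 8x7B published figures
bpw = 0.6                        # ~4-bit quant, assumed

print(f"loaded in VRAM: {moe_weights_gb(total_b, bpw):.1f} GB")      # all experts resident
print(f"compute per token acts like a ~{active_b:.0f}B dense model")  # speed, not memory
```

So at a 4-bit quant you budget ~28 GB of VRAM, even though each token only touches ~13B parameters' worth of compute.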

How does context length affect VRAM? KV cache grows linearly with context. At 128K tokens on Llama 8B, KV cache alone is ~8 GB, more than the model weights. Always check VRAM at your target context length.
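The KV cache term can be sketched from each model's architecture. A minimal version, assuming Llama 3 8B's published specs (32 layers, 8 KV heads via GQA, head dimension 128) and an 8-bit KV cache; an fp16 cache doubles the result:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, bytes_per_elem: float) -> float:
    """KV cache size: 2 tensors (K and V) per layer, per KV head, per position."""
    elems = 2 * layers * kv_heads * head_dim * context
    return elems * bytes_per_elem / 1024**3

# Llama 3 8B architecture at 128K context with an 8-bit cache (assumed)
print(round(kv_cache_gb(32, 8, 128, 131072, 1.0), 1))  # prints 8.0
```

Because every factor is a multiplier, halving context or cache precision halves the cache; this is why long-context budgets are dominated by the KV cache rather than the weights.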

What about phones and edge devices? Gemma 3n was designed for mobile, running with as little as 2-3 GB. Combined with Whisper for speech-to-text, you can build a fully offline voice assistant on a modern phone.

📚 VRAM Requirements Guide · GPU Buying Guide · Quantization Explained · What Runs on 24GB? · 16GB? · 8GB? · Coding Models · Voice Chat · RAG Guide