VRAM
Why Your Local LLM Is Slow: The num_ctx VRAM Overflow Nobody Warns You About
DeepSeek-R1 14B went from 35 tok/s to 4.8 tok/s on the same GPU. The fix was one parameter. How num_ctx silently overflows VRAM and kills inference speed.
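The short version, for anyone who just wants the knob: in Ollama the context window is the num_ctx option, and oversizing it is what pushes the KV cache past your VRAM. A minimal sketch against a local Ollama server; the model tag and the 8192 value are illustrative, not a recommendation.

```python
import requests

# Ask Ollama for a completion with an explicit context window.
# num_ctx controls how much KV cache gets allocated; setting it larger
# than your VRAM can hold spills layers to system RAM and tanks tok/s.
# Model tag and num_ctx value here are placeholders.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-r1:14b",
        "prompt": "Summarize the KV cache in one sentence.",
        "options": {"num_ctx": 8192},  # keep this within your VRAM budget
        "stream": False,
    },
    timeout=300,
)
print(resp.json()["response"])
```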
Qwen 3.5 Small Models: The 9B Beats Last-Gen 30B — Here's What Matters for Local AI
Alibaba's Qwen 3.5 drops 4 small models (0.8B to 9B) — all natively multimodal, 262K context, Apache 2.0. The 9B beats Qwen3-30B on reasoning and destroys GPT-5-Nano on vision. VRAM tables and what to run.
Best 8GB GPU Model: How to Set Up Qwen 3.5 9B (Step by Step)
Qwen 3.5 9B fits in 6.6GB and beats models 3x its size. Complete setup with Ollama, benchmarks, and real-world testing on RTX 3060 and 4060.
RTX 5060 Ti Review for Local AI — The New Budget King
Real benchmarks for the RTX 5060 Ti 16GB running local LLMs. Qwen 3.5 35B at 44 tok/s, 100K context for ~$430. Compared against RTX 3060, 3090, and 4060 Ti.
Best Qwen 3.5 Models Ranked: Every Size, Every GPU, Every Quant
Complete ranking of all Qwen 3.5 models from 0.8B to 397B. VRAM requirements, speed benchmarks, and which model to pick for your hardware.
Qwen 3.5 Locally — 27B vs 35B-A3B vs 122B, Which Model Fits Your GPU
Qwen 3.5 27B dense vs 35B-A3B MoE vs 122B-A10B compared for local inference. VRAM tables, tok/s benchmarks on RTX 3090 and Mac, thinking mode setup, and which to pick for your hardware.
The 8GB VRAM Trap: What 'Runs on 8GB' Actually Means
Every local AI tutorial says 'runs on 8GB!' — and technically it does. What they don't tell you about quantization cliffs, tiny context windows, and why a $275 used GPU changes everything.
Best Qwen 3.5 Setup: Which Model Fits Your GPU (Complete Cheat Sheet)
Pick the right Qwen 3.5 model for your hardware. Covers 0.8B through 397B with VRAM requirements, quant recommendations, and benchmarks for every GPU tier.
MoE Models Explained: Why Mixtral Uses 46B Parameters But Runs Like 13B
Mixture of Experts explained for local AI — why MoE models run fast but still need full VRAM. Mixtral, DeepSeek V3, DBRX compared with dense model alternatives.
KV Cache: Why Context Length Eats Your VRAM (And How to Fix It)
The KV cache is why your 8B model OOMs at 32K context. Full formula, worked examples for popular models, and 6 optimization techniques to cut KV VRAM usage.
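The formula in that piece is short enough to sanity-check here. A rough sketch assuming an FP16 cache and Llama-3-8B-style shape numbers (32 layers, 8 KV heads, head dim 128), which are illustrative:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """Approximate KV cache size: keys + values, for every layer and position."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * ctx_len

# Llama-3-8B-style numbers at a 32K context window:
print(kv_cache_bytes(32, 8, 128, 32_768) / 2**30, "GiB")  # ~4.0 GiB on top of the weights
```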
Building AI Agents with Local LLMs: A Practical Guide
Build AI agents with local LLMs using Ollama and Python. Model requirements, VRAM budgets, framework comparison, working code example, and security warnings.
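A taste of what the guide walks through: a minimal agent loop against a local Ollama server. The model tag and the single hand-rolled tool are placeholders for illustration, not the article's own code.

```python
from datetime import datetime

import requests

OLLAMA = "http://localhost:11434/api/chat"

def local_time() -> str:
    """A stand-in tool the agent can call."""
    return datetime.now().isoformat(timespec="seconds")

# Minimal agent loop: send the conversation, and if the model replies with the
# literal line TOOL:local_time, run the tool and feed the result back.
messages = [
    {"role": "system", "content": "If you need the current time, reply with exactly TOOL:local_time."},
    {"role": "user", "content": "What time is it right now?"},
]
for _ in range(3):  # hard cap on agent steps
    resp = requests.post(
        OLLAMA,
        json={"model": "qwen3:8b", "messages": messages, "stream": False},
        timeout=300,
    )
    reply = resp.json()["message"]["content"].strip()
    messages.append({"role": "assistant", "content": reply})
    if reply == "TOOL:local_time":
        messages.append({"role": "user", "content": f"Tool result: {local_time()}"})
        continue
    print(reply)
    break
```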
SmarterRouter: A VRAM-Aware LLM Gateway for Your Local AI Lab
Intelligent router that profiles your models, manages VRAM, caches responses semantically, and auto-picks the best model per prompt. Works with Ollama and llama.cpp.
Memory Leak in Long Conversations: Causes and Fixes
VRAM climbs with every message until your model crashes? It's probably KV cache growth, not a leak. How to diagnose, monitor, and fix memory issues in local LLMs.
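If you want to watch this happen, a small monitor loop is enough. This sketch assumes an NVIDIA GPU with nvidia-smi on PATH.

```python
import subprocess
import time

# Poll VRAM usage while a long conversation runs. If the number climbs with
# every message and never falls back, you're watching KV cache growth, not a leak.
while True:
    used = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        text=True,
    ).strip()
    print(f"VRAM used: {used} MiB")
    time.sleep(5)
```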
CUDA Out of Memory: What It Means and How to Fix It
CUDA out of memory means your model doesn't fit in VRAM. Seven fixes ranked by effort — context length, KV cache quantization, model quant, CPU offload — with tool-specific commands for Ollama, llama.cpp, and LM Studio.
Context Length Exceeded: What To Do When Your Model Runs Out of Space
Model forgetting earlier messages or throwing context errors? How context length works, what happens when it fills, and practical fixes for chat, RAG, and coding.
Mixtral VRAM Requirements: 8x7B and 8x22B at Every Quantization Level
Mixtral 8x7B has 46.7B params but only 12.9B activate per token. You still need VRAM for all 46.7B. Exact VRAM for every quant from Q2 to FP16.
Qwen3 Complete Guide: Every Model from 0.6B to 235B
Qwen3 is the best open model family for budget local AI. Dense models from 0.6B to 32B, MoE models that punch above their weight, and a /think toggle no one else has.
Llama 4 vs Qwen3 vs DeepSeek V3.2: Which to Run Locally in 2026
Llama 4 needs 55GB. DeepSeek V3.2 needs 350GB. Qwen3 runs on 8GB. Here's who wins at each VRAM tier and use case for local AI in 2026.
Llama 4 Guide: Running Scout and Maverick Locally
Complete Llama 4 guide for local AI — Scout (109B MoE, 17B active) and Maverick (400B). VRAM requirements, Ollama setup, benchmarks, and honest hardware reality check.
GPT-OSS Guide: OpenAI's First Open Model for Local AI
GPT-OSS 20B is OpenAI's first open-weight model. MoE with 3.6B active params, MXFP4 at 13GB, 128K context, Apache 2.0. Here's how to run it.
DeepSeek V3.2 Guide: What Changed and How to Run It Locally
DeepSeek V3.2 competes with GPT-5 on benchmarks. The full model needs 350GB+ VRAM. But the R1 distills run on a $200 used GPU — and they're shockingly good.
Running 70B Models Locally — Exact VRAM by Quantization
Llama 3.3 70B needs 43GB at Q4, 75GB at Q8, 141GB at FP16. Here's every quant level, which GPUs fit, real speeds, and when 32B is the smarter choice.
Mixtral 8x7B & 8x22B VRAM Requirements
Mixtral 8x7B and 8x22B VRAM requirements at every quantization level. Exact numbers from Q2 to FP16, GPU recommendations, and KV cache impact explained.
Local AI Video Generation: What Works in 2026
Wan 2.2 leads on quality, LTX-Video renders 5-second clips in 4 seconds, and 12GB VRAM is the minimum. Speed benchmarks, VRAM charts, and setup for 7 models on consumer GPUs.
Are Mistral Models Still Worth Running? Only Nemo 12B (Here's Why)
Mistral led local AI in 2024; in 2026, Qwen 3 and Llama 3 beat it on most benchmarks. The exception: Mistral Nemo 12B, whose 128K context still earns a slot. What's worth running, what's been replaced, and when to pick Mistral over the competition.
Qwen Models Guide: The AI Family You're Missing
Complete Qwen models guide covering Qwen 3.5, Qwen 3, Qwen 2.5 Coder, and Qwen-VL. VRAM requirements, Ollama setup, Gated DeltaNet architecture, and benchmarks vs Llama and DeepSeek.
Llama 3 Guide: Every Size from 1B to 405B
Complete Llama 3 guide covering every model from 1B to 405B. VRAM requirements, Ollama setup, benchmarks vs Qwen 3, and which size fits your hardware.
Laptop vs Desktop for Local AI: Which Should You Buy?
A $750 desktop RTX 3090 gives you 24GB VRAM. The same money in a gaming laptop gets 8GB. MacBooks break the rules with 48GB+ unified memory for 70B models.
What Can You Actually Run on 16GB VRAM?
13B-14B models hit 22-53 tok/s at Q4-Q6, Flux runs at FP8, and 20B models squeeze in with short context. Where 16GB beats 12GB, where it trails 24GB, and the best cards at this tier.
Context Length Explained: Why It Eats Your VRAM
What context length actually means for local LLMs, how it affects VRAM usage, practical limits for different hardware, and when you actually need 128K+ tokens.
What Can You Actually Run on 24GB VRAM?
32B models at 25-38 tok/s, 70B at Q3 with limited context, Flux at full FP16, and LoRA fine-tuning. RTX 3090 at $700 vs 4090 at $1,800 — every model that fits and which GPU to buy.
What Can You Actually Run on 8GB VRAM?
7B-8B models hit 35-42 tok/s at Q4, SD 1.5 runs great, SDXL is tight but doable. Nothing above 13B fits. Every model that works on RTX 4060 and 3060 Ti, plus the best upgrade path.
What Can You Actually Run on 12GB VRAM?
14B models at Q4 hit 25-32 tok/s, 7B-8B run at near-lossless Q6-Q8, and SDXL generates without workarounds. Every model that fits on an RTX 3060 12GB and the best upgrade path.
Quantization Explained: What It Means for Local AI
Q4_K_M shrinks a 7B model from 14GB to ~4GB while keeping 90-95% quality. What every quantization format means, how much VRAM each saves, and which to pick for your GPU.
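The headline number is easy to reproduce. A back-of-the-envelope sketch; the ~4.5 bits/weight figure for Q4_K_M is approximate, and real GGUF files carry some extra overhead for embeddings and quantization scales.

```python
def weight_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (ignores format overhead)."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

# A 7B model: FP16 vs a ~4.5 bit/weight Q4_K_M-style quant (approximate).
print(f"FP16:   {weight_gb(7, 16):.1f} GB")   # ~14 GB
print(f"Q4_K_M: {weight_gb(7, 4.5):.1f} GB")  # ~3.9 GB, before KV cache and overhead
```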
Best VRAM Cheat Sheet for Local LLMs: Every Model, Every Quant
Exact VRAM for Qwen 3.5, Llama, Mistral, and DeepSeek at Q3 through FP16. Lookup tables for 7B, 9B, 13B, 27B, 32B, 70B, and 120B models with real measurements and GPU recommendations. Updated March 2026.
RTX 5060 Ti 16GB Killed? Local AI Alternatives
The RTX 5060 Ti 16GB faces production cuts from GDDR7 shortages. What's actually happening with supply, and the best alternative GPUs for local AI in 2026.