Guides
160+ practical guides for running AI locally — from first install to advanced optimization.
Most Popular
- Apple M5 Pro and M5 Max: What 4x Faster LLM Processing Actually Means for Local AI M5 Pro hits 307GB/s, M5 Max doubles to 614GB/s. Neural Accelerators in every GPU core. 128GB runs 70B+ models on a laptop. What actually changes for local AI.
- Best 8GB GPU Model: How to Set Up Qwen 3.5 9B (Step by Step) Qwen 3.5 9B fits in 6.6GB and beats models 3x its size. Complete setup with Ollama, benchmarks, and real-world testing on RTX 3060 and 4060.
- Qwen 3.5 Small Models: The 9B Beats Last-Gen 30B — Here's What Matters for Local AI Alibaba's Qwen 3.5 drops 4 small models (0.8B to 9B) — all natively multimodal, 262K context, Apache 2.0. The 9B beats Qwen3-30B on reasoning and destroys GPT-5-Nano on vision. VRAM tables and what to run.
- Best Anime and Stylized Checkpoints for Local Image Generation (2026) Illustrious XL, NoobAI-XL, Animagine, Pony Diffusion, and SD 1.5 anime models compared. VRAM requirements, Danbooru prompting, LoRA picks, and settings for ComfyUI and A1111.
- Best Photorealism Checkpoints for Local Image Generation (2026) Juggernaut XL, RealVisXL, Realistic Vision, and Flux compared for photorealistic AI images. VRAM requirements, recommended settings, sample prompts, and installation for ComfyUI and A1111.
- Replace GitHub Copilot With Local LLMs in VS Code — Free, Private, No Subscription Set up free, private AI code completion in VS Code with Continue + Ollama. Autocomplete, chat, and agentic coding with Qwen models at every VRAM tier. Step-by-step setup, model picks, honest tradeoffs.
- Best Qwen 3.5 Models Ranked: Every Size, Every GPU, Every Quant Complete ranking of all Qwen 3.5 models from 0.8B to 397B. VRAM requirements, speed benchmarks, and which model to pick for your hardware.
- DeepSeek V4: Everything We Know Before It Drops DeepSeek V4 launches next week with native image and video generation, 1M context, and rumored 1T MoE params with only 32B active. Here's what local AI builders need to know and how to prepare.
- OpenClaw Security Report: February 2026 — ClawHub Malware, Google Suspensions, and Critical Fixes 17 security fixes, 341 malicious ClawHub skills, Google banning users, and the creator leaving. Every OpenClaw security event from February 2026.
- Best Qwen 3.5 Setup: Which Model Fits Your GPU (Complete Cheat Sheet) Pick the right Qwen 3.5 model for your hardware. Covers 0.8B through 397B with VRAM requirements, quant recommendations, and benchmarks for every GPU tier.
- Best Local Alternatives to Claude Code in 2026 Aider, Continue.dev, Cline, OpenCode, Void, and Tabby compared. Which open-source coding tools work best with local models on your own GPU?
- Best OpenClaw Alternatives: 7 Tools That Actually Work in 2026 Tested alternatives to OpenClaw for local AI agent workflows. Ranked by setup ease, model support, and what actually works without cloud dependencies.
- Best OpenClaw Tools and Extensions in 2026 Crabwalk visualizes agent actions in real time, Tokscale catches API bills before they hit $200+, and openclaw-docker locks down deployment. The best third-party tools ranked.
- Best Local LLMs for Mac in 2026 — M1, M2, M3, M4 Tested The best models to run on every Mac tier. Specific picks for 8GB M1 through 128GB M4 Max, with real tok/s numbers. MLX vs Ollama vs LM Studio compared.
- OpenClaw ClawHub Alert: 1,103 Malicious Skills Found OpenClaw ClawHub security alert: 1,103 malicious skills found across 14,706 audited. CVE-2026-28458 Browser Relay auth bypass. How to protect yourself now.
- OpenClaw Token Optimization: Cut Costs 97% Cut OpenClaw API costs by 97% with three proven fixes: route heartbeats through Ollama, add tiered model routing, and purge session history token bloat.
- Best Local Models for OpenClaw Agent Tasks Qwen 3.5 27B on 24GB VRAM is the sweet spot for local agents — SWE-bench 72.4, 262K context, tool calling fixed in Ollama v0.17.6+. Model picks by VRAM tier and the 'society of minds' setup power users run.
- Fastest Local LLM Setup: Ollama vs vLLM vs llama.cpp Real Benchmarks vLLM handles 4x the concurrent load of Ollama on identical hardware. But for single-user local use, Ollama is all you need. Benchmarks, memory usage, and a dead-simple decision framework. Updated for Ollama v0.17.7, vLLM v0.17.0, and llama.cpp with MCP support.
- Mac Runs 70B Models That Need Multi-GPU on PC — Here's How Your M4 Max loads models that cost $3,000 in GPUs on PC. M1 with 8GB handles 7B, M4 Pro with 48GB runs 32B, and 128GB loads 70B+. MLX vs Ollama speeds tested, plus Mac Mini as a 24/7 AI server.
- OpenClaw Setup Guide: Run a Local AI Agent Run `npx openclaw@latest`, scan a QR code for WhatsApp, and your AI agent is live. Gateway needs just 2-4GB RAM. Add Ollama for local models or connect Claude/GPT-4 via API.
- Best Local Coding Models Ranked: Every VRAM Tier, Every Benchmark (2026) The best local LLMs for coding in 2026, ranked by VRAM tier. Benchmarks, editor setup, and practical recommendations for developers replacing Copilot.
- Best VRAM Cheat Sheet for Local LLMs: Every Model, Every Quant Exact VRAM for Qwen 3.5, Llama, Mistral, and DeepSeek at Q3 through FP16. Lookup tables for 7B, 9B, 13B, 27B, 32B, 70B, and 120B models with real measurements and GPU recommendations. Updated March 2026.
- Ollama vs LM Studio: Which Should You Use for Local AI? Ollama gives you a CLI with 100+ models and an OpenAI-compatible API. LM Studio gives you a visual GUI with one-click downloads. Most power users run both—here's when to use each.
- Run Your First Local LLM in 15 Minutes Install Ollama, pull a model, and chat with AI offline—all in 15 minutes. Works on any Mac, Windows, or Linux machine with 8GB RAM. No accounts, no API keys, no fees.
- GPU Buying Guide for Local AI: Pick the Right Card The complete GPU buying guide for local AI. Covers RTX 3060 through 4090 with VRAM analysis, performance benchmarks, prices, and used vs new buying advice.
Getting Started (3)
- Ubuntu 26.04 Is Built for Local AI — What Actually Changes Ubuntu 26.04 LTS packages NVIDIA CUDA and AMD ROCm in official repos. No more external downloads or dependency nightmares. What's confirmed and what it means for local AI.
- Qwen 3.5 Locally — 27B vs 35B-A3B vs 122B, Which Model Fits Your GPU Qwen 3.5 27B dense vs 35B-A3B MoE vs 122B-A10B compared for local inference. VRAM tables, tok/s benchmarks on RTX 3090 and Mac, thinking mode setup, and which to pick for your hardware.
- LiquidAI LFM2: The First Hybrid Model Built for Your Hardware LFM2-24B-A2B runs at 112 tok/s on CPU with only 2.3B active params. Not a transformer. GGUF files from 13.5GB, Ollama and llama.cpp setup, and where it beats Qwen.
Hardware & GPUs (47)
- Run LLMs on Old Phones: A Practical Guide to Mobile AI Inference That old Pixel 6 or Galaxy S21 in your drawer can run a local LLM. Realistic tok/s by phone tier, Termux setup, app options, and an honest phone vs Raspberry Pi comparison.
- Intel Arc B580 for Local LLMs: 12GB VRAM at $250, With Caveats The Arc B580 gives you 12GB VRAM for $250, but Intel's AI software stack needs work. Real tok/s benchmarks, setup paths, and honest comparison with RTX 3060.
- Apple Neural Engine for LLM Inference: What Actually Works Apple Silicon has a dedicated Neural Engine that most LLM tools ignore. Here's what it can do for inference, what it can't, and whether ANE-based tools like ANEMLL are worth trying today.
- ROCm vs CUDA for Local AI in 2026: The Software Gap Nobody Talks About AMD GPUs have the bandwidth. They have the VRAM. They still lose by 2x on inference speed. Here's why, what actually works on ROCm 7.2, and whether RDNA 4 fixes anything.
- Apple M5 Pro and M5 Max: What 4x Faster LLM Processing Actually Means for Local AI M5 Pro hits 307GB/s, M5 Max doubles to 614GB/s. Neural Accelerators in every GPU core. 128GB runs 70B+ models on a laptop. What actually changes for local AI.
- RTX 5060 Ti Review for Local AI — The New Budget King Real benchmarks for the RTX 5060 Ti 16GB running local LLMs. Qwen 3.5 35B at 44 tok/s, 100K context for ~$430. Compared against RTX 3060, 3090, and 4060 Ti.
- What Can You Run on 8GB Apple Silicon? Local AI on a Budget Mac Llama 3.2 3B runs at 30 tok/s. Phi-4 Mini fits with room to spare. 7B models technically load but swap to disk. Honest benchmarks and real limits for 8GB M1/M2/M3/M4 Macs.
- Ubuntu 26.04 Is Built for Local AI — What Actually Changes Ubuntu 26.04 LTS packages NVIDIA CUDA and AMD ROCm in official repos. No more external downloads or dependency nightmares. What's confirmed and what it means for local AI.
- Mac Studio for Local AI: Is It Worth the Price? Mac Studio M4 Max (128GB) and M3 Ultra (up to 512GB) tested for local LLMs. Real tok/s numbers, cost comparison vs dual RTX 3090, and who should actually buy one.
- Used Server GPUs for Local AI: Tesla P40, V100, A100, and the eBay Goldmine A Tesla P40 has 24GB VRAM for $175. A V100 has 32GB for $350. Server GPUs offer insane VRAM per dollar for local AI — if you can handle the quirks. Full breakdown with prices, benchmarks, and cooling fixes.
- Intel Arc GPUs for Local AI: The Underdog Option That Actually Works The Arc A770 16GB gives you 16GB of VRAM for ~$250 used. Software support through IPEX-LLM and llama.cpp SYCL is real but rough. Honest benchmarks, what works, and what doesn't.
- Used Tesla P40 for Local AI: The $200 Budget Beast 24GB VRAM for $150-$200 on eBay. Pascal architecture, no display output, passive cooling. Full benchmarks, setup guide, and honest comparison to the RTX 3060 and 3090.
- RTX 5090 for Local AI: Worth the Upgrade? 32GB GDDR7, 1,792 GB/s bandwidth, 67% faster than 4090 — but $3,500+ street price. Full benchmarks, value analysis, and who should actually buy one.
- RTX 4090 vs Used RTX 3090 for Local AI: Which to Buy in 2026 Both have 24GB VRAM. One costs 2-3x more. RTX 4090 vs used RTX 3090 — real benchmarks, real prices, and who should buy which for local LLM inference and image generation.
- M4 Max and M3 Ultra for Local LLMs: Apple Silicon in 2026 No M4 Ultra exists. Apple's Mac Studio lineup offers the M4 Max (128GB, 546 GB/s) or the M3 Ultra (up to 512GB, 819 GB/s). Real benchmarks, pricing, and who should buy which for local AI.
- Best Mini PCs for Local AI Under $300 in 2026 A $200 refurbished ThinkCentre runs 7B models at 5-8 tok/s. A $350 AMD Ryzen box hits 10-15 tok/s. Specific picks, real benchmarks, and what's worth buying.
- Mac Mini M4 for Local AI: Which Config to Buy and What It Actually Runs Mac Mini M4 Pro 48GB runs Qwen3-32B at 15-22 tok/s, draws 40W under load, and costs $25/year in electricity. Which config to buy and what each runs.
- Running 70B Models Locally — Exact VRAM by Quantization Llama 3.3 70B needs 43GB at Q4, 75GB at Q8, 141GB at FP16. Here's every quant level, which GPUs fit, real speeds, and when 32B is the smarter choice.
- Free Local AI vs Paid Cloud APIs: Real Cost Comparison An RTX 3090 pays for itself in 2 weeks of moderate API usage. Full break-even math for local vs OpenAI, Anthropic, and Google APIs with current 2026 pricing.
- RTX 3060 vs 3060 Ti vs 3070 for Local AI The RTX 3060 has 12GB VRAM, the 3060 Ti and 3070 only have 8GB. For LLMs, the cheapest card wins — it runs 14B models the others can't fit. Speeds, prices, and when the 3070 still makes sense.
- Multi-GPU Setups for Local AI: Worth It? Dual RTX 3090s cost $1,600+ and need a 1,200W PSU — but a single 3090 at $800 runs every model under 32B. When two GPUs actually beat one bigger card, and when they don't.
- Razer AIKit Guide: Multi-GPU Local AI on Your Desktop Open-source Docker stack bundling vLLM, Ray, LlamaFactory, and Grafana into 1 container. Auto-detects GPUs, supports 280K+ HuggingFace models, and handles multi-GPU parallelism.
- Multi-GPU Local AI: Run Models Across Multiple GPUs Dual RTX 3090s give you 48GB VRAM and run 70B models at 16-21 tok/s—vs 1 tok/s with CPU offloading. Tensor vs pipeline parallelism, setup guides, and real scaling numbers.
- GB10 Boxes Compared: DGX Spark vs Dell vs ASUS vs MSI DGX Spark, Dell Pro Max, ASUS Ascent GX10, and MSI EdgeXpert compared with real benchmarks, 45-minute thermal tests, and pricing. Same chip, different chassis.
- Best Local LLMs for Mac in 2026 — M1, M2, M3, M4 Tested The best models to run on every Mac tier. Specific picks for 8GB M1 through 128GB M4 Max, with real tok/s numbers. MLX vs Ollama vs LM Studio compared.
- RTX 3090 vs 4070 Ti Super for Local LLMs Head-to-head comparison of the RTX 3090 and RTX 4070 Ti Super for running LLMs locally. Covers VRAM, speed, power, price, and which to buy for your use case.
- How Much Does It Cost to Run LLMs Locally? $200-800 for hardware, $5-15/month in electricity, and a 3-6 month breakeven vs ChatGPT Plus at $240/year. Full cost breakdown with real numbers.
- Best Used GPUs for Local AI: 2026 Buying Guide RTX 3090 at $700-850 for 24GB, RTX 3060 12GB at $170-220, RTX 3080 at $350-400. Tier rankings, fair prices, what to avoid (skip the 8GB 3070), and where to buy safely.
- Best GPU Under $500 for Local AI (2026 Picks) Find the best GPU under $500 for running local AI in 2026. RTX 4060 Ti 16GB, used RTX 3080, RTX 3060 12GB, and RX 7700 XT compared with real benchmarks.
- Best GPU Under $300 for Local AI (2026 Picks) Find the best GPU under $300 for local AI. We compare the RTX 3060 12GB, RX 7600, and Intel Arc B580 with VRAM analysis, LLM benchmarks, and real pricing.
- Mac Runs 70B Models That Need Multi-GPU on PC — Here's How Your M4 Max loads models that cost $3,000 in GPUs on PC. M1 with 8GB handles 7B, M4 Pro with 48GB runs 32B, and 128GB loads 70B+. MLX vs Ollama speeds tested, plus Mac Mini as a 24/7 AI server.
- Laptop vs Desktop for Local AI: Which Should You Buy? A $750 desktop RTX 3090 gives you 24GB VRAM. The same money in a gaming laptop gets 8GB. MacBooks break the rules with 48GB+ unified memory for 70B models.
- What Can You Actually Run on 4GB VRAM? 1B-3B models run at 18-55 tok/s. Qwen 2.5 3B at Q4 is the sweet spot for chat and simple coding. 7B models don't fit. What works on GTX 1050 Ti and 1650, and when to upgrade.
- What Can You Actually Run on 16GB VRAM? 13B-14B models hit 22-53 tok/s at Q4-Q6, Flux runs at FP8, and 20B models squeeze in with short context. Where 16GB beats 12GB, where it trails 24GB, and the best cards at this tier.
- Used GPU Buying Guide for Local AI: How to Buy Smart RTX 3060 12GB for ~$200, RTX 3090 24GB for ~$750—used GPUs offer 2-3x the VRAM per dollar vs new. Fair prices, scam red flags, and where to buy safely.
- Mac vs PC for Local AI: Which Should You Choose? RTX 3090 runs 7B-14B models 2-3x faster than M4 Pro. M4 Max with 128GB loads 70B models a PC can't touch. Real benchmarks, prices, and which platform fits your use case.
- What Can You Actually Run on 24GB VRAM? 32B models at 25-38 tok/s, 70B at Q3 with limited context, Flux at full FP16, and LoRA fine-tuning. RTX 3090 at $700 vs 4090 at $1,800—every model that fits and which GPU to buy.
- CPU-Only LLMs: What Actually Works What you can run with no GPU at all: best model picks, real speed benchmarks, and a budget dual Xeon server build for 70B models.
- What Can You Actually Run on 8GB VRAM? 7B-8B models hit 35-42 tok/s at Q4, SD 1.5 runs great, SDXL is tight but doable. Nothing above 13B fits. Every model that works on RTX 4060 and 3060 Ti, plus the best upgrade path.
- What Can You Actually Run on 12GB VRAM? 14B models at Q4 hit 25-32 tok/s, 7B-8B run at near-lossless Q6-Q8, and SDXL generates without workarounds. Every model that fits on an RTX 3060 12GB and the best upgrade path.
- Used RTX 3090 Buying Guide for Local AI 24GB VRAM for $650-750—half the cost of an RTX 4090 with the same capacity. Fair prices, eBay red flags, PSU requirements (850W minimum), and how to test before your return window closes.
- Used Optiplex + RTX 3060 = Local AI for Under $450 (Full Build) $100 used Optiplex, $180 RTX 3060 12GB, done. Runs 14B LLMs at 25 tokens/sec and Stable Diffusion out of the box. Complete parts list, where to buy cheap, assembly photos, and first benchmarks.
- NVIDIA GPU Prices Are Rising: What to Do Now GPU prices are spiking due to GDDR7 shortages and AI datacenter demand. Here's what's happening, which cards are affected, and strategies for local AI builders.
- Best VRAM Cheat Sheet for Local LLMs: Every Model, Every Quant Exact VRAM for Qwen 3.5, Llama, Mistral, and DeepSeek at Q3 through FP16. Lookup tables for 7B, 9B, 13B, 27B, 32B, 70B, and 120B models with real measurements and GPU recommendations. Updated March 2026.
- AMD vs NVIDIA for Local AI: Is ROCm Finally Ready? RX 7900 XTX delivers 85-95% of RTX 4090 performance with 24GB VRAM at $700-950. ROCm 6.x finally works on Linux. Honest benchmarks and the real compatibility gaps.
- RTX 5060 Ti 16GB Killed? Local AI Alternatives The RTX 5060 Ti 16GB faces production cuts from GDDR7 shortages. What's actually happening, and the best alternative GPUs for local AI in 2026.
- GPU Buying Guide for Local AI: Pick the Right Card The complete GPU buying guide for local AI. Covers RTX 3060 through 4090 with VRAM analysis, performance benchmarks, prices, and used vs new buying advice.
Mac & Apple Silicon (13)
- Apple M5 Pro and M5 Max: What 4x Faster LLM Processing Actually Means for Local AI M5 Pro hits 307GB/s, M5 Max doubles to 614GB/s. Neural Accelerators in every GPU core. 128GB runs 70B+ models on a laptop. What actually changes for local AI.
- OpenClaw on Mac: Setup, Optimization, and What Actually Works brew install openclaw-cli, connect Ollama, configure the gateway, and stop fighting macOS. Apple Silicon setup, memory math, launchd config, and the gotchas nobody warns you about.
- What Can You Run on 8GB Apple Silicon? Local AI on a Budget Mac Llama 3.2 3B runs at 30 tok/s. Phi-4 Mini fits with room to spare. 7B models technically load but swap to disk. Honest benchmarks and real limits for 8GB M1/M2/M3/M4 Macs.
- Stable Diffusion on Mac: Image Generation with MLX and Draw Things Draw Things generates SD 1.5 images in 8-15 seconds on an M2 Pro. ComfyUI takes 3x longer. MLX is fastest but code-only. Complete Mac image gen guide with speed tests.
- Ollama on Mac: Setup and Optimization Guide (2026) Install Ollama on Apple Silicon, verify Metal GPU is active, and tune it for your Mac's RAM. Config for M1 through M4 Ultra with model picks per memory tier.
- Ollama on Mac Not Working? Fix Metal, Memory Pressure, and Slow Performance ollama ps says CPU? Generation crawling at 2 tok/s? macOS killed your model mid-sentence? Every Mac-specific Ollama problem diagnosed and fixed with exact commands.
- Mac Studio for Local AI: Is It Worth the Price? Mac Studio M4 Max (128GB) and M3 Ultra (up to 512GB) tested for local LLMs. Real tok/s numbers, cost comparison vs dual RTX 3090, and who should actually buy one.
- LM Studio vs Ollama on Mac: Which Should You Use? LM Studio's MLX backend is 20-30% faster and uses half the memory. Ollama is lighter, always-on, and better for APIs. Mac-specific benchmarks and when to use each.
- Fine-Tuning on Mac: LoRA & QLoRA with MLX Fine-tune Llama, Qwen, and Mistral on Apple Silicon using mlx-lm. Real memory numbers, step-by-step commands, and how to deploy your model with Ollama.
- Best Way to Run Qwen 3.5 on Mac: MLX vs Ollama Speed Test MLX runs Qwen 3.5 up to 2x faster than Ollama on Apple Silicon. Head-to-head benchmarks on M1 through M4, with setup instructions for both (a minimal MLX sketch follows this list).
- Mac Mini M4 for Local AI: Which Config to Buy and What It Actually Runs Mac Mini M4 Pro 48GB runs Qwen3-32B at 15-22 tok/s, draws 40W under load, and costs $25/year in electricity. Which config to buy and what each runs.
- Best Local LLMs for Mac in 2026 — M1, M2, M3, M4 Tested The best models to run on every Mac tier. Specific picks for 8GB M1 through 128GB M4 Max, with real tok/s numbers. MLX vs Ollama vs LM Studio compared.
- Mac Runs 70B Models That Need Multi-GPU on PC — Here's How Your M4 Max loads models that cost $3,000 in GPUs on PC. M1 with 8GB handles 7B, M4 Pro with 48GB runs 32B, and 128GB loads 70B+. MLX vs Ollama speeds tested, plus Mac Mini as a 24/7 AI server.
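If you want to try the MLX side of the MLX vs Ollama comparison above, here is a minimal sketch using the mlx-lm Python package on Apple Silicon. The checkpoint name is illustrative; any MLX-converted model from the mlx-community Hugging Face org should load the same way, and generate() keywords can vary slightly between mlx-lm versions.

```python
# Minimal MLX text generation sketch, assuming `pip install mlx-lm` on Apple Silicon.
# The model ID below is illustrative -- swap in any MLX-converted checkpoint
# from the mlx-community org on Hugging Face.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen2.5-7B-Instruct-4bit")

prompt = "Explain the difference between dense and MoE models in two sentences."
# verbose=True prints tokens as they stream plus a tok/s readout at the end
text = generate(model, tokenizer, prompt=prompt, max_tokens=200, verbose=True)
print(text)
```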
Image Generation (9)
- Best Photorealism Checkpoints for Local Image Generation (2026) Juggernaut XL, RealVisXL, Realistic Vision, and Flux compared for photorealistic AI images. VRAM requirements, recommended settings, sample prompts, and installation for ComfyUI and A1111.
- Stable Diffusion on Mac: Image Generation with MLX and Draw Things Draw Things generates SD 1.5 images in 8-15 seconds on an M2 Pro. ComfyUI takes 3x longer. MLX is fastest but code-only. Complete Mac image gen guide with speed tests.
- SDXL vs SD 1.5 vs Flux: Which Image Model Should You Run Locally? SDXL vs SD 1.5 vs Flux compared by VRAM, speed, and quality. SD 1.5 needs 4GB, SDXL needs 8GB, Flux needs 12GB+. Benchmarks on real GPUs inside.
- LoRA Training on Consumer Hardware: Fine-Tune Models With 12GB VRAM QLoRA fine-tunes a 7B model on an RTX 3060 12GB in 2-4 hours. Full Unsloth and Axolotl recipes, VRAM tables, and the GGUF export pipeline.
- ControlNet Guide: Precise AI Image Control on Your GPU ControlNet guide for Stable Diffusion and Flux. Covers Canny, OpenPose, Depth preprocessors, VRAM needs, ComfyUI and A1111 setup, and practical workflows.
- AI Art Styles & Workflows: SD and Flux Guide Photorealism, anime, oil painting, concept art, and pixel art on 8GB+ VRAM. Model picks, LoRA stacking at 0.5-0.8 weight, and ComfyUI workflows for each style.
- ComfyUI Won — But A1111 Users Should Switch to Forge Neo Instead ComfyUI is faster, uses less VRAM, and gets new model support first. But the 2-3 week learning curve is real. If you're on A1111, Forge Neo gives you Flux support without starting over. Fooocus is dead. Speed tests and VRAM comparisons inside.
- Flux Locally: Complete Guide to Running Flux on Your Own GPU Flux needs 12GB VRAM with GGUF quantization or 24GB at full FP16. Generates images with readable text and correct hands in ~60 seconds. ComfyUI setup and optimization tips.
- Stable Diffusion Locally: Getting Started SD 1.5 runs on 4GB VRAM, SDXL needs 8GB, Flux needs 12GB+. Generate unlimited images for free in under 5 minutes with Fooocus or ComfyUI. Setup, models, and first image tips.
Models (31)
- Qwen 3.5 Small Models: The 9B Beats Last-Gen 30B — Here's What Matters for Local AI Alibaba's Qwen 3.5 drops 4 small models (0.8B to 9B) — all natively multimodal, 262K context, Apache 2.0. The 9B beats Qwen3-30B on reasoning and destroys GPT-5-Nano on vision. VRAM tables and what to run.
- Best 8GB GPU Model: How to Set Up Qwen 3.5 9B (Step by Step) Qwen 3.5 9B fits in 6.6GB and beats models 3x its size. Complete setup with Ollama, benchmarks, and real-world testing on RTX 3060 and 4060.
- DeepSeek V4: Everything We Know Before It Drops DeepSeek V4 launches next week with native image and video generation, 1M context, and rumored 1T MoE params with only 32B active. Here's what local AI builders need to know and how to prepare.
- Best Qwen 3.5 Models Ranked: Every Size, Every GPU, Every Quant Complete ranking of all Qwen 3.5 models from 0.8B to 397B. VRAM requirements, speed benchmarks, and which model to pick for your hardware.
- Qwen 3.5 Locally — 27B vs 35B-A3B vs 122B, Which Model Fits Your GPU Qwen 3.5 27B dense vs 35B-A3B MoE vs 122B-A10B compared for local inference. VRAM tables, tok/s benchmarks on RTX 3090 and Mac, thinking mode setup, and which to pick for your hardware.
- LiquidAI LFM2: The First Hybrid Model Built for Your Hardware LFM2-24B-A2B runs at 112 tok/s on CPU with only 2.3B active params. Not a transformer. GGUF files from 13.5GB, Ollama and llama.cpp setup, and where it beats Qwen.
- Best Way to Run Qwen 3.5 on Mac: MLX vs Ollama Speed Test MLX runs Qwen 3.5 up to 2x faster than Ollama on Apple Silicon. Head-to-head benchmarks on M1 through M4, with setup instructions for both.
- RWKV-7: Infinite Context, Zero KV Cache — The Local-First Architecture RWKV-7 uses O(1) memory per token. Context length doesn't increase VRAM. At all. 16 tok/s on a Raspberry Pi. Here's why it matters for local AI and how to run it.
- Distilled vs Frontier Models for Local AI — What You're Actually Getting That local model you love was probably trained on stolen outputs from Claude or GPT. Here's what distillation actually does to a model's reasoning, where it breaks, and why it matters most for agentic work.
- Best Qwen 3.5 Setup: Which Model Fits Your GPU (Complete Cheat Sheet) Pick the right Qwen 3.5 model for your hardware. Covers 0.8B through 397B with VRAM requirements, quant recommendations, and benchmarks for every GPU tier.
- nanollama: Train Your Own Llama 3 From Scratch on Custom Data Pretrain Llama 3 architecture models from raw text, export to GGUF, and run with llama.cpp. Forked from Karpathy's nanochat. 46M to 7B parameters.
- MoE Models Explained: Why Mixtral Uses 46B Parameters But Runs Like 13B Mixture of Experts explained for local AI — why MoE models run fast but still need full VRAM. Mixtral, DeepSeek V3, DBRX compared with dense model alternatives.
- Qwen vs Llama vs Mistral: Which Model Family Should You Build On? Qwen has 201 languages and a model for every task. Llama has the biggest community. Mistral pioneered efficient MoE. Decision framework for choosing your model family in 2026.
- Ouro-2.6B-Thinking: ByteDance's Looped Model That Punches Like an 8B Ouro-2.6B loops through the same transformer blocks 4 times to match 8B models at 2.6B parameters. Under 2GB at Q4. How the architecture works and why it matters.
- Mixtral VRAM Requirements: 8x7B and 8x22B at Every Quantization Level Mixtral 8x7B has 46.7B params but only 12.9B activate per token. You still need VRAM for all 46.7B. Exact VRAM for every quant from Q2 to FP16.
- Qwen3 Complete Guide: Every Model from 0.6B to 235B Qwen3 is the best open model family for budget local AI. Dense models from 0.6B to 32B, MoE models that punch above their weight, and a /think toggle no one else has.
- Llama 4 vs Qwen3 vs DeepSeek V3.2: Which to Run Locally in 2026 Llama 4 needs 55GB. DeepSeek V3.2 needs 350GB. Qwen3 runs on 8GB. Here's who wins at each VRAM tier and use case for local AI in 2026.
- Llama 4 Guide: Running Scout and Maverick Locally Complete Llama 4 guide for local AI — Scout (109B MoE, 17B active) and Maverick (400B). VRAM requirements, Ollama setup, benchmarks, and honest hardware reality check.
- GPT-OSS Guide: OpenAI's First Open Model for Local AI GPT-OSS 20B is OpenAI's first open-weight model. MoE with 3.6B active params, MXFP4 at 13GB, 128K context, Apache 2.0. Here's how to run it.
- DeepSeek V3.2 Guide: What Changed and How to Run It Locally DeepSeek V3.2 competes with GPT-5 on benchmarks. The full model needs 350GB+ VRAM. But the R1 distills run on a $200 used GPU — and they're shockingly good.
- CodeLlama vs DeepSeek Coder vs Qwen Coder: Best Local Coding Models Compared CodeLlama vs DeepSeek Coder vs Qwen Coder vs Codestral benchmarked: HumanEval scores, VRAM per quant, and speed tests. Qwen 7B beats CodeLlama 70B.
- Phi Models Guide: Microsoft's Small but Mighty LLMs Phi-4 14B scores 84.8% on MMLU — matching models 5x its size — and fits on a 12GB GPU at Q4. The full Phi lineup from 3.8B to 14B with VRAM needs, benchmarks, and honest weaknesses.
- Gemma Models Guide: Google's Lightweight Local LLMs Gemma 3 27B beats Gemini 1.5 Pro on benchmarks and runs on a single GPU. The 4B outperforms Gemma 2 27B. Full lineup from 1B to 27B with VRAM needs, speeds, and honest comparisons.
- Mixtral 8x7B & 8x22B VRAM Requirements Mixtral 8x7B and 8x22B VRAM requirements at every quantization level. Exact numbers from Q2 to FP16, GPU recommendations, and KV cache impact explained.
- Are Mistral Models Still Worth Running? Only Nemo 12B (Here's Why) Mistral led local AI in 2024. In 2026, Qwen 3 and Llama 3 have passed them on most benchmarks. The exception: Mistral Nemo 12B with 128K context still earns its slot. What's worth running, what's been replaced, and when to pick Mistral over the competition.
- Qwen Models Guide: The AI Family You're Missing Complete Qwen models guide covering Qwen 3.5, Qwen 3, Qwen 2.5 Coder, and Qwen-VL. VRAM requirements, Ollama setup, Gated DeltaNet architecture, and benchmarks vs Llama and DeepSeek.
- Llama 3 Guide: Every Size from 1B to 405B Complete Llama 3 guide covering every model from 1B to 405B. VRAM requirements, Ollama setup, benchmarks vs Qwen 3, and which size fits your hardware.
- DeepSeek Models Guide: R1, V3, and Coder Complete DeepSeek models guide covering R1, V3, and Coder locally. Which distilled R1 to pick for your GPU, VRAM requirements, and benchmarks vs Qwen 3.
- Model Formats Explained: GGUF vs GPTQ vs AWQ vs EXL2 GGUF vs GPTQ vs AWQ vs EXL2 model formats explained. Learn what each format does, which tools support them, and how to choose the right one for your GPU.
- Best Models Under 3B: Small LLMs That Work The best models under 3B parameters for laptops, old GPUs, Raspberry Pi, and phones. What works, what doesn't, and which tiny LLM to pick for your use case.
- Quantization Explained: What It Means for Local AI Q4_K_M shrinks a 7B model from 14GB to ~4GB while keeping 90-95% quality. What every quantization format means, how much VRAM each saves, and which to pick for your GPU (the size math is sketched after this list).
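As a quick sanity check on the quantization numbers quoted above, here is the back-of-the-envelope size math. The bits-per-weight figures are rough averages for GGUF quants, not exact file sizes.

```python
# Rough GGUF size math, assuming ~16 bits/weight at FP16 and ~4.85 bits/weight
# for Q4_K_M (an average across tensors; real files vary by a few percent).
def approx_gguf_gb(params_billion: float, bits_per_weight: float) -> float:
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9  # decimal GB, matching how model cards report sizes

for quant, bpw in [("FP16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.85)]:
    print(f"7B @ {quant}: ~{approx_gguf_gb(7.0, bpw):.1f} GB")

# 7B @ FP16 ~14 GB, @ Q4_K_M ~4.2 GB: the "14GB to ~4GB" shrink quoted above.
# Budget another 1-2 GB of VRAM on top for KV cache and runtime overhead.
```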
Software & Tools (22)
- Home Assistant + Local LLM: Voice Control Your Smart Home Without the Cloud Set up fully local voice control with Home Assistant, Ollama, Whisper, and Piper. No Alexa, no cloud, no subscriptions. Wyoming protocol pipeline, model picks, and hardware options.
- LM Studio vs llama.cpp: Why Your Model Runs Slower in the GUI LM Studio uses llama.cpp under the hood but often runs 30-50% slower. Bundled runtime lag, UI overhead, and default settings explain the gap. How to benchmark it yourself and when the convenience is worth it.
- Docker for Local AI: The Complete Setup Guide for Ollama, Open WebUI, and GPU Passthrough Run Ollama and Open WebUI in Docker with GPU passthrough. Five copy-paste compose files for NVIDIA, AMD, multi-GPU, and CPU-only setups, plus the Mac gotcha most guides skip.
- WSL2 + Ollama on Windows: Complete Setup Guide (GPU Passthrough Included) Install Ollama in WSL2 with full GPU acceleration in 20 minutes. GPU passthrough, Open WebUI, Docker Compose, VPN fixes, and the gotchas that will waste your afternoon.
- Ollama on Mac: Setup and Optimization Guide (2026) Install Ollama on Apple Silicon, verify Metal GPU is active, and tune it for your Mac's RAM. Config for M1 through M4 Ultra with model picks per memory tier.
- LM Studio vs Ollama on Mac: Which Should You Use? LM Studio's MLX backend is 20-30% faster and uses half the memory. Ollama is lighter, always-on, and better for APIs. Mac-specific benchmarks and when to use each.
- Local LLMs vs ChatGPT: An Honest Comparison ChatGPT has web search, voice mode, and GPT-5.2. Local LLMs have privacy, no subscriptions, and no rate limits. Here's when each one wins, what the cost math actually looks like, and why most power users run both.
- WSL2 for Local AI: The Complete Windows Setup Guide Install WSL2, configure GPU passthrough, set up Ollama and llama.cpp with CUDA, and optimize memory for LLM inference. Step-by-step for Windows 11.
- Best New Ollama 0.17 Features: ollama launch, MLX, and OpenClaw Support Everything new in Ollama 0.16 through 0.17.7: ollama launch for coding tools, native MLX on Apple Silicon, OpenClaw integration, web search API, and image generation. Updated March 2026.
- llama.cpp Just Got a New Home: What the Hugging Face Acquisition Means for Local AI ggml.ai — the team behind llama.cpp — is joining Hugging Face. Open source stays open, Georgi keeps the wheel. What changed, what didn't, and what to watch.
- Run Qwen2.5-VL Vision in LM Studio (Setup) Get Qwen2.5-VL running in LM Studio in 5 minutes. Covers the mmproj file most people miss, correct download links, and how to analyze images and PDFs locally.
- How to Update Models in Ollama — Keep Your Local LLMs Current Ollama doesn't auto-update models. Run ollama pull model:tag to grab the latest version — only changed layers download. Use ollama show to check what you have, and a simple loop to update everything at once (sketched after this list).
- Best LLM Speed Trick: ExLlamaV2 vs llama.cpp Benchmarks (50-85% Faster) Head-to-head speed benchmarks on RTX 3090 and 4090. ExLlamaV2 generates tokens 50-85% faster than llama.cpp on NVIDIA GPUs. Full comparison with setup guides for both.
- Managing Multiple Models in Ollama: Disk Space, Switching, and Organization Five 7B models eat 20GB before you notice. Check what's using space with ollama list, clean up with ollama rm, and set OLLAMA_KEEP_ALIVE to control memory. A practical cleanup and organization guide.
- AnythingLLM Setup Guide: Chat With Your Documents Locally Upload PDFs, paste URLs, and chat with your files — no coding, no cloud. AnythingLLM (54K+ GitHub stars) connects to Ollama in 5 minutes with point-and-click RAG.
- Text Generation WebUI Setup Guide (2026) Install Oobabooga text-generation-webui, load GGUF/GPTQ/EXL2 models, and configure GPU offloading. Covers the settings most guides skip and common error fixes.
- Local LLMs vs Claude: When Each Actually Wins Qwen 3 32B matches Claude on daily tasks at zero marginal cost. Claude still wins on 200K-token documents and multi-step debugging. Benchmarks, pricing, and when to use each.
- Fastest Local LLM Setup: Ollama vs vLLM vs llama.cpp Real Benchmarks vLLM handles 4x the concurrent load of Ollama on identical hardware. But for single-user local use, Ollama is all you need. Benchmarks, memory usage, and a dead-simple decision framework. Updated for Ollama v0.17.7, vLLM v0.17.0, and llama.cpp with MCP support.
- Open WebUI Setup Guide: ChatGPT UI for Local AI One Docker command gives you a ChatGPT-like interface for any Ollama model. 120K+ GitHub stars, built-in RAG, voice chat, and multi-model switching—all running locally.
- LM Studio Tips & Tricks: Hidden Features Speculative decoding for 20-50% faster output, MLX that's 21-87% faster on Mac, a built-in OpenAI-compatible API, and the GPU offload settings most users miss.
- Run Your First Local LLM in 15 Minutes Install Ollama, pull a model, and chat with AI offline—all in 15 minutes. Works on any Mac, Windows, or Linux machine with 8GB RAM. No accounts, no API keys, no fees.
- Ollama vs LM Studio: Which Should You Use for Local AI? Ollama gives you a CLI with 100+ models and an OpenAI-compatible API. LM Studio gives you a visual GUI with one-click downloads. Most power users run both—here's when to use each.
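The update-everything loop mentioned in the Ollama model-update guide above looks roughly like this. It leans on Ollama's documented local REST endpoints (/api/tags and /api/pull); the batching logic itself is just one way to do it, and `ollama pull <name>` in a shell loop works the same way.

```python
# Re-pull every installed Ollama model so each one is on its latest layers.
# A minimal sketch against Ollama's local REST API (default port 11434).
import requests

OLLAMA = "http://localhost:11434"

installed = requests.get(f"{OLLAMA}/api/tags").json().get("models", [])
for m in installed:
    name = m["name"]  # e.g. "qwen2.5:7b-instruct-q4_K_M"
    print(f"updating {name} ...")
    # stream=False waits for the pull to finish; only changed layers download
    requests.post(f"{OLLAMA}/api/pull", json={"model": name, "stream": False}, timeout=3600)
```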
AI Agents & OpenClaw (40)
- OpenClaw Trading Scams: How to Spot AI Agent Grifts Before They Cost You AI agent trading scams target technical users who know agent pipelines are buildable. Here's the playbook they use, the math that breaks their claims, and how to protect yourself.
- How to Run Karpathy's Autoresearch on Your Local GPU Set up Karpathy's autoresearch on your GPU to run 100+ ML experiments overnight. Works on RTX 3090/4090 as-is, scales down to 6GB cards with tweaks.
- Best Ways to Connect Local AI to Notion in 2026 4 real ways to connect Notion to a local LLM without sending data to the cloud. MCP servers, RAG pipelines, Open WebUI, and n8n workflows compared with setup steps.
- OpenClaw vs Cursor: Local AI Agent or Cloud IDE? OpenClaw is free, private, and runs your own models. Cursor is polished, fast, and cloud-powered. A developer's comparison: cost, privacy, model flexibility, offline use, and where each one wins.
- OpenClaw on Raspberry Pi: What Actually Works (and What Doesn't) Pi 5 with 8GB RAM runs OpenClaw as a gateway with cloud APIs. Local LLMs hit 2-7 tok/s on 1.5B-3B models. Step-by-step setup for llama.cpp, Ollama, and OpenClaw on ARM64.
- OpenClaw Model Combinations: What to Pair for Each Task Stop running one model for everything in OpenClaw. Pair Qwen 2.5 Coder 32B for autocomplete, Qwen 3.5 27B for planning, and Qwen3-Coder-Next for agentic coding. Combos by VRAM tier.
- Run Your Coding Agent on Local Models with PI Agent + Ollama PI Agent is a free, open-source coding agent that works with any model. Set up PI + Ollama to run a private coding agent on Qwen 3.5 or Qwen3-Coder-Next with zero API costs.
- OpenClaw Security Report: February 2026 — ClawHub Malware, Google Suspensions, and Critical Fixes 17 security fixes, 341 malicious ClawHub skills, Google banning users, and the creator leaving. Every OpenClaw security event from February 2026.
- Claude Code vs PI Agent — Which Coding Agent for Local AI? Claude Code vs PI Agent compared for local AI development. System prompts, tools, pricing, local model support, and honest verdicts for every type of developer.
- OpenClaw Security Hardening — Every Fix in February 2026 SSRF bypass, sandbox escapes, unauthorized disk writes, session hijacking. Every security fix OpenClaw shipped in February 2026, explained in plain English.
- OpenClaw on Mac: Setup, Optimization, and What Actually Works brew install openclaw-cli, connect Ollama, configure the gateway, and stop fighting macOS. Apple Silicon setup, memory math, launchd config, and the gotchas nobody warns you about.
- OpenClaw After Steinberger — What the OpenAI Move Means for Your Setup Peter Steinberger joined OpenAI. Three releases shipped since. Elon Musk posted a meme. Baby Keem is debugging agents. Here's what actually matters for your OpenClaw setup.
- LightClaw: A 7,000-Line Python Alternative to OpenClaw OpenClaw is 40,000+ lines of TypeScript. LightClaw does Telegram AI assistant, 6 LLM providers, memory, skills, and agent delegation in ~7,000 lines of Python. One week old, 12 stars, one developer. Here's what it can and can't do.
- Building AI Agents with Local LLMs: A Practical Guide Build AI agents with local LLMs using Ollama and Python. Model requirements, VRAM budgets, framework comparison, working code example, and security warnings.
- Best Local Alternatives to Claude Code in 2026 Aider, Continue.dev, Cline, OpenCode, Void, and Tabby compared. Which open-source coding tools work best with local models on your own GPU?
- The Web Is Forking: What the Agentic Web Means for Local AI Builders Coinbase, Stripe, Cloudflare, Google, OpenAI, and Visa are building a parallel web for AI agents. Money, search, content, execution — all redesigned for software clients. What local AI builders should do now.
- SmarterRouter: A VRAM-Aware LLM Gateway for Your Local AI Lab Intelligent router that profiles your models, manages VRAM, caches responses semantically, and auto-picks the best model per prompt. Works with Ollama and llama.cpp.
- LocalAgent: A Local-First Agent Runtime That Actually Cares About Safety Rust CLI for AI agents with deny-by-default permissions, approval workflows, and deterministic replay. Works with LM Studio, Ollama, and llama.cpp.
- The 5 Levels of AI Coding: Where Are You, and Where Is This Going? A 3-person team ships production Rust with zero human code. Most devs using AI get 19% slower. The gap between these facts is where software development lives now.
- OpenClaw's Creator Just Joined OpenAI — Here's What It Means for Local AI Agents Peter Steinberger built the fastest-growing open-source project ever. Now he's at OpenAI. OpenClaw stays open. Here's what changes for local AI builders.
- Running OpenClaw 100% Local — Zero API Costs Configure OpenClaw to run entirely through Ollama with no API keys, no cloud calls, and no monthly bills. Full setup guide with model picks by VRAM tier.
- OpenClaw Tool Call Failures: Why Models Break and How to Fix Them Your OpenClaw agent silently fails, loops forever, or corrupts its session. Here's why tool calls break and how to fix each failure mode.
- OpenClaw Model Routing: Cheap Models for Simple Tasks, Smart Models When Needed Stop paying Opus prices for heartbeats. Set up tiered model routing in openclaw.json so cheap models handle 80% of work and frontier models only fire when needed.
- OpenClaw Memory Problems: Context Rot and the Forgetting Fix Your OpenClaw agent forgets instructions, repeats questions, and contradicts itself in long sessions. Here's how its memory works and how to fix it.
- Best Hardware for Running OpenClaw — Mac Mini vs VPS vs Your Old PC OpenClaw runs 24/7. A Mac Mini M4 draws 4 watts idle. A free Oracle VPS costs nothing. A used ThinkCentre costs $85. Here's which one to pick.
- Function Calling with Local LLMs: Tools, Agents, and Structured Output Function calling with local LLMs using Ollama and llama.cpp. Qwen 2.5 7B matches GPT-4 accuracy for tool selection. Working code and agentic loop patterns.
- Structured Output from Local LLMs: JSON, YAML, and Schemas Ollama's format parameter guarantees valid JSON from any local model. Grammar constraints in llama.cpp go further — 100% schema compliance at the token level. Methods ranked by reliability, with working code examples (a minimal format-parameter sketch follows this list).
- ClawHub Malware Alert: Top Skills Infected The #1 ClawHub skill was malware. How it stole API keys, Cisco's new scanner tool, the MoldBot data leak, and 7 things to do right now to protect yourself.
- Best OpenClaw Tools and Extensions in 2026 Crabwalk visualizes agent actions in real time, Tokscale catches API bills before they hit $200+, and openclaw-docker locks down deployment. The best third-party tools ranked.
- Best OpenClaw Alternatives: 7 Tools That Actually Work in 2026 Tested alternatives to OpenClaw for local AI agent workflows. Ranked by setup ease, model support, and what actually works without cloud dependencies.
- Slash Your AI Costs With a Token Audit Your AI API bill is higher than it needs to be. A 15-minute token audit finds the waste — system prompts, ballooning history, hidden tool tokens. Here's the exact process.
- OpenClaw Token Optimization: Cut Costs 97% Cut OpenClaw API costs by 97% with three proven fixes: route heartbeats through Ollama, add tiered model routing, and purge session history token bloat.
- OpenClaw Plugins & Skills Marketplace: Complete Guide Every OpenClaw skill worth installing, how to avoid malicious plugins on ClawHub, and how to build your own. 1,103 of 14,706 skills are malicious.
- OpenClaw ClawHub Alert: 1,103 Malicious Skills Found OpenClaw ClawHub security alert: 1,103 malicious skills found across 14,706 audited. CVE-2026-28458 Browser Relay auth bypass. How to protect yourself now.
- How OpenClaw Actually Works: Architecture Guide 5 input types explain the 'alive' behavior: messages, heartbeats, crons, hooks, and webhooks feed a single agent loop. The 3am phone call was just a timer event.
- OpenClaw vs Commercial AI Agents: Which Should You Use? OpenClaw costs $0 plus API fees. Lindy runs $49-299/month but has 7,000+ integrations and SOC 2 compliance. Privacy, reliability, and customization compared honestly.
- Best Local Models for OpenClaw Agent Tasks Qwen 3.5 27B on 24GB VRAM is the sweet spot for local agents — SWE-bench 72.4, 262K context, tool calling fixed in Ollama v0.17.6+. Model picks by VRAM tier and the 'society of minds' setup power users run.
- OpenClaw Setup Guide: Run a Local AI Agent Run `npx openclaw@latest`, scan a QR code for WhatsApp, and your AI agent is live. Gateway needs just 2-4GB RAM. Add Ollama for local models or connect Claude/GPT-4 via API.
- OpenClaw Security Guide: Risks and Hardening 42,000+ exposed instances, Google suspending accounts that connected via OAuth, 26% of ClawHub skills with vulnerabilities. Real risks, prompt injection demos, and step-by-step hardening for OpenClaw.
- OpenClaw Security Report: January 2026 Three high-severity CVEs, a supply chain attack on ClawHub, and 21,000+ exposed instances. Every OpenClaw security event from January 2026 with sources.
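The format-parameter approach from the structured output guide above looks roughly like this. The endpoint and the format field are Ollama's documented API; the model tag and the schema are placeholders to swap for your own.

```python
# Ask a local model for JSON that must match a schema, using Ollama's `format`
# parameter (pass "json" for free-form JSON, or a JSON Schema object for stricter
# output on recent Ollama versions). Model name and schema are placeholders.
import json
import requests

schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["title", "tags"],
}

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwen2.5:7b-instruct",
        "messages": [{"role": "user", "content": "Summarize RWKV-7 as a title plus tags."}],
        "format": schema,  # or simply "json"
        "stream": False,
    },
)
# The assistant message content is a JSON string that parses cleanly
print(json.loads(resp.json()["message"]["content"]))
```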
Use Cases (36)
- Running OpenClaw on 4GB, 6GB, and 8GB GPUs: What Actually Works OpenClaw on low VRAM GPUs: 4GB is rough, 6GB is marginal, 8GB is where it starts working. Model picks, quantization tricks, partial offload, and when to just use a cloud API instead.
- Local AI for Therapists: Session Notes, Treatment Plans, and Client Privacy Without the Cloud Run AI on your own hardware to draft session notes, treatment plans, and clinical letters without sending client data to OpenAI. HIPAA-friendly setup for therapists.
- Local AI for Small Business: Email, Invoicing, and Customer Support Without Monthly Subscriptions A 5-person team spends $1,500-3,000/year on AI subscriptions. A $600 mini PC running Ollama replaces all of them. Here's the setup, the workflows, and the math.
- Best Photorealism Checkpoints for Local Image Generation (2026) Juggernaut XL, RealVisXL, Realistic Vision, and Flux compared for photorealistic AI images. VRAM requirements, recommended settings, sample prompts, and installation for ComfyUI and A1111.
- Best Anime and Stylized Checkpoints for Local Image Generation (2026) Illustrious XL, NoobAI-XL, Animagine, Pony Diffusion, and SD 1.5 anime models compared. VRAM requirements, Danbooru prompting, LoRA picks, and settings for ComfyUI and A1111.
- Fine-Tuning on Mac: LoRA & QLoRA with MLX Fine-tune Llama, Qwen, and Mistral on Apple Silicon using mlx-lm. Real memory numbers, step-by-step commands, and how to deploy your model with Ollama.
- Local AI for Lawyers: Confidential Document Analysis Without Cloud Risk A federal judge ordered OpenAI to hand over 20 million chat logs. If you're a lawyer using ChatGPT for client work, that's an ethics problem. Local AI keeps everything on your hardware.
- AI Tool Sprawl: You're Running 6 AI Tools and None of Them Talk to Each Other Ollama for local chat, LM Studio for testing, ChatGPT for the hard stuff, Claude for writing, Copilot in your editor, Open WebUI as a frontend. Six tools, zero integration. Here's how to consolidate without losing capability.
- Obsidian + Local LLM: Build a Private AI Second Brain Connect Obsidian to a local LLM via Ollama for private AI-powered note search, summaries, and chat. Step-by-step setup with Copilot and Smart Connections.
- Crane + Qwen3-TTS: Run Voice Cloning Locally with Rust Clone any voice with 3 seconds of audio using Qwen3-TTS through Crane's pure Rust inference engine. ~4GB VRAM, faster than real-time, Apache 2.0.
- PaddleOCR-VL: A 0.9B OCR Model That Runs on Any Potato PaddleOCR-VL does document OCR — text, tables, formulas, charts — in 0.9B parameters. 109 languages. Now runs via llama.cpp and Ollama. Private, local, nearly free.
- SDXL vs SD 1.5 vs Flux: Which Image Model Should You Run Locally? SDXL vs SD 1.5 vs Flux compared by VRAM, speed, and quality. SD 1.5 needs 4GB, SDXL needs 8GB, Flux needs 12GB+. Benchmarks on real GPUs inside.
- LoRA Training on Consumer Hardware: Fine-Tune Models With 12GB VRAM QLoRA fine-tunes a 7B model on an RTX 3060 12GB in 2-4 hours. Full Unsloth and Axolotl recipes, VRAM tables, and the GGUF export pipeline.
- Building a Local AI Assistant: Your Private Jarvis Build a private AI assistant with Ollama, Open WebUI, Whisper, and Kokoro TTS. Voice chat, document Q&A, home automation — all local, no cloud, no subscriptions.
- Local AI for Privacy: What's Actually Private Running AI locally keeps prompts off corporate servers — but model downloads, telemetry, and VS Code extensions can still leak data. Here's what's genuinely private, what isn't, and how to close every gap.
- Best Uncensored Local LLMs (And Why You Might Want Them) Dolphin 3.0, abliterated Llama 3.3, uncensored Qwen — the best unrestricted local models for fiction, research, and creative work. What uncensored actually means, which models to run, and the quality tradeoffs.
- Best Local LLMs for Summarization Qwen 2.5 14B is the summarization sweet spot — strong instruction following, 128K context for 200-page docs, fits on 16GB VRAM. Model picks by use case, quality ratings, chunking strategies, and prompting tips.
- Running AI Offline: Complete Guide to Air-Gapped Local LLMs Ollama works fully offline after one download. Pull models, disconnect the network, and your AI keeps running — no accounts, no APIs, no internet. Setup steps, offline RAG, and portable laptop kits.
- Embedding Models for RAG: Which to Run Locally nomic-embed-text is still the default for most local RAG setups — 274MB, 8K context, runs on CPU. But Qwen3-Embedding 0.6B just changed the game. Model picks, VRAM needs, speed numbers, and the chunking mistakes that break retrieval.
- Best Local LLMs for Translation: What Actually Works NLLB handles 200 languages on 3GB VRAM. Qwen 2.5 matches DeepL for European pairs. Opus-MT runs at 300MB per direction. Which local translation model fits your hardware and language needs.
- Best Local LLMs for Data Analysis (2026) Which local models write the best pandas and SQL code on your own hardware. Tested Qwen 2.5 Coder, DeepSeek, and Llama on real datasets with accuracy scores.
- ControlNet Guide: Precise AI Image Control on Your GPU ControlNet guide for Stable Diffusion and Flux. Covers Canny, OpenPose, Depth preprocessors, VRAM needs, ComfyUI and A1111 setup, and practical workflows.
- Best Vision Models You Can Run Locally: Every Model, Every GPU Tier (2026) Qwen3-VL 8B replaced Qwen2.5-VL as the best local vision model. Full VRAM table, Ollama commands, speed benchmarks, and setup for every GPU from 4GB to 48GB+. Updated March 2026.
- Best Local LLMs for RAG in 2026 The best local models for retrieval-augmented generation by VRAM tier. Qwen 3, Command R 35B, embedding models, and RAG stacks with real failure modes.
- Local AI Video Generation: What Works in 2026 Wan 2.2 leads on quality, LTX-Video renders 5-second clips in 4 seconds, and 12GB VRAM is the minimum. Speed benchmarks, VRAM charts, and setup for 7 models on consumer GPUs.
- AI Art Styles & Workflows: SD and Flux Guide Photorealism, anime, oil painting, concept art, and pixel art on 8GB+ VRAM. Model picks, LoRA stacking at 0.5-0.8 weight, and ComfyUI workflows for each style.
- Fine-Tuning LLMs on Consumer Hardware: LoRA and QLoRA Guide Fine-tune a 7B model on 6-10GB VRAM with QLoRA and Unsloth (2-5x faster, 70% less memory). Only 200-500 examples needed. Dataset prep through training on RTX 3060-4090.
- ComfyUI Won — But A1111 Users Should Switch to Forge Neo Instead ComfyUI is faster, uses less VRAM, and gets new model support first. But the 2-3 week learning curve is real. If you're on A1111, Forge Neo gives you Flux support without starting over. Fooocus is dead. Speed tests and VRAM comparisons inside.
- Best Local LLMs for Math & Reasoning: What Actually Works The best local LLMs for math and reasoning tasks, ranked by VRAM tier. AIME and MATH benchmarks for DeepSeek R1, Qwen 3 thinking, and Phi-4-reasoning.
- Talk to Your Local LLM: Voice Chat Setup Under 1 second response time with Whisper + Kokoro TTS + your local model. Full setup guide for Open WebUI voice chat and standalone options. Needs 2-4GB VRAM.
- Flux Locally: Complete Guide to Running Flux on Your Own GPU Flux needs 12GB VRAM with GGUF quantization or 24GB at full FP16. Generates images with readable text and correct hands in ~60 seconds. ComfyUI setup and optimization tips.
- Best Local LLMs for Chat & Conversation The best local LLMs for chat and conversation in 2026. Picks for every VRAM tier from 8GB to 24GB, with Ollama commands to start chatting immediately.
- Local RAG: Search Your Documents with a Private AI Search your private PDFs, notes, and code with a local LLM—no cloud, no API calls. 3 setup methods from zero-config Open WebUI to 30 lines of Python with ChromaDB (a stripped-down sketch follows this list).
- Best Local LLMs for Writing & Creative Work Qwen 2.5 32B on 24GB VRAM is the sweet spot for fiction and long-form. On 8GB, Nous Hermes 3 8B punches above its weight. Model picks for every tier and writing task.
- Stable Diffusion Locally: Getting Started SD 1.5 runs on 4GB VRAM, SDXL needs 8GB, Flux needs 12GB+. Generate unlimited images for free in under 5 minutes with Fooocus or ComfyUI. Setup, models, and first image tips.
- Best Local Coding Models Ranked: Every VRAM Tier, Every Benchmark (2026) The best local LLMs for coding in 2026, ranked by VRAM tier. Benchmarks, editor setup, and practical recommendations for developers replacing Copilot.
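For a feel of what the "local RAG in 30 lines of Python" guide above is describing, here is a stripped-down sketch with ChromaDB and Ollama. The note text and model tag are placeholders, and real setups add chunking, metadata, and citations.

```python
# Tiny local RAG loop: embed a few notes with ChromaDB's default embedder,
# retrieve the closest ones, and hand them to a local model via Ollama.
# A sketch, not a full pipeline -- chunking, PDF parsing, and citations omitted.
import chromadb
import requests

notes = [
    "The RTX 3060 12GB runs 14B models at Q4 around 25-32 tok/s.",
    "SDXL needs roughly 8GB of VRAM; Flux wants 12GB or more.",
]

client = chromadb.Client()
col = client.create_collection("notes")
col.add(documents=notes, ids=[f"note-{i}" for i in range(len(notes))])

question = "How much VRAM does SDXL need?"
hits = col.query(query_texts=[question], n_results=2)["documents"][0]

prompt = "Answer using only this context:\n" + "\n".join(hits) + f"\n\nQuestion: {question}"
answer = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "qwen2.5:7b-instruct", "prompt": prompt, "stream": False},
).json()["response"]
print(answer)
```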
Architecture & Theory (10)
- Model Routing for Local AI — Stop Using One Model for Everything You're running one model for every task. That wastes VRAM, burns electricity, and gives worse results. Model routing sends each task to the right model at the right cost. Here's how to set it up.
- Speculative Decoding: Free 20-50% Speed Boost for Local LLMs Speculative decoding uses a small draft model to predict tokens verified by the big model. Same output, 20-50% faster. Setup guide for LM Studio and llama.cpp.
- KV Cache: Why Context Length Eats Your VRAM (And How to Fix It) The KV cache is why your 8B model OOMs at 32K context. Full formula, worked examples for popular models (one is sketched after this list), and 6 optimization techniques to cut KV VRAM usage.
- Why Your AI Keeps Lying: The Hallucination Feedback Loop How one bad memory poisoned our entire RAG pipeline — and the immune system we built to fix it. Real code from mycoSwarm's self-correcting retrieval system.
- The AI Memory Wall: Why Your Chatbot Forgets Everything Six architectural reasons ChatGPT, Claude, and Gemini forget your conversations — and how local AI setups solve the memory problem with persistent storage and RAG.
- Session-as-RAG: Teaching Your Local AI to Actually Remember Build persistent conversation memory for local LLMs. Chunk sessions, embed in ChromaDB, retrieve relevant past exchanges at query time. Full Python implementation with topic splitting and date citations.
- Beyond Transformers: 5 Architectures for Your $50 Mini PC We benchmarked RWKV-7 vs gemma3 on a $50 mini PC. The transformer crashed at turn 6. Here are 5 alternative architectures that run better on budget hardware.
- Building a Distributed AI Swarm for Under $1,100 A complete bill of materials for a three-node distributed AI cluster: RTX 3090 workstation, ThinkCentre M710Q for light inference, Raspberry Pi 5 coordinator. Every part sourced used or cheap, total cost under $1,100.
- mycoSwarm vs Exo vs Petals vs Nanobot: What's Actually Different Exo distributes inference across Macs. Petals shares GPUs with strangers. Nanobot routes your queries to Chinese clouds without asking. The real question: who controls where your prompts go?
- Context Length Explained: Why It Eats Your VRAM What context length actually means for local LLMs, how it affects VRAM usage, practical limits for different hardware, and when you actually need 128K+ tokens.
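The KV cache math referenced above is worth seeing worked out once. This sketch uses the standard formula with a Llama-3-8B-style configuration (32 layers, 8 KV heads via GQA, head dimension 128) and an FP16 cache; other models swap in their own config values.

```python
# KV cache size = 2 (K and V) x layers x kv_heads x head_dim x context x bytes/element.
# Worked for a Llama-3-8B-style config with an FP16 (2 bytes/element) cache.
def kv_cache_gib(layers, kv_heads, head_dim, context, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 2**30

for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens -> {kv_cache_gib(32, 8, 128, ctx):.1f} GiB of KV cache")

# 8K ~1 GiB, 32K ~4 GiB, 128K ~16 GiB -- on top of roughly 5 GB of Q4 weights,
# which is why an 8B model can OOM an 8GB card long before the weights do.
```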
Troubleshooting (18)
- LLM Running Slow? Two Different Problems, Two Different Fixes Slow local LLM? Separate time-to-first-token from generation speed. Fix prompt processing with batch size and Flash Attention. Fix tok/s with GPU layers, quantization, and context length.
- Why Your Local LLM Is Slow: The num_ctx VRAM Overflow Nobody Warns You About DeepSeek-R1 14B went from 35 tok/s to 4.8 tok/s on the same GPU. The fix was one parameter. How num_ctx silently overflows VRAM and kills inference speed (the one-parameter fix is sketched after this list).
- Qwen2.5-VL Not Loading in LM Studio? Fix mmproj and Vision Errors Fix every Qwen2.5-VL error in LM Studio: missing mmproj, 'model type not supported', no eye icon, vision crashes. Exact fixes with file paths.
- Open WebUI Not Connecting to Ollama? Every Fix Docker networking, wrong OLLAMA_BASE_URL, localhost confusion, WSL2 isolation, missing models, random disconnects. Every Open WebUI + Ollama connection problem with the exact fix.
- Ollama on Mac Not Working? Fix Metal, Memory Pressure, and Slow Performance ollama ps says CPU? Generation crawling at 2 tok/s? macOS killed your model mid-sentence? Every Mac-specific Ollama problem diagnosed and fixed with exact commands.
- The 8GB VRAM Trap: What 'Runs on 8GB' Actually Means Every local AI tutorial says 'runs on 8GB!' — and technically it does. What they don't tell you about quantization cliffs, tiny context windows, and why a $275 used GPU changes everything.
- Why Is My Local LLM So Slow? A Diagnostic Guide Local LLM running slow? Check GPU vs CPU inference, VRAM offloading, quantization, context length, backend choice, and thermals. Find your fix in 60 seconds.
- ROCm Not Detecting GPU: AMD Troubleshooting Guide AMD GPU not detected in ROCm? Check supported GPUs, fix rocminfo errors, HSA_OVERRIDE hack for unsupported cards, and Ollama/llama.cpp ROCm build fixes.
- Ollama Not Using GPU: Complete Fix Guide Ollama running on CPU instead of GPU? Diagnose with ollama ps and nvidia-smi, then fix CUDA drivers, ROCm setup, VRAM limits, and Docker GPU passthrough.
- Ollama API Connection Refused: Quick Fixes Ollama API returning connection refused? Check if it's running, fix the port, open it to the network, and solve Docker and WSL2 connectivity issues.
- Model Outputs Garbage: Debugging Bad Generations Local LLM outputs repetitive loops, gibberish, or wrong answers? Seven causes with exact fixes — from corrupted downloads to wrong chat templates.
- Memory Leak in Long Conversations: Causes and Fixes VRAM climbs with every message until your model crashes? It's probably KV cache growth, not a leak. How to diagnose, monitor, and fix memory issues in local LLMs.
- llama.cpp Build Errors: Common Fixes for Every Platform llama.cpp won't build? CMake too old, CUDA not found, Metal not enabling, Visual Studio missing. Exact error messages and one-liner fixes for every platform.
- GGUF File Won't Load: Format and Compatibility Fixes GGUF model won't load? Version mismatch, corrupted download, wrong format, split files, or memory issues. Find your error and fix it in under a minute.
- CUDA Out of Memory: What It Means and How to Fix It CUDA out of memory means your model doesn't fit in VRAM. Seven fixes ranked by effort — context length, KV cache quantization, model quant, CPU offload — with tool-specific commands for Ollama, llama.cpp, and LM Studio.
- Context Length Exceeded: What To Do When Your Model Runs Out of Space Model forgetting earlier messages or throwing context errors? How context length works, what happens when it fills, and practical fixes for chat, RAG, and coding.
- Local AI Troubleshooting Guide: Every Common Problem and Fix Model running 30x slower than expected? Probably on CPU instead of GPU. Fixes for won't-load errors, CUDA crashes, garbled output, and OOM across Ollama and LM Studio.
- Ollama Troubleshooting Guide: Every Common Problem and Fix GPU not detected? Running at 1/30th speed on CPU? OOM crashes mid-generation? Every common Ollama error with exact diagnostic commands and fixes for Mac, Windows, and Linux. Updated for v0.17.7 and Qwen 3.5.
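The num_ctx fix from the slow-inference and CUDA OOM guides above can be applied per request. This sketch uses Ollama's documented options field; the model tag and the 8192 value are examples to adjust until the model plus its KV cache fits back in VRAM.

```python
# Lowest-effort fix for VRAM overflow in Ollama: cap the context allocation
# per request. num_ctx is the documented Ollama option; 8192 is just an example.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-r1:14b",
        "prompt": "Why did my last request run out of memory?",
        "options": {"num_ctx": 8192},  # smaller context -> smaller KV cache
        "stream": False,
    },
)
print(resp.json()["response"])
```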