Gotchas
The Local AI Complexity Cliff: Why the Jump from Hello World to Useful Is So Hard
Getting Ollama running takes 5 minutes. Building something useful takes weeks of hitting walls you didn't know existed. Here's an honest map of every stage, with time estimates and what unlocks at each level.
The Benchmarks Lie: Why LLM Scores Don't Predict Real-World Performance
MMLU scores drop 14-17 points when contamination is removed. HumanEval is saturated at 94%. Models have, in effect, trained on the test set. Here's what to measure instead.
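To make "contamination" concrete: decontamination pipelines typically flag benchmark items whose text overlaps a model's training corpus. A minimal sketch of that idea, using n-gram overlap — the function names, the n-gram length, and the 50% threshold are illustrative assumptions, not any lab's actual procedure:

```python
# Hypothetical sketch: flag a benchmark item as contaminated when a large
# fraction of its word n-grams appear verbatim in a training-corpus document.
# Real decontamination pipelines differ (substring checks, fuzzy matching).

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contaminated(benchmark_item: str, corpus_docs: list[str],
                 n: int = 8, threshold: float = 0.5) -> bool:
    """True if >= threshold of the item's n-grams appear in any corpus doc."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return False
    for doc in corpus_docs:
        overlap = len(item_grams & ngrams(doc, n)) / len(item_grams)
        if overlap >= threshold:
            return True
    return False
```

Remove every item this flags, rescore, and the gap between the original and decontaminated score is the part of the benchmark the model memorized rather than learned.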
The 8GB VRAM Trap: What 'Runs on 8GB' Actually Means
Every local AI tutorial says 'runs on 8GB!' — and technically it does. What they don't tell you about quantization cliffs, tiny context windows, and why a $275 used GPU changes everything.
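The "technically it does" part comes down to arithmetic: weights at some bits-per-weight, plus a KV cache that grows with context length. A back-of-envelope estimator — all numbers here are rough assumptions (quantization overhead, GQA dimensions, runtime overhead all vary by model and by llama.cpp/vLLM/etc. settings), not measured values:

```python
# Rough VRAM estimate: quantized weights + KV cache. Runtime overhead
# (CUDA context, activation buffers) is NOT included and can add 0.5-1GB+.

def vram_gb(params_b: float, bits_per_weight: float,
            ctx_tokens: int, n_layers: int, kv_dim: int,
            kv_bytes: int = 2) -> float:
    weights = params_b * 1e9 * bits_per_weight / 8        # model weights
    kv = 2 * ctx_tokens * n_layers * kv_dim * kv_bytes    # K and V caches
    return (weights + kv) / 1e9

# Assumed example: an 8B model at ~4.5 effective bits/weight (Q4 plus quant
# metadata), 8k context, 32 layers, 1024 total KV head dim (GQA), fp16 cache.
print(round(vram_gb(8, 4.5, 8192, 32, 1024), 2))  # ~5.57 GB before overhead
```

Under those assumptions the model "fits" 8GB — until you want a longer context or a higher-precision quant, which is exactly where the cliffs in the article live.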
Prompt Debt: When Your System Prompt Becomes Unmaintainable Spaghetti
Your system prompt started at 200 words. Six months later it's 3,000 words of contradictory instructions and panic patches. Here's how prompt debt accumulates, what it costs, and how to pay it down.
Ghost Knowledge: When Your RAG System Cites Documents That No Longer Exist
Your RAG system confidently quotes a policy that was updated months ago. The old version is still in the vector database. Nobody notices until the wrong answer costs real money. Here's how to find and fix ghost knowledge.
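Finding ghost knowledge reduces to diffing what the vector store indexed against the current source of truth. A minimal sketch, assuming you stored a content hash alongside each embedding — `IndexedDoc` and the hash-based comparison are illustrative, not any particular vector database's API:

```python
# Hypothetical ghost-knowledge audit: compare the vector store's view of the
# world against the live document source. "Ghosts" are indexed docs that were
# deleted at the source; "stale" docs still exist but changed after embedding.

from dataclasses import dataclass

@dataclass
class IndexedDoc:
    doc_id: str
    content_hash: str  # hash of the exact text that was embedded

def find_ghosts(indexed: list[IndexedDoc],
                live: dict[str, str]) -> tuple[list[str], list[str]]:
    """Return (deleted_ids, stale_ids) given live {doc_id: content_hash}."""
    deleted = [d.doc_id for d in indexed if d.doc_id not in live]
    stale = [d.doc_id for d in indexed
             if d.doc_id in live and live[d.doc_id] != d.content_hash]
    return deleted, stale
```

Run an audit like this on a schedule: delete the ghosts' vectors, re-embed the stale ones, and the old policy stops being citable.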
Distilled vs Frontier Models for Local AI — What You're Actually Getting
That local model you love was probably trained on stolen outputs from Claude or GPT. Here's what distillation actually does to a model's reasoning, where it breaks, and why it matters most for agentic work.
AI Tool Sprawl: You're Running 6 AI Tools and None of Them Talk to Each Other
Ollama for local chat, LM Studio for testing, ChatGPT for the hard stuff, Claude for writing, Copilot in your editor, Open WebUI as a frontend. Six tools, zero integration. Here's how to consolidate without losing capability.