Troubleshooting
The 8GB VRAM Trap: What 'Runs on 8GB' Actually Means
Every local AI tutorial says 'runs on 8GB!' — and technically it does. What they don't tell you about quantization cliffs, tiny context windows, and why a $275 used GPU changes everything.
Why Is My Local LLM So Slow? A Diagnostic Guide
Local LLM running slow? Check GPU vs CPU inference, VRAM offloading, quantization, context length, backend choice, and thermals. Find your fix in 60 seconds.
ROCm Not Detecting GPU: AMD Troubleshooting Guide
AMD GPU not detected in ROCm? Check supported GPUs, fix rocminfo errors, apply the HSA_OVERRIDE_GFX_VERSION workaround for unsupported cards, and fix Ollama/llama.cpp ROCm builds.
Ollama Not Using GPU: Complete Fix Guide
Ollama running on CPU instead of GPU? Diagnose with ollama ps and nvidia-smi, then fix CUDA drivers, ROCm setup, VRAM limits, and Docker GPU passthrough.
Ollama API Connection Refused: Quick Fixes
Ollama API returning connection refused? Check if it's running, fix the port, open it to the network, and solve Docker and WSL2 connectivity issues.
Model Outputs Garbage: Debugging Bad Generations
Local LLM outputs repetitive loops, gibberish, or wrong answers? Seven causes with exact fixes — from corrupted downloads to wrong chat templates.
Memory Leak in Long Conversations: Causes and Fixes
VRAM climbs with every message until your model crashes? It's probably KV cache growth, not a leak. How to diagnose, monitor, and fix memory issues in local LLMs.
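The "VRAM climbs with every message" symptom above is usually arithmetic, not a leak. A back-of-the-envelope sketch (model shapes are illustrative, roughly a 7B Llama-style model with an fp16 cache; real models vary):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    # Keys AND values are both cached per layer, hence the factor of 2.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * ctx_len

# Illustrative 7B-class shapes: 32 layers, 32 KV heads, head dim 128.
per_token = kv_cache_bytes(32, 32, 128, ctx_len=1)
print(per_token)  # 524288 bytes, i.e. 0.5 MiB of VRAM per token of context
print(kv_cache_bytes(32, 32, 128, ctx_len=8192) / 2**30)  # 4.0 GiB at a full 8K context
```

Every token the conversation accumulates adds cache, so usage grows until the context window caps it (or the process runs out of VRAM first).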
llama.cpp Build Errors: Common Fixes for Every Platform
llama.cpp won't build? CMake too old, CUDA not found, Metal not enabled, Visual Studio missing. Exact error messages and one-liner fixes for every platform.
GGUF File Won't Load: Format and Compatibility Fixes
GGUF model won't load? Version mismatch, corrupted download, wrong format, split files, or memory issues. Find your error and fix it in under a minute.
CUDA Out of Memory: What It Means and How to Fix It
CUDA out of memory means your model doesn't fit in VRAM. Six fixes ranked by effort — from closing Chrome to CPU offloading — plus how to prevent it.
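Whether a model fits is mostly a weights-times-precision calculation. A rough sketch (bits-per-weight figures are nominal; real GGUF quants carry some per-block overhead, and the KV cache needs room on top):

```python
def model_gib(n_params, bits_per_weight):
    # Weight storage only: parameters times bits, converted to GiB.
    return n_params * bits_per_weight / 8 / 2**30

# A 7B model at common precisions -- why fp16 overflows an 8 GB card
# while a 4-bit quant leaves headroom for context.
for name, bits in [("fp16", 16), ("Q8_0", 8), ("Q4_0", 4)]:
    print(f"{name}: {model_gib(7e9, bits):.1f} GiB")
```

If the weights alone exceed your VRAM, no amount of Chrome-closing helps; drop to a smaller quant or offload layers to CPU.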
Context Length Exceeded: What To Do When Your Model Runs Out of Space
Model forgetting earlier messages or throwing context errors? How context length works, what happens when it fills, and practical fixes for chat, RAG, and coding.
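The most common chat-side fix for a full context is a sliding window over the history. A minimal sketch (the word-count tokenizer here is a stand-in; a real app would count with the model's own tokenizer):

```python
def trim_history(messages, max_tokens, count=lambda m: len(m.split())):
    # Walk the history newest-first, keeping messages until the budget is spent.
    kept, total = [], 0
    for msg in reversed(messages):
        cost = count(msg)
        if total + cost > max_tokens:
            break
        kept.append(msg)
        total += cost
    return list(reversed(kept))

history = ["hello there", "hi how can I help", "explain quantization in detail please"]
print(trim_history(history, 11))  # drops the oldest message to fit the budget
```

Truncating whole messages from the oldest end keeps recent turns intact; summarizing the dropped turns into one message is the usual next step when losing them outright hurts.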
Local AI Troubleshooting Guide: Every Common Problem and Fix
Model running 30x slower than expected? Probably on CPU instead of GPU. Fixes for won't-load errors, CUDA crashes, garbled output, and OOM across Ollama and LM Studio.
Ollama Troubleshooting Guide: Every Common Problem and Fix
GPU not detected? Running at 1/30th speed on CPU? OOM crashes mid-generation? Every common Ollama error with exact diagnostic commands and fixes for Mac, Windows, and Linux.