Troubleshooting
The 8GB VRAM Trap: What 'Runs on 8GB' Actually Means
Every local AI tutorial says 'runs on 8GB!' — and technically it does. What they don't tell you about quantization cliffs, tiny context windows, and why a $275 used GPU changes everything.
Why Is My Local LLM So Slow? A Diagnostic Guide
Local LLM running slow? Check GPU vs CPU inference, VRAM offloading, quantization, context length, backend choice, and thermals. Find your fix in 60 seconds.
ROCm Not Detecting GPU: AMD Troubleshooting Guide
AMD GPU not detected in ROCm? Check supported GPUs, fix rocminfo errors, apply the HSA_OVERRIDE_GFX_VERSION workaround for unsupported cards, and fix Ollama/llama.cpp ROCm builds.
Ollama Not Using GPU: Complete Fix Guide
Ollama running on CPU instead of GPU? Diagnose with ollama ps and nvidia-smi, then fix CUDA drivers, ROCm setup, VRAM limits, and Docker GPU passthrough.
Ollama API Connection Refused: Quick Fixes
Ollama API returning connection refused? Check if it's running, fix the port, open it to the network, and solve Docker and WSL2 connectivity issues.
Model Outputs Garbage: Debugging Bad Generations
Local LLM outputs repetitive loops, gibberish, or wrong answers? Seven causes with exact fixes — from corrupted downloads to wrong chat templates.
Memory Leak in Long Conversations: Causes and Fixes
VRAM climbs with every message until your model crashes? It's probably KV cache growth, not a leak. How to diagnose, monitor, and fix memory issues in local LLMs.
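The "VRAM climbs with every message" symptom above is usually arithmetic, not a leak. A back-of-the-envelope sketch (model shapes are illustrative, roughly a 7B Llama-style model with an fp16 cache; real models vary):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    # Keys AND values are both cached per layer, hence the factor of 2.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * ctx_len

# Illustrative 7B-class shapes: 32 layers, 32 KV heads, head dim 128.
per_token = kv_cache_bytes(32, 32, 128, ctx_len=1)
print(per_token)  # 524288 bytes, i.e. 0.5 MiB of VRAM per token of context
print(kv_cache_bytes(32, 32, 128, ctx_len=8192) / 2**30)  # 4.0 GiB at a full 8K context
```

Every token the conversation accumulates adds cache, so usage grows until the context window caps it (or the process runs out of VRAM first).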
llama.cpp Build Errors: Common Fixes for Every Platform
llama.cpp won't build? CMake too old, CUDA not found, Metal not enabled, Visual Studio missing. Exact error messages and one-liner fixes for every platform.
GGUF File Won't Load: Format and Compatibility Fixes
GGUF model won't load? Version mismatch, corrupted download, wrong format, split files, or memory issues. Find your error and fix it in under a minute.
CUDA Out of Memory: What It Means and How to Fix It
CUDA out of memory means your model doesn't fit in VRAM. Six fixes ranked by effort — from closing Chrome to CPU offloading — plus how to prevent it.
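Whether a model fits is mostly a weights-times-precision calculation. A rough sketch (bits-per-weight figures are nominal; real GGUF quants carry some per-block overhead, and the KV cache needs room on top):

```python
def model_gib(n_params, bits_per_weight):
    # Weight storage only: parameters times bits, converted to GiB.
    return n_params * bits_per_weight / 8 / 2**30

# A 7B model at common precisions -- why fp16 overflows an 8 GB card
# while a 4-bit quant leaves headroom for context.
for name, bits in [("fp16", 16), ("Q8_0", 8), ("Q4_0", 4)]:
    print(f"{name}: {model_gib(7e9, bits):.1f} GiB")
```

If the weights alone exceed your VRAM, no amount of Chrome-closing helps; drop to a smaller quant or offload layers to CPU.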
Context Length Exceeded: What To Do When Your Model Runs Out of Space
Model forgetting earlier messages or throwing context errors? How context length works, what happens when it fills, and practical fixes for chat, RAG, and coding.
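The most common chat-side fix for a full context is a sliding window over the history. A minimal sketch (the word-count tokenizer here is a stand-in; a real app would count with the model's own tokenizer):

```python
def trim_history(messages, max_tokens, count=lambda m: len(m.split())):
    # Walk the history newest-first, keeping messages until the budget is spent.
    kept, total = [], 0
    for msg in reversed(messages):
        cost = count(msg)
        if total + cost > max_tokens:
            break
        kept.append(msg)
        total += cost
    return list(reversed(kept))

history = ["hello there", "hi how can I help", "explain quantization in detail please"]
print(trim_history(history, 11))  # drops the oldest message to fit the budget
```

Truncating whole messages from the oldest end keeps recent turns intact; summarizing the dropped turns into one message is the usual next step when losing them outright hurts.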
Local AI Troubleshooting Guide: Every Common Problem and Fix
Model running 30x slower than expected? Probably on CPU instead of GPU. Fixes for won't-load errors, CUDA crashes, garbled output, and OOM across Ollama and LM Studio.
Ollama Troubleshooting Guide: Every Common Problem and Fix
GPU not detected? Running at 1/30th speed on CPU? OOM crashes mid-generation? Every common Ollama error with exact diagnostic commands and fixes for Mac, Windows, and Linux.