llama.cpp
RTX 5090 vs DGX Spark vs AMD: The Ultimate Local LLM Benchmark (2026)
Real llama.cpp benchmarks across the RTX 5090, DGX Spark, and AMD Ryzen AI Max+ 395 with ROCm and Vulkan. Token speeds, VRAM usage, and which hardware wins for local AI.
OpenClaw on Raspberry Pi: What Actually Works (and What Doesn't)
A Pi 5 with 8GB of RAM runs OpenClaw as a gateway backed by cloud APIs. Local LLMs hit 2-7 tok/s on 1.5B-3B models. Step-by-step setup for llama.cpp, Ollama, and OpenClaw on ARM64.
LM Studio vs llama.cpp: Why Your Model Runs Slower in the GUI
LM Studio uses llama.cpp under the hood but often runs 30-50% slower. A bundled runtime that lags behind upstream, UI overhead, and default settings explain the gap. How to benchmark it yourself and when the convenience is worth it.
KV Cache: Why Context Length Eats Your VRAM (And How to Fix It)
The KV cache is why your 8B model OOMs at 32K context. Full formula, worked examples for popular models, and 6 optimization techniques to cut KV VRAM usage.
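A rough sketch of that formula, assuming a Llama-3-8B-like geometry (32 layers, 8 KV heads via GQA, head dim 128) and an FP16 cache:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    # One K tensor and one V tensor per layer, hence the factor of 2.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

# Llama-3-8B-like geometry at 32K context with an FP16 cache:
print(kv_cache_bytes(32, 8, 128, 32768) / 2**30)  # ~4.0 GiB, on top of the model weights
```

Roughly 4 GiB of cache on top of the weights is enough to push an 8B model past the limit of an 8GB card.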
CUDA Out of Memory: What It Means and How to Fix It
CUDA out of memory means your model doesn't fit in VRAM. Seven fixes ranked by effort — context length, KV cache quantization, model quant, CPU offload — with tool-specific commands for Ollama, llama.cpp, and LM Studio.
Function Calling with Local LLMs: Tools, Agents, and Structured Output
Function calling with local LLMs using Ollama and llama.cpp. Qwen 2.5 7B matches GPT-4 accuracy for tool selection. Working code and agentic loop patterns.
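A minimal sketch of the tool-selection flow with the Ollama Python client; the model name and the get_weather tool below are illustrative, and your code is still responsible for executing whatever the model selects:

```python
import ollama

# Illustrative tool definition; the model only picks it and fills in arguments.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = ollama.chat(
    model="qwen2.5:7b",  # assumes the model has already been pulled locally
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
)

# If the model decided to call a tool, its name and arguments arrive here.
for call in response["message"].get("tool_calls") or []:
    print(call["function"]["name"], call["function"]["arguments"])
```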
Structured Output from Local LLMs: JSON, YAML, and Schemas
Ollama's format parameter guarantees valid JSON from any local model. Grammar constraints in llama.cpp go further — 100% schema compliance at the token level. Methods ranked by reliability, with working code examples.
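A minimal sketch of the format parameter with the Ollama Python client (model name and prompt are illustrative):

```python
import json
import ollama

response = ollama.chat(
    model="llama3.1:8b",  # illustrative; any locally pulled model works
    messages=[{
        "role": "user",
        "content": "Return a JSON object with fields 'gpu' and 'vram_gb' for an RTX 3090.",
    }],
    format="json",  # constrains decoding so the reply is always parseable JSON
)

data = json.loads(response["message"]["content"])
print(data)
```

This guarantees parseable JSON but not a particular shape; enforcing a specific schema is where llama.cpp grammar constraints come in.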
Local AI for Privacy: What's Actually Private
Running AI locally keeps prompts off corporate servers — but model downloads, telemetry, and VS Code extensions can still leak data. Here's what's genuinely private, what isn't, and how to close every gap.
Multi-GPU Local AI: Run Models Across Multiple GPUs
Dual RTX 3090s give you 48GB of VRAM and run 70B models at 16-21 tok/s, versus 1 tok/s with CPU offloading. Tensor vs pipeline parallelism, setup guides, and real scaling numbers.
Local AI Troubleshooting Guide: Every Common Problem and Fix
Model running 30x slower than expected? Probably on CPU instead of GPU. Fixes for won't-load errors, CUDA crashes, garbled output, and OOM across Ollama and LM Studio.
Fastest Local LLM Setup: Ollama vs vLLM vs llama.cpp Real Benchmarks
vLLM handles 4x the concurrent load of Ollama on identical hardware. But for single-user local use, Ollama is all you need. Benchmarks, memory usage, and a dead-simple decision framework. Updated for Ollama v0.17.7, vLLM v0.17.0, and llama.cpp with MCP support.
CPU-Only LLMs: What Actually Works
What actually works when you run LLMs on the CPU alone, no GPU required: best model picks, real speed benchmarks, and a budget dual-Xeon server build for 70B models.