Kv-Cache
TurboQuant Explained: How Google's KV Cache Trick Cuts Memory 6x With Zero Quality Loss
Google's TurboQuant compresses the KV cache 6x with zero accuracy loss. Here's what it actually does, how it works in llama.cpp and MLX, and what it means for running bigger models on your GPU.
KV Cache: Why Context Length Eats Your VRAM (And How to Fix It)
The KV cache is why your 8B model OOMs at 32K context. Full formula, worked examples for popular models, and 6 optimization techniques to cut KV VRAM usage.
Memory Leak in Long Conversations: Causes and Fixes
VRAM climbs with every message until your model crashes? It's probably KV cache growth, not a leak. How to diagnose, monitor, and fix memory issues in local LLMs.
CUDA Out of Memory: What It Means and How to Fix It
CUDA out of memory means your model doesn't fit in VRAM. Seven fixes ranked by effort — context length, KV cache quantization, model quant, CPU offload — with tool-specific commands for Ollama, llama.cpp, and LM Studio.