Optimization
TurboQuant Explained: How Google's KV Cache Trick Cuts Memory 6x With Zero Quality Loss
Google's TurboQuant compresses the KV cache 6x with zero accuracy loss. Here's what it actually does, how it works in llama.cpp and MLX, and what it means for running bigger models on your GPU.
Why the Best AI Agents Know When to Do Nothing
Six practical patterns for building AI agents that stop wasting tokens. Confidence gates, cost checks, explicit no-ops, cooldowns, and exit conditions that actually work.
Ollama on Mac: Setup and Optimization Guide (2026)
Install Ollama on Apple Silicon, verify Metal GPU is active, and tune it for your Mac's RAM. Config for M1 through M4 Ultra with model picks per memory tier.
Speculative Decoding: Free 20-50% Speed Boost for Local LLMs
Speculative decoding uses a small draft model to propose tokens that the big model then verifies. Same output, 20-50% faster. Setup guide for LM Studio and llama.cpp.
KV Cache: Why Context Length Eats Your VRAM (And How to Fix It)
The KV cache is why your 8B model OOMs at 32K context. Full formula, worked examples for popular models, and 6 optimization techniques to cut KV VRAM usage.
LM Studio Tips & Tricks: Hidden Features
Speculative decoding for 20-50% faster output, the MLX runtime that's 21-87% faster on Apple Silicon, a built-in OpenAI-compatible API, and the GPU offload settings most users miss.