Optimization
Speculative Decoding: Free 20-50% Speed Boost for Local LLMs
Speculative decoding uses a small draft model to predict tokens verified by the big model. Same output, 20-50% faster. Setup guide for LM Studio and llama.cpp.
KV Cache: Why Context Length Eats Your VRAM (And How to Fix It)
The KV cache is why your 8B model OOMs at 32K context. Full formula, worked examples for popular models, and 6 optimization techniques to cut KV VRAM usage.
LM Studio Tips & Tricks: Hidden Features
Speculative decoding for 20-50% faster output, MLX that's 21-87% faster on Mac, a built-in OpenAI-compatible API, and the GPU offload settings most users miss.