Inference
TurboQuant Explained: How Google's KV Cache Trick Cuts Memory 6x With Zero Quality Loss
Google's TurboQuant compresses the KV cache 6x with zero accuracy loss. Here's what it actually does, how it works in llama.cpp and MLX, and what it means for running bigger models on your GPU.
Run LLMs on Old Phones: A Practical Guide to Mobile AI Inference
That old Pixel 6 or Galaxy S21 in your drawer can run a local LLM. Realistic tok/s by phone tier, Termux setup, app options, and an honest phone vs Raspberry Pi comparison.
LLM Running Slow? Two Different Problems, Two Different Fixes
Slow local LLM? Separate time-to-first-token from generation speed. Fix prompt processing with batch size and Flash Attention. Fix tok/s with GPU layers, quantization, and context length.
Apple Neural Engine for LLM Inference: What Actually Works
Apple Silicon has a dedicated Neural Engine that most LLM tools ignore. Here's what it can do for inference, what it can't, and whether ANE-based tools like ANEMLL are worth trying today.
SmarterRouter: A VRAM-Aware LLM Gateway for Your Local AI Lab
Intelligent router that profiles your models, manages VRAM, caches responses semantically, and auto-picks the best model per prompt. Works with Ollama and llama.cpp.
Fastest Local LLM Setup: Ollama vs vLLM vs llama.cpp Real Benchmarks
vLLM handles 4x the concurrent load of Ollama on identical hardware. But for single-user local use, Ollama is all you need. Benchmarks, memory usage, and a dead-simple decision framework. Updated for Ollama v0.17.7, vLLM v0.17.0, and llama.cpp with MCP support.
CPU-Only LLMs: What Actually Works
Running LLMs on CPU alone, no GPU required. Best model picks, real speed benchmarks, and a budget dual-Xeon server build for 70B models.