Inference
Run LLMs on Old Phones: A Practical Guide to Mobile AI Inference
That old Pixel 6 or Galaxy S21 in your drawer can run a local LLM. Realistic tok/s by phone tier, Termux setup, app options, and an honest phone vs Raspberry Pi comparison.
LLM Running Slow? Two Different Problems, Two Different Fixes
Slow local LLM? Separate time-to-first-token from generation speed. Fix prompt processing with batch size and Flash Attention. Fix tok/s with GPU layers, quantization, and context length.
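The two fixes the post describes map to distinct llama.cpp flags. A minimal sketch (flag names and the model filename are illustrative and vary by build; check `llama-cli --help` on your version):

```shell
# Problem 1: slow time-to-first-token (prompt processing).
# Raise the batch size and enable Flash Attention.
llama-cli -m model.gguf -b 2048 --flash-attn -p "Summarize this document: ..."

# Problem 2: slow generation speed (tok/s).
# Offload more layers to the GPU, pick a smaller quant, trim the context window.
llama-cli -m model-q4_k_m.gguf -ngl 99 -c 4096 -p "Summarize this document: ..."
```

The point is that tuning one set of flags will not fix the other symptom, so measure both numbers before touching anything.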
Apple Neural Engine for LLM Inference: What Actually Works
Apple Silicon has a dedicated Neural Engine that most LLM tools ignore. Here's what it can do for inference, what it can't, and whether ANE-based tools like ANEMLL are worth trying today.
SmarterRouter: A VRAM-Aware LLM Gateway for Your Local AI Lab
An intelligent router that profiles your models, manages VRAM, caches responses semantically, and auto-picks the best model for each prompt. Works with Ollama and llama.cpp.
Fastest Local LLM Setup: Ollama vs vLLM vs llama.cpp Real Benchmarks
vLLM handles 4x the concurrent load of Ollama on identical hardware. But for single-user local use, Ollama is all you need. Benchmarks, memory usage, and a dead-simple decision framework. Updated for Ollama v0.17.7, vLLM v0.17.0, and llama.cpp with MCP support.
CPU-Only LLMs: What Actually Works
What works when you have no GPU: best model picks, real speed benchmarks, and a budget dual-Xeon server build for 70B models.
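For CPU-only inference, thread count and NUMA placement are the main knobs. A hedged sketch, assuming llama.cpp on a dual-socket box (the model filename and core counts are placeholders; flag spellings vary by build, so verify with `llama-cli --help`):

```shell
# Match -t to physical cores, not hyperthreads; oversubscribing usually hurts tok/s.
# --numa distribute spreads the model's weights across both sockets' memory.
llama-cli -m llama-70b-q4_k_m.gguf -t 32 --numa distribute -c 4096 -p "Hello"
```

A quant that fits entirely in RAM matters more here than on GPU setups, since any swapping collapses throughput.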