Inference
Run LLMs on Old Phones: A Practical Guide to Mobile AI Inference
That old Pixel 6 or Galaxy S21 in your drawer can run a local LLM. Realistic tok/s by phone tier, Termux setup, app options, and an honest phone vs Raspberry Pi comparison.
LLM Running Slow? Two Different Problems, Two Different Fixes
Slow local LLM? Separate time-to-first-token from generation speed. Fix prompt processing with batch size and Flash Attention. Fix tok/s with GPU layers, quantization, and context length.
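The two fixes the post describes map to distinct llama.cpp flags. A minimal sketch (flag names and the model filename are illustrative and vary by build; check `llama-cli --help` on your version):

```shell
# Problem 1: slow time-to-first-token (prompt processing).
# Raise the batch size and enable Flash Attention.
llama-cli -m model.gguf -b 2048 --flash-attn -p "Summarize this document: ..."

# Problem 2: slow generation speed (tok/s).
# Offload more layers to the GPU, pick a smaller quant, trim the context window.
llama-cli -m model-q4_k_m.gguf -ngl 99 -c 4096 -p "Summarize this document: ..."
```

The point is that tuning one set of flags will not fix the other symptom, so measure both numbers before touching anything.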
Apple Neural Engine for LLM Inference: What Actually Works
Apple Silicon has a dedicated Neural Engine that most LLM tools ignore. Here's what it can do for inference, what it can't, and whether ANE-based tools like ANEMLL are worth trying today.
SmarterRouter: A VRAM-Aware LLM Gateway for Your Local AI Lab
An intelligent router that profiles your models, manages VRAM, caches responses semantically, and auto-picks the best model for each prompt. Works with Ollama and llama.cpp.
Fastest Local LLM Setup: Ollama vs vLLM vs llama.cpp Real Benchmarks
vLLM handles 4x the concurrent load of Ollama on identical hardware. But for single-user local use, Ollama is all you need. Benchmarks, memory usage, and a dead-simple decision framework. Updated for Ollama v0.17.7, vLLM v0.17.0, and llama.cpp with MCP support.
CPU-Only LLMs: What Actually Works
What works when you have no GPU: best model picks, real speed benchmarks, and a budget dual-Xeon server build for 70B models.
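For CPU-only inference, thread count and NUMA placement are the main knobs. A hedged sketch, assuming llama.cpp on a dual-socket box (the model filename and core counts are placeholders; flag spellings vary by build, so verify with `llama-cli --help`):

```shell
# Match -t to physical cores, not hyperthreads; oversubscribing usually hurts tok/s.
# --numa distribute spreads the model's weights across both sockets' memory.
llama-cli -m llama-70b-q4_k_m.gguf -t 32 --numa distribute -c 4096 -p "Hello"
```

A quant that fits entirely in RAM matters more here than on GPU setups, since any swapping collapses throughput.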