Llama-Cpp

How to Fix Slow Qwen 3.6 27B on RTX 3090 (10-80 tok/s)
Qwen 3.6-27B at 12 tok/s on a 3090 when others report 35? The 8-step diagnostic checklist for offload, quants, templates, power limits, and backend choice.
May 1, 2026
Best Way to Get 2x Token Output on RTX 3090: Qwen 3.6 + DFlash
Luce DFlash + DDTree pushes Qwen 3.6-27B Q4_K_M from 35 tok/s to 69 tok/s on a single RTX 3090. Real benchmarks, setup, and honest limits.
Apr 27, 2026
FP4 Just Landed in llama.cpp: NVFP4 vs MXFP4 Explained (2026)
NVFP4 in llama.cpp, MXFP4 in ik_llama.cpp. The first practical FP4 quantization for the GGUF ecosystem — what works, what doesn't, and what to test.
Apr 25, 2026
Run LLMs on Old Phones: A Practical Guide to Mobile AI Inference
That old Pixel 6 or Galaxy S21 in your drawer can run a local LLM. Realistic tok/s by phone tier, Termux setup, app options, and an honest phone vs Raspberry Pi comparison.
Mar 6, 2026
LLM Running Slow? Two Different Problems, Two Different Fixes
Slow local LLM? Separate time-to-first-token from generation speed. Fix prompt processing with batch size and Flash Attention. Fix tok/s with GPU layers, quantization, and context length.
Mar 5, 2026
Intel Arc B580 for Local LLMs: 12GB VRAM at $250, With Caveats
The Arc B580 gives you 12GB VRAM for $250, but Intel's AI software stack needs work. Real tok/s benchmarks, setup paths, and an honest comparison with the RTX 3060.
Mar 5, 2026
Best Qwen 3.5 Models Ranked: Every Size, Every GPU, Every Quant
Complete ranking of all Qwen 3.5 models from 0.8B to 397B. VRAM requirements, speed benchmarks, and which model to pick for your hardware.
Feb 28, 2026
Speculative Decoding: Free 20-50% Speed Boost for Local LLMs
Speculative decoding uses a small draft model to propose tokens that the large model then verifies. Same output, 20-50% faster. Setup guide for LM Studio and llama.cpp.
Feb 23, 2026
SmarterRouter: A VRAM-Aware LLM Gateway for Your Local AI Lab
Intelligent router that profiles your models, manages VRAM, caches responses semantically, and auto-picks the best model per prompt. Works with Ollama and llama.cpp.
Feb 21, 2026
PaddleOCR-VL: A 0.9B OCR Model That Runs on Any Potato
PaddleOCR-VL does document OCR — text, tables, formulas, charts — in 0.9B parameters. 109 languages. Now runs via llama.cpp and Ollama. Private, local, nearly free.
Feb 20, 2026
llama.cpp Just Got a New Home: What the Hugging Face Acquisition Means for Local AI
ggml.ai — the team behind llama.cpp — is joining Hugging Face. Open source stays open, Georgi keeps the wheel. What changed, what didn't, and what to watch.
Feb 20, 2026
Why Is My Local LLM So Slow? A Diagnostic Guide
Local LLM running slow? Check GPU vs CPU inference, VRAM offloading, quantization, context length, backend choice, and thermals. Find your fix in 60 seconds.
Feb 18, 2026
llama.cpp Build Errors: Common Fixes for Every Platform
llama.cpp won't build or misbehaves at runtime? CMake, CUDA, Gemma 4 thinking-mode, Qwen 3.6 kwargs, num_ctx VRAM overflow. Exact fixes for every platform.
Feb 18, 2026
GGUF File Won't Load: Format and Compatibility Fixes
GGUF model won't load? Version mismatch, corrupted download, wrong format, split files, or memory issues. Find your error and fix it in under a minute.
Feb 18, 2026
Best LLM Speed Trick: ExLlamaV2 vs llama.cpp Benchmarks (50-85% Faster)
Head-to-head speed benchmarks on RTX 3090 and 4090. ExLlamaV2 generates tokens 50-85% faster than llama.cpp on NVIDIA GPUs. Full comparison with setup guides for both.
Feb 14, 2026