Models
The Benchmarks Lie: Why LLM Scores Don't Predict Real-World Performance
MMLU scores drop 14-17 points when contamination is removed. HumanEval is saturated at 94%. Models trained on the test set. Here's what to measure instead.
RWKV-7: Infinite Context, Zero KV Cache — The Local-First Architecture
RWKV-7 uses O(1) memory per token. Context length doesn't increase VRAM. At all. 16 tok/s on a Raspberry Pi. Here's why it matters for local AI and how to run it.
Qwen 3.5 for Local AI: Which Model, Which Quant, Which GPU
Qwen 3.5 dropped Feb 24 with four models from 27B to 397B. The 35B-A3B runs at 194 tok/s on an RTX 5090. The 27B matches GPT-5 mini on SWE-bench. Here's what to run on your hardware.
LiquidAI LFM2: The Non-Transformer Model Worth Running Locally
LFM2-24B has 24 billion parameters but only 2.3 billion active per token. It's not a transformer. GGUF available, fits in 32GB RAM. Here's what's different and whether it matters for local inference.
Distilled vs Frontier Models for Local AI — What You're Actually Getting
That local model you love was probably trained on stolen outputs from Claude or GPT. Here's what distillation actually does to a model's reasoning, where it breaks, and why it matters most for agentic work.
RWKV-7 Local Guide: Infinite Context, Zero KV Cache, Runs on Anything
RWKV-7 held 4.7 tok/s flat across 10 turns on a $50 mini PC while gemma3 crashed at turn 6. An RNN that trains like a transformer, runs with constant memory, and fits on hardware transformers can't touch.
nanollama: Train Your Own Llama 3 From Scratch on Custom Data
Pretrain Llama 3 architecture models from raw text, export to GGUF, and run with llama.cpp. Forked from Karpathy's nanochat. 46M to 7B parameters.
MoE Models Explained: Why Mixtral Uses 46B Parameters But Runs Like 13B
Mixture of Experts explained for local AI — why MoE models run fast but still need full VRAM. Mixtral, DeepSeek V3, DBRX compared with dense model alternatives.
Qwen vs Llama vs Mistral: Which Model Family Should You Build On?
Qwen has 201 languages and a model for every task. Llama has the biggest community. Mistral pioneered efficient MoE. Decision framework for choosing your model family in 2026.
Ouro-2.6B-Thinking: ByteDance's Looped Model That Punches Like an 8B
Ouro-2.6B loops through the same transformer blocks 4 times to match 8B models at 2.6B parameters. Under 2GB at Q4. How the architecture works and why it matters.
Qwen 3.5: What Local AI Builders Need to Know
Qwen3.5-397B with 17B active params — frontier-class open weights under Apache 2.0. Hardware requirements, quantization options, known bugs, and who can actually run it.
Mixtral VRAM Requirements: 8x7B and 8x22B at Every Quantization Level
Mixtral 8x7B has 46.7B params but only 12.9B activate per token. You still need VRAM for all 46.7B. Exact VRAM for every quant from Q2 to FP16.
Qwen3 Complete Guide: Every Model from 0.6B to 235B
Qwen3 is the best open model family for budget local AI. Dense models from 0.6B to 32B, MoE models that punch above their weight, and a /think toggle no one else has.
Llama 4 vs Qwen3 vs DeepSeek V3.2: Which to Run Locally in 2026
Llama 4 needs 55GB. DeepSeek V3.2 needs 350GB. Qwen3 runs on 8GB. Here's who wins at each VRAM tier and use case for local AI in 2026.
Llama 4 Guide: Running Scout and Maverick Locally
Complete Llama 4 guide for local AI — Scout (109B MoE, 17B active) and Maverick (400B). VRAM requirements, Ollama setup, benchmarks, and honest hardware reality check.
GPT-OSS Guide: OpenAI's First Open Model for Local AI
GPT-OSS 20B is OpenAI's first open-weight model since GPT-2. MoE with 3.6B active params, MXFP4 at 13GB, 128K context, Apache 2.0. Here's how to run it.
DeepSeek V3.2 Guide: What Changed and How to Run It Locally
DeepSeek V3.2 competes with GPT-5 on benchmarks. The full model needs 350GB+ VRAM. But the R1 distills run on a $200 used GPU — and they're shockingly good.
CodeLlama vs DeepSeek Coder vs Qwen Coder: Best Local Coding Models Compared
CodeLlama vs DeepSeek Coder vs Qwen Coder vs Codestral benchmarked: HumanEval scores, VRAM per quant, and speed tests. Qwen 7B beats CodeLlama 70B.
Phi Models Guide: Microsoft's Small but Mighty LLMs
Phi-4 14B scores 84.8% on MMLU — matching models 5x its size — and fits on a 12GB GPU at Q4. The full Phi lineup from 3.8B to 14B with VRAM needs, benchmarks, and honest weaknesses.
Gemma Models Guide: Google's Lightweight Local LLMs
Gemma 3 27B beats Gemini 1.5 Pro on benchmarks and runs on a single GPU. The 4B outperforms Gemma 2 27B. Full lineup from 1B to 27B with VRAM needs, speeds, and honest comparisons.
Mixtral 8x7B & 8x22B VRAM Requirements
Mixtral 8x7B and 8x22B VRAM requirements at every quantization level. Exact numbers from Q2 to FP16, GPU recommendations, and KV cache impact explained.
Mistral & Mixtral Guide: Every Model Worth Running Locally
Nemo 12B with 128K context fits in 8GB VRAM at Q4. Mistral 7B runs on 4GB. Mixtral 8x7B needs 26-32GB and is now outpaced by Qwen 3. What's still worth running and what's been superseded.
Qwen Models Guide: The AI Family You're Missing
Complete Qwen models guide covering Qwen 3, Qwen 2.5 Coder, and Qwen-VL. VRAM requirements, Ollama setup, thinking mode, and benchmarks vs Llama and DeepSeek.
Llama 3 Guide: Every Size from 1B to 405B
Complete Llama 3 guide covering every model from 1B to 405B. VRAM requirements, Ollama setup, benchmarks vs Qwen 3, and which size fits your hardware.
DeepSeek Models Guide: R1, V3, and Coder
Complete DeepSeek models guide covering R1, V3, and Coder locally. Which distilled R1 to pick for your GPU, VRAM requirements, and benchmarks vs Qwen 3.
Model Formats Explained: GGUF vs GPTQ vs AWQ vs EXL2
GGUF vs GPTQ vs AWQ vs EXL2 model formats explained. Learn what each format does, which tools support them, and how to choose the right one for your GPU.
Best Models Under 3B: Small LLMs That Work
The best models under 3B parameters for laptops, old GPUs, Raspberry Pi, and phones. What works, what doesn't, and which tiny LLM to pick for your use case.
Quantization Explained: What It Means for Local AI
Q4_K_M shrinks a 7B model from 14GB to ~4GB while keeping 90-95% quality. What every quantization format means, how much VRAM each saves, and which to pick for your GPU.
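The size math that runs through these guides is the same everywhere: parameter count times bits-per-weight, divided by 8, gives gigabytes of weights. Here's a minimal sketch; the bits-per-weight values are ballpark community averages for GGUF quants (k-quants carry per-block scale overhead, so Q4_K_M lands near 4.8 bits, not 4.0), not exact figures for any one file:

```python
# Rough weight-size estimator for common GGUF quantization levels.
# Bits-per-weight values are approximate averages, including block-scale
# overhead; real files vary slightly by architecture and quant mix.
BITS_PER_WEIGHT = {
    "Q2_K": 2.6,
    "Q4_K_M": 4.8,
    "Q8_0": 8.5,
    "FP16": 16.0,
}

def model_size_gb(params_billion: float, quant: str) -> float:
    """Approximate weight size in GB. Excludes KV cache and runtime overhead."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billion * 1e9 * bits / 8 / 1e9

# A 7B model: ~14GB at FP16, ~4GB at Q4_K_M.
print(f"7B @ FP16    ≈ {model_size_gb(7, 'FP16'):.1f} GB")
print(f"7B @ Q4_K_M  ≈ {model_size_gb(7, 'Q4_K_M'):.1f} GB")

# MoE caveat: Mixtral 8x7B activates only 12.9B params per token,
# but ALL 46.7B must sit in memory, so you size for the full count.
print(f"46.7B @ Q4_K_M ≈ {model_size_gb(46.7, 'Q4_K_M'):.1f} GB")
```

The MoE line is why Mixtral 8x7B lands in the 26-32GB range at Q4 despite running like a 13B model: sparse activation saves compute, not VRAM.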