MoE
Wicked Fast Gemma 4 vs Qwen 3.6 on RTX 3090: 3.10x Tested
Same RTX 3090, same llama.cpp build, same bench. Gemma 4 26B-A4B Q4_K_XL: 128 tok/s mean. Qwen 3.6-27B Q4_K_M: 41 tok/s. 3.10x faster, firsthand.
Best Way to Run Qwen 3.6 35B MoE Locally: VRAM, Speed, Setup
Qwen 3.6-35B-A3B has 35B total params but only 3B active per token. Real tok/s on RTX 3090, 4090, 5070 Ti, dual 5060 Ti, and M3 Ultra. Quants and setup.
FP4 Just Landed in llama.cpp: NVFP4 vs MXFP4 Explained (2026)
NVFP4 in llama.cpp, MXFP4 in ik_llama.cpp. The first practical FP4 quantization for the GGUF ecosystem — what works, what doesn't, and what to test.
DeepSeek V4 Flash vs Pro: What Actually Dropped and How to Run It
DeepSeek V4 preview dropped April 23 with two MoE variants: Pro at 1.6T/49B active and Flash at 284B/13B active. Both MIT, both 1M context. Flash is the news.
Qwen 3.6 Complete Guide: 27B Dense, 35B-A3B MoE, and Which to Use
Qwen 3.6 landed in two open-weight flavors: 27B dense and 35B-A3B MoE. Benchmarks, hardware fit, and which variant to run on your GPU.
Gemma 4 Just Dropped: What Local AI Builders Need to Know
Google's Gemma 4 is here: dense and MoE variants, Apache 2.0, multimodal with vision and audio. VRAM requirements, benchmarks, and how it compares to Qwen 3.5.
DeepSeek V4: Everything We Know Before It Drops
DeepSeek V4 launches next week with native image and video generation, 1M context, and rumored 1T MoE params with only 32B active. Here's what local AI builders need to know and how to prepare.
Best Qwen 3.5 Models Ranked: Every Size, Every GPU, Every Quant
Complete ranking of all Qwen 3.5 models from 0.8B to 397B. VRAM requirements, speed benchmarks, and which model to pick for your hardware.
Qwen 3.5 Locally — 27B vs 35B-A3B vs 122B, Which Model Fits Your GPU
Qwen 3.5 and 3.6 on local hardware. 27B dense vs 35B-A3B MoE vs 122B compared. VRAM tables, community tok/s on RTX 3090, and which to pick for your card.
LiquidAI LFM2: The First Hybrid Model Built for Your Hardware
LFM2-24B-A2B runs at 112 tok/s on CPU with only 2.3B active params. Not a transformer. GGUF files from 13.5GB, Ollama and llama.cpp setup, and where it beats Qwen.
Best Way to Run Qwen 3.5 on Mac: MLX vs Ollama Speed Test
MLX runs Qwen 3.5 up to 2x faster than Ollama on Apple Silicon. Head-to-head benchmarks on M1 through M4, with setup instructions for both.
Best Qwen 3.5 Setup: Which Model Fits Your GPU (Complete Cheat Sheet)
Pick the right Qwen 3.5 or 3.6 model for your hardware. Covers 0.8B through 397B with VRAM requirements, quant recommendations, and benchmarks for every GPU tier. Updated April 2026 with Qwen 3.6-35B-A3B coverage.
MoE Models Explained: Why Mixtral Uses 46B Parameters But Runs Like 13B
Mixture of Experts explained for local AI — why MoE models run fast but still need full VRAM. Mixtral, DeepSeek V3, DBRX compared with dense model alternatives.
Mixtral VRAM Requirements: 8x7B and 8x22B at Every Quantization Level
Mixtral 8x7B has 46.7B params but only 12.9B activate per token. You still need VRAM for all 46.7B. Exact VRAM for every quant from Q2 to FP16.
Qwen3 Complete Guide: Every Model from 0.6B to 235B
Qwen3 is the best open model family for budget local AI. Dense models from 0.6B to 32B, MoE models that punch above their weight, and a /think toggle no one else has.
Llama 4 Guide: Running Scout and Maverick Locally (2026)
Complete Llama 4 Scout (109B MoE) and Maverick guide for local AI. VRAM, Ollama and vLLM setup, hardware reality, and how it stacks against Qwen 3.6.
GPT-OSS Guide: OpenAI's First Open Model for Local AI
GPT-OSS 20B is OpenAI's first open-weight model. MoE with 3.6B active params, MXFP4 at 13GB, 128K context, Apache 2.0. Here's how to run it.
DeepSeek V3.2 Guide: What Changed and How to Run It Locally
DeepSeek V3.2 competes with GPT-5 on benchmarks. The full model needs 350GB+ VRAM. But the R1 distills run on a $200 used GPU — and they're shockingly good.
Best Dual-GPU Local AI Setup: RTX 3090, 5060 Ti (2026)
Dual RTX 3090, 2x RTX 5060 Ti, 2x 2080 Ti modded, mixed setups: real configs for Qwen 3.6, MoE, 70B. Tensor vs pipeline parallelism, llama.cpp/vLLM.
Mixtral 8x7B & 8x22B VRAM Requirements
Mixtral 8x7B and 8x22B VRAM requirements at every quantization level. Exact numbers from Q2 to FP16, GPU recommendations, and KV cache impact explained.
Are Mistral Models Still Worth Running? Only Nemo 12B (Here's Why)
Mistral Medium 3.5-128B dropped April 29, 2026: dense 128B, 256k context, Modified MIT. Hardware reality, license caveats, which Mistral to actually run.