MoE

How to Run GLM 5.2 Locally: GPU, VRAM & Quant Guide
GLM 5.2 is 753B params and 1.51TB at full precision. Run it locally: the live Unsloth quant ladder, every GPU and RAM path, and the quant to actually target.
Jun 21, 2026
Wicked Fast Gemma 4 vs Qwen 3.6 on RTX 3090: 3.10x Tested
Same RTX 3090, same llama.cpp build, same bench. Gemma 4 26B-A4B Q4_K_XL: 128 tok/s mean. Qwen 3.6-27B Q4_K_M: 41 tok/s. 3.10x faster, firsthand.
May 8, 2026
Best Way to Run Qwen 3.6 35B MoE Locally: VRAM, Speed, Setup
Qwen 3.6-35B-A3B has 35B total params but only 3B active per token. Real tok/s on RTX 3090, 4090, 5070 Ti, dual 5060 Ti, and M3 Ultra. Quants and setup.
Apr 28, 2026
FP4 Just Landed in llama.cpp: NVFP4 vs MXFP4 Explained (2026)
NVFP4 in llama.cpp, MXFP4 in ik_llama.cpp. The first practical FP4 quantization for the GGUF ecosystem — what works, what doesn't, and what to test.
Apr 25, 2026
DeepSeek V4 Flash vs Pro: What Actually Dropped and How to Run It
DeepSeek V4 preview dropped April 23 with two MoE variants: Pro at 1.6T/49B active and Flash at 284B/13B active. Both MIT, both 1M context. Flash is the news.
Apr 24, 2026
Qwen 3.6 Complete Guide: 27B Dense, 35B-A3B MoE, and Which to Use
Qwen 3.6 landed in two open-weight flavors: 27B dense and 35B-A3B MoE. Benchmarks, hardware fit, and which variant to run on your GPU.
Apr 24, 2026
Gemma 4 Just Dropped: What Local AI Builders Need to Know
Google's Gemma 4 is here -- dense and MoE variants, Apache 2.0, multimodal with vision and audio. VRAM requirements, benchmarks, and how it compares to Qwen 3.5.
Apr 2, 2026
RTX 5090 Benchmarks: 5090 vs 4090 vs Used 3090 (2026)
5090 community benchmarks across 4K-131K context, prompt-processing tables, 5090-vs-4090 upgrade math, and InsiderLLM's firsthand 3090 results as the value baseline.
Mar 25, 2026
DeepSeek V4: Everything We Know Before It Drops
DeepSeek V4 launches next week with native image and video generation, 1M context, and rumored 1T MoE params with only 32B active. Here's what local AI builders need to know and how to prepare.
Feb 28, 2026
Best Qwen 3.5 Models Ranked: Every Size, Every GPU, Every Quant
Complete ranking of all Qwen 3.5 models from 0.8B to 397B. VRAM requirements, speed benchmarks, and which model to pick for your hardware.
Feb 28, 2026
Qwen 3.5 Locally — 27B vs 35B-A3B vs 122B, Which Model Fits Your GPU
Qwen 3.5 and 3.6 on local hardware. 27B dense vs 35B-A3B MoE vs 122B compared. VRAM tables, community tok/s on RTX 3090, and which to pick for your card.
Feb 26, 2026
LiquidAI LFM2: The First Hybrid Model Built for Your Hardware
LFM2-24B-A2B runs at 112 tok/s on CPU with only 2.3B active params. Not a transformer. GGUF files from 13.5GB, Ollama and llama.cpp setup, and where it beats Qwen.
Feb 26, 2026
Best Way to Run Qwen 3.5 on Mac: MLX vs Ollama Speed Test
MLX runs Qwen 3.5 up to 2x faster than Ollama on Apple Silicon. Head-to-head benchmarks on M1 through M4, with setup instructions for both.
Feb 26, 2026
Best Qwen 3.5 Setup: When to Stay vs Move to 3.6 (2026)
3.5 is Qwen's stable open workhorse — 3.6 replaced only two tiers, 3.7 went closed. Which 3.5 model on which GPU, when to stay vs move to 3.6.
Feb 25, 2026
MoE Models Explained: Why Mixtral Uses 46B Parameters But Runs Like 13B
Mixture of Experts explained for local AI — why MoE models run fast but still need full VRAM. Mixtral, DeepSeek V3, DBRX compared with dense model alternatives.
Feb 23, 2026
Mixtral VRAM Requirements: 8x7B and 8x22B at Every Quantization Level
Mixtral 8x7B has 46.7B params but only 12.9B activate per token. You still need VRAM for all 46.7B. Exact VRAM for every quant from Q2 to FP16.
Feb 17, 2026
Qwen3 Complete Guide: Every Model from 0.6B to 235B
Qwen3 is the best open model family for budget local AI. Dense models from 0.6B to 32B, MoE models that punch above their weight, and a /think toggle no one else has.
Feb 16, 2026
Llama 4 Guide: Running Scout and Maverick Locally (2026)
Complete Llama 4 Scout (109B MoE) and Maverick guide for local AI. VRAM, Ollama and vLLM setup, hardware reality, and how it stacks against Qwen 3.6.
Feb 16, 2026
GPT-OSS Guide: OpenAI's First Open Model for Local AI
GPT-OSS 20B is OpenAI's first open-weight model. MoE with 3.6B active params, MXFP4 at 13GB, 128K context, Apache 2.0. Here's how to run it.
Feb 16, 2026
DeepSeek V3.2 Guide: What Changed and How to Run It Locally
DeepSeek V3.2 was the Feb 2026 flagship — V4 now leads. But the R1-Distill models run on a $200 used GPU and remain the local reasoning pick.
Feb 16, 2026
Best Dual-GPU Local AI Setup: RTX 3090, 5060 Ti (2026)
Dual RTX 3090, 2x RTX 5060 Ti, 2x 2080 Ti modded, mixed setups: real configs for Qwen 3.6, MoE, 70B. Tensor vs pipeline parallelism, llama.cpp/vLLM.
Feb 6, 2026
Mixtral 8x7B & 8x22B VRAM Requirements
Mixtral 8x7B and 8x22B VRAM requirements at every quantization level. Exact numbers from Q2 to FP16, GPU recommendations, and KV cache impact explained.
Feb 5, 2026
Are Mistral Models Still Worth Running? Only Nemo 12B (Here's Why)
Mistral Medium 3.5-128B dropped April 29, 2026: dense 128B, 256k context, Modified MIT. Hardware reality, license caveats, which Mistral to actually run.
Feb 3, 2026
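
Several of the blurbs above make the same point: an MoE model computes with only its active parameters per token, but every expert still has to sit in memory. Here is a minimal sketch of that arithmetic, using the parameter counts quoted above. The helper name `weight_memory_gb` is hypothetical, "full precision" is assumed to mean 16 bits per weight, and the estimate covers weights only (no KV cache, activations, or runtime overhead).

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in decimal GB.

    Assumption: every parameter is stored at the same bit width. Real quant
    formats mix widths per tensor, so treat the result as a rough estimate.
    """
    return params_billions * bits_per_weight / 8


# Mixtral 8x7B figures from the listing: 46.7B total, 12.9B active per token.
mixtral_total = weight_memory_gb(46.7, 16)   # what must be resident in memory
mixtral_active = weight_memory_gb(12.9, 16)  # what is read per token

# GLM 5.2 figure from the listing: 753B params, assumed 16-bit weights.
glm_total = weight_memory_gb(753, 16)

print(f"Mixtral 8x7B, all experts: {mixtral_total:.1f} GB")
print(f"Mixtral 8x7B, active only: {mixtral_active:.1f} GB")
print(f"GLM 5.2 at 16-bit: {glm_total / 1000:.2f} TB")
```

Under the 16-bit assumption, the GLM 5.2 estimate comes out at about 1.51 TB, which agrees with the figure in its blurb. The gap between the two Mixtral numbers is why MoE speed tracks the active count while memory requirements track the total.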