Software
How to Fix Slow Qwen 3.6 27B on RTX 3090 (10-80 tok/s)
Qwen 3.6-27B at 12 tok/s on a 3090 when others report 35? The 8-step diagnostic checklist for offload, quants, templates, power limits, and backend choice.
How to Get 2.5x Faster Qwen on RTX 3090 (Free)
I built DFlash on my RTX 3090 and ran the full bench. Real 2.5x speedup on Qwen 3.5 and 3.6 — below the 3.43x README claim, still huge. Here's how.
Best Way to Get 2x Token Output on RTX 3090: Qwen 3.6 + DFlash
Luce DFlash + DDTree pushes Qwen 3.6-27B Q4_K_M from 35 tok/s to 69 tok/s on a single RTX 3090. Real benchmarks, setup, and honest limits.
FP4 Just Landed in llama.cpp: NVFP4 vs MXFP4 Explained (2026)
NVFP4 in llama.cpp, MXFP4 in ik_llama.cpp. The first practical FP4 quantization for the GGUF ecosystem — what works, what doesn't, and what to test.
Home Assistant + Local LLM: Voice Control Your Smart Home Without the Cloud
Set up fully local voice control with Home Assistant, Ollama, Whisper, and Piper. No Alexa, no cloud, no subscriptions. Wyoming protocol pipeline, model picks, and hardware options.
LM Studio vs llama.cpp: Why Your Model Runs Slower in the GUI
LM Studio uses llama.cpp under the hood but often runs 30-50% slower. Bundled runtime lag, UI overhead, and default settings explain the gap. How to benchmark it yourself and when the convenience is worth it.
Best Docker Setup for Local AI: Ollama + Open WebUI (2026)
Five copy-paste Docker compose recipes for Ollama 0.24 + Open WebUI: NVIDIA Blackwell, AMD ROCm, multi-GPU, and CPU. Plus the Apple Silicon catch most guides skip.
WSL2 + Ollama on Windows: Complete Setup Guide (GPU Passthrough Included)
Install Ollama in WSL2 with full GPU acceleration in 20 minutes. GPU passthrough, Open WebUI, Docker Compose, VPN fixes, and the gotchas that will waste your afternoon.
Ollama on Mac: Setup and Optimization Guide (2026)
Install Ollama on Apple Silicon, verify Metal GPU is active, and tune it for your Mac's RAM. Config for M1 through M4 Ultra with model picks per memory tier.
LM Studio vs Ollama on Mac: Which Should You Use?
LM Studio's MLX backend is 20-30% faster and uses half the memory. Ollama is lighter, always-on, and better for APIs. Mac-specific benchmarks and when to use each.
Local LLMs vs ChatGPT: An Honest Comparison
ChatGPT has web search, voice mode, and GPT-5.2. Local LLMs have privacy, no subscriptions, and no rate limits. Here's when each one wins, what the cost math actually looks like, and why most power users run both.
WSL2 for Local AI: The Complete Windows Setup Guide
Install WSL2, configure GPU passthrough, set up Ollama and llama.cpp with CUDA, and optimize memory for LLM inference. Step-by-step for Windows 11.
Best New Ollama 0.17 Features: ollama launch, MLX, and OpenClaw Support
Everything new in Ollama 0.16 through 0.17.7: ollama launch for coding tools, native MLX on Apple Silicon, OpenClaw integration, web search API, and image generation. Updated March 2026.
llama.cpp Just Got a New Home: What the Hugging Face Acquisition Means for Local AI
ggml.ai — the team behind llama.cpp — is joining Hugging Face. Open source stays open, Georgi keeps the wheel. What changed, what didn't, and what to watch.
Run Qwen2.5-VL Vision in LM Studio (Setup)
Get Qwen2.5-VL running in LM Studio in 5 minutes. Covers the mmproj file most people miss, correct download links, and how to analyze images and PDFs locally.
How to Update Models in Ollama — Keep Your Local LLMs Current
Ollama doesn't auto-update models. Run ollama pull model:tag to grab the latest version — only changed layers download. Use ollama show to check what you have, and a simple loop to update everything at once.
Best LLM Speed Trick: ExLlamaV2 vs llama.cpp Benchmarks (50-85% Faster)
Head-to-head speed benchmarks on RTX 3090 and 4090. ExLlamaV2 generates tokens 50-85% faster than llama.cpp on NVIDIA GPUs. Full comparison with setup guides for both.
Best Ways to Manage Multiple Ollama Models: 2026 Workflows
Manage multiple Ollama models in 2026: disk cleanup, switching, tagging. Qwen 3.6, Gemma 4, DeepSeek V4 (cloud-only) — practical workflows.
AnythingLLM Setup Guide: Chat With Your Documents Locally
Upload PDFs, paste URLs, and chat with your files — no coding, no cloud. AnythingLLM connects to Ollama in 5 minutes with point-and-click RAG on 54K+ GitHub stars.
Text Generation WebUI Setup Guide (2026)
Install Oobabooga text-generation-webui, load GGUF/GPTQ/EXL2 models, and configure GPU offloading. Covers the settings most guides skip and common error fixes.
Local LLMs vs Claude: When Each Actually Wins
Qwen 3 32B matches Claude on daily tasks at zero marginal cost. Claude still wins on 200K-token documents and multi-step debugging. Benchmarks, pricing, and when to use each.
Fastest Local LLM Setup: Ollama vs vLLM vs llama.cpp Real Benchmarks
vLLM handles 4x the concurrent load of Ollama on identical hardware. But for single-user local use, Ollama is all you need — except on Qwen 3.6, where the mmproj bug forces you to llama.cpp. Benchmarks, ik_llama.cpp for MoE, and a clean decision framework.
Open WebUI Setup Guide: ChatGPT UI for Local AI
1 Docker command gives you a ChatGPT-like interface for any Ollama model. 120K+ GitHub stars, built-in RAG, voice chat, and multi-model switching—all running locally.
LM Studio Tips & Tricks: Hidden Features
Speculative decoding for 20-50% faster output, MLX that's 21-87% faster on Mac, a built-in OpenAI-compatible API, and the GPU offload settings most users miss.
Ollama vs LM Studio: Speed, Setup, and Verdict
Ollama gives you a CLI with 100+ models and an OpenAI-compatible API. LM Studio gives you a visual GUI with one-click downloads. Most power users run both—here's when to use each.
Run Your First Local LLM in 15 Minutes
Install Ollama, pull a model, and chat with AI offline—all in 15 minutes. Works on any Mac, Windows, or Linux machine with 8GB RAM. No accounts, no API keys, no fees.