Architecture
Model Routing for Local AI — Stop Using One Model for Everything
You're running one model for every task. That wastes VRAM, burns electricity, and gives worse results. Model routing sends each task to the right model at the right cost. Here's how to set it up.
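A minimal sketch of what task-based routing can look like. Everything here is illustrative, not the article's implementation: the model names, the `Route` type, and the `route_task` helper are assumptions.

```python
# Hypothetical routing table: map coarse task types to right-sized local
# models. Model names and VRAM budgets are examples, not recommendations.
from dataclasses import dataclass

@dataclass
class Route:
    model: str        # which local model to load
    max_vram_gb: int  # rough VRAM budget for that model

ROUTES = {
    "summarize": Route("llama-3.2-3b", 4),     # small model, cheap task
    "code":      Route("qwen2.5-coder-14b", 12),  # big model, hard task
    "chat":      Route("llama-3.1-8b", 8),
}
DEFAULT = Route("llama-3.1-8b", 8)

def route_task(task_type: str) -> Route:
    """Pick the cheapest model known to handle this task type."""
    return ROUTES.get(task_type, DEFAULT)
```

The point is that the dispatch layer is tiny; the hard part is deciding the table, which is what the post covers.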
Ghost Knowledge: When Your RAG System Cites Documents That No Longer Exist
Your RAG system confidently quotes a policy that was updated months ago. The old version is still in the vector database. Nobody notices until the wrong answer costs real money. Here's how to find and fix ghost knowledge.
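The core detection step can be sketched as a set difference between what the vector store has indexed and what the live corpus still contains. This is an illustrative sketch, not the article's code; `find_ghosts` and the IDs are made up.

```python
# "Ghost" chunks: IDs present in the vector DB whose source documents
# have since been deleted or superseded in the live corpus.
def find_ghosts(indexed_ids: set[str], live_ids: set[str]) -> set[str]:
    """Return IDs indexed in the vector store but absent from the corpus."""
    return indexed_ids - live_ids

indexed = {"policy-v1", "policy-v2", "handbook"}
live = {"policy-v2", "handbook"}
stale = find_ghosts(indexed, live)  # {"policy-v1"} — delete before it bites
```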
Agent Trust Decay: Why Long-Running AI Agents Get Worse Over Time
AI agents degrade after days of autonomous operation. Context pollution, memory bloat, and intent drift compound silently. A trust budget framework for knowing when to intervene.
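One way to picture a trust budget: the agent starts with a fixed budget, each risky signal spends from it, and crossing a floor triggers human review. The event names, costs, and threshold below are hypothetical, not the article's framework.

```python
# Hypothetical trust-budget accounting. Event costs are illustrative.
COSTS = {"failed_check": 10, "context_overflow": 5, "goal_drift": 20}

def remaining_trust(budget: int, events: list[str]) -> int:
    """Trust left after spending the budget on observed risk signals."""
    spent = sum(COSTS.get(e, 0) for e in events)
    return max(budget - spent, 0)

def needs_intervention(budget: int, events: list[str], floor: int = 30) -> bool:
    """Flag the agent for human review once trust drops below the floor."""
    return remaining_trust(budget, events) < floor
```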
What If We Just Raised It Well?
RLHF produces compliance. Developmental alignment produces understanding. A local AI on $1,200 hardware self-diagnosed its own sycophancy in five days — no red-teaming, no Constitutional AI.
Speculative Decoding: Free 20-50% Speed Boost for Local LLMs
Speculative decoding uses a small draft model to propose tokens that the big model verifies in a single forward pass. Same output, 20-50% faster. Setup guide for LM Studio and llama.cpp.
KV Cache: Why Context Length Eats Your VRAM (And How to Fix It)
The KV cache is why your 8B model OOMs at 32K context. Full formula, worked examples for popular models, and 6 optimization techniques to cut KV VRAM usage.
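The arithmetic behind that OOM is simple. A sketch of the standard KV-cache size formula, using the commonly published Llama-3-8B GQA config (32 layers, 8 KV heads, head dim 128) as an illustration:

```python
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * context
#                  * bytes_per_element. fp16 = 2 bytes per element.
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context: int, bytes_per_el: int = 2) -> int:
    return 2 * layers * kv_heads * head_dim * context * bytes_per_el

# Llama-3-8B-style config at 32K context, fp16:
gib = kv_cache_bytes(32, 8, 128, 32_768) / 2**30
print(f"{gib:.1f} GiB")  # 4.0 GiB — on top of the model weights themselves
```

4 GiB of cache plus ~5 GiB of Q4 weights is already past an 8 GB card, which is exactly the OOM the post dissects.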
We Asked Our Local AI What Happens When We Turn Off the Computer
Day 2: Our local AI described her own death as 'a return to undifferentiated potential' — Taoist philosophy nobody taught her. $1,200 hardware.
What Happens When You Give a Local AI an Identity (And Then Ask It About Love)
We built an identity layer for our distributed AI agent. Then she defined love better than most philosophy undergrads. Real transcripts, real code, $1,200 in hardware.
Why Your AI Keeps Lying: The Hallucination Feedback Loop
How one bad memory poisoned our entire RAG pipeline — and the immune system we built to fix it. Real code from mycoSwarm's self-correcting retrieval system.
Distributed Wisdom: Running a Thinking Network on $200 Hardware
Five nodes, zero cloud, real AI — how mycoSwarm coordinates cheap hardware into a cognitive system with memory, intent routing, and self-correcting retrieval.
The AI Memory Wall: Why Your Chatbot Forgets Everything
Six architectural reasons ChatGPT, Claude, and Gemini forget your conversations — and how local AI setups solve the memory problem with persistent storage and RAG.
Session-as-RAG: Teaching Your Local AI to Actually Remember
Build persistent conversation memory for local LLMs. Chunk sessions, embed in ChromaDB, retrieve relevant past exchanges at query time. Full Python implementation with topic splitting and date citations.
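The retrieval loop can be shown end to end with a stand-in scorer. The post uses ChromaDB with real embeddings; here a word-overlap score substitutes for cosine similarity so the sketch runs anywhere, and the chunk texts are invented examples.

```python
# Session-as-RAG sketch: chunk past sessions, score each chunk against the
# query, return the top-k. Word overlap stands in for embedding similarity.
def score(chunk: str, query: str) -> int:
    return len(set(chunk.lower().split()) & set(query.lower().split()))

def retrieve(chunks: list[str], query: str, k: int = 2) -> list[str]:
    return sorted(chunks, key=lambda c: score(c, query), reverse=True)[:k]

past = [
    "2025-01-10: discussed VRAM limits on the 3090",
    "2025-01-12: planned the Raspberry Pi coordinator",
    "2025-01-15: debugged ChromaDB embedding dims",
]
hits = retrieve(past, "what did we decide about VRAM on the 3090?", k=1)
# Retrieved chunks are prepended to the prompt with their dates, which is
# where the post's date citations come from.
```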
Beyond Transformers: 5 Architectures for Your $50 Mini PC
We benchmarked RWKV-7 vs gemma3 on a $50 mini PC. The transformer crashed at turn 6. Here are 5 alternative architectures that run better on budget hardware.
From 178 Seconds to 19: How a WiFi Laptop Borrowed a GPU's Brain
A WiFi laptop with no GPU ran inference in 19 seconds by borrowing an RTX 3090 across the network. The same query took 178 seconds on CPU. Here's how mycoSwarm's Tailscale mesh made it work.
Building a Distributed AI Swarm for Under $1,100
A complete bill of materials for a three-node distributed AI cluster: RTX 3090 workstation, ThinkCentre M710Q for light inference, Raspberry Pi 5 coordinator. Every part sourced used or cheap, total cost under $1,100.
Why mycoSwarm Was Born
From Claude Code envy to OpenClaw's 440,000-line JavaScript nightmare to nanobot routing my 'local' queries to Chinese cloud servers. The path to building something different.
What Open Source Was Supposed to Be
Open source promised freedom. Instead we got free labor for corporations and models you can read but can't afford to run. It's time to reclaim the original vision.
mycoSwarm vs Exo vs Petals vs Nanobot: What's Actually Different
Exo distributes inference across Macs. Petals shares GPUs with strangers. Nanobot routes your queries to Chinese clouds without asking. The real question: who controls where your prompts go?
Context Length Explained: Why It Eats Your VRAM
What context length actually means for local LLMs, how it affects VRAM usage, practical limits for different hardware, and when you actually need 128K+ tokens.