LLM Inference
Best Way to Run 31B Models on a Laptop? Treat Them Like Databases
LARQL decompiles transformer weights into a queryable graph called a vindex. The project pitches a new shape for local inference: walk a subgraph, patch facts, stream from disk. Here's what's real, what's claimed, and what's still research.
Flash-MoE: Run a 397B Model on a 48GB Laptop (Here's How)
Flash-MoE streams Qwen3.5-397B from your SSD at 4.4 tok/s using 5.5GB of RAM. Pure C and Metal, no Python. Here's what's real, what's hype, and how to try it.
LocalAgent: A Local-First Agent Runtime That Actually Cares About Safety
Rust CLI for AI agents with deny-by-default permissions, approval workflows, and deterministic replay. Works with LM Studio, Ollama, and llama.cpp.
Multi-GPU Local AI: Run Models Across Multiple GPUs
Dual RTX 3090s give you 48GB of VRAM and run 70B models at 16-21 tok/s, versus roughly 1 tok/s with CPU offloading. Tensor vs. pipeline parallelism, setup guides, and real scaling numbers.
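For context on the multi-GPU piece: the tensor-parallel route it describes is commonly done with a serving engine such as vLLM. Below is a minimal sketch of that approach, not taken from the article itself; the model path is a placeholder for any ~4-bit-quantized 70B checkpoint that fits in 2x24GB of VRAM.

```python
# Hedged sketch: splitting a quantized 70B model across two GPUs with
# tensor parallelism via vLLM. Not the article's exact setup; the model
# path below is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/llama-70b-awq-int4",  # placeholder: any 4-bit 70B checkpoint
    tensor_parallel_size=2,              # shard each layer's weights across both GPUs
)

outputs = llm.generate(
    ["Explain tensor parallelism vs. pipeline parallelism in one paragraph."],
    SamplingParams(temperature=0.7, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```

With tensor parallelism every layer is split across both cards, so both GPUs work on each token; pipeline parallelism instead assigns whole layers to each card, which lowers inter-GPU traffic but leaves one GPU idle at a time for single requests.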