Replace GitHub Copilot With Local LLMs in VS Code — Free, Private, No Subscription
More on this topic: Best Models for Coding Locally · llama.cpp vs Ollama vs vLLM · Local Alternatives to Claude Code · VRAM Requirements
GitHub Copilot costs $10/month for individuals and $19/month for business. Every keystroke, every prompt, every line of code goes to Microsoft’s servers. Hit rate limits during peak hours? That spinning cursor is Copilot throttling you.
Local LLMs flip all of that. Code stays on your machine. No subscription, no rate limits, no internet required. The quality gap has closed. Qwen 2.5 Coder 32B hits 92.9% on HumanEval, matching GPT-4o. The 7B variant scores 88.4% and runs on an 8GB GPU. And Qwen3-Coder-Next — released February 2026 — scores 70.6% on SWE-Bench Verified with only 3B active parameters, putting agentic coding within reach of a single consumer GPU.
Below: the VS Code extensions that work, a step-by-step Continue + Ollama setup, and which models to run at every VRAM tier.
Why replace Copilot
Your code stays on your machine. Copilot sends code context to GitHub’s servers with every request. Proprietary code, client projects, anything under NDA — that’s a liability. Local inference means nothing leaves your hardware.
No recurring cost. $10/month is $120/year. Business tier at $19/month is $228/year. Multiply across a team. Local models cost nothing to run once you own the GPU — and if you’re a developer with a gaming card, you already do.
No rate limits. Microsoft has tightened Copilot limits repeatedly since 2025. Business tier users report slower completions during peak hours. Your local model runs at the same speed at 3am or 3pm.
Works offline. Planes, air-gapped environments, coffee shops with dead WiFi. Copilot goes silent without internet. Local models don’t care.
The cost math
| | Copilot Individual | Copilot Business | Local (8GB GPU) | Local (24GB GPU) |
|---|---|---|---|---|
| Monthly cost | $10 | $19 | $0 | $0 |
| Year 1 cost | $120 | $228 | $0 (own GPU) or ~$200 (used RTX 3060) | $0 (own GPU) or ~$350 (used RTX 3090) |
| Year 2 cost | $240 | $456 | $0 | $0 |
| Year 3 cost | $360 | $684 | $0 | $0 |
| Privacy | Code sent to Microsoft | Code sent to Microsoft | On your machine | On your machine |
| Rate limits | Yes | Yes | No | No |
| Works offline | No | No | Yes | Yes |
A used RTX 3060 12GB runs $170-200. A used RTX 3090 runs $300-400. Either pays for itself in under two years versus Copilot, and you get a GPU that handles gaming, image generation, and LLM inference for everything else too.
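The break-even point is simple division. A quick sketch using the estimates above (`months_to_break_even` is a hypothetical helper, not from any library):

```python
import math

def months_to_break_even(gpu_cost: float, monthly_fee: float) -> int:
    """Months of subscription fees needed to cover a one-time GPU purchase."""
    return math.ceil(gpu_cost / monthly_fee)

# Used RTX 3060 12GB (~$200) vs Copilot Individual ($10/mo)
print(months_to_break_even(200, 10))   # 20 months
# Used RTX 3090 (~$350) vs Copilot Business ($19/mo)
print(months_to_break_even(350, 19))   # 19 months
```

Either way you're under two years, and the GPU keeps its resale value while a subscription is pure sunk cost.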
The honest tradeoff: Copilot is faster. Cloud GPUs outperform consumer cards, and Copilot’s completions arrive in 200-400ms. Local models on a 3090 take 500ms-2s depending on size. For most developers, that’s fast enough. But if you need instant completions on every keystroke, cloud wins on raw latency.
Three tasks, three models
This is the part most guides get wrong. They recommend one model for everything. Autocomplete, chat, and agentic coding are different tasks with different requirements.
| Task | What it does | Best model | Why |
|---|---|---|---|
| Autocomplete (FIM) | Ghost-text tab completions as you type | Qwen 2.5 Coder 1.5B-7B | FIM-trained, fast, made for fill-in-the-middle |
| Chat / reasoning | Explain code, write tests, refactor, answer questions | Qwen 3.5 9B or 27B | Better reasoning, 262K context, multimodal (reads screenshots) |
| Agentic coding | Autonomous file edits, terminal commands, multi-step tasks | Qwen3-Coder-Next (80B/3B active) | 70.6% SWE-Bench, designed for agent workflows |
Autocomplete needs a FIM-trained model. Fill-in-the-middle means the model sees code before and after your cursor, then predicts the gap. Regular chat models can’t do this. Qwen 2.5 Coder was trained specifically for it at every size from 0.5B to 32B.
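Under the hood, a FIM request wraps the code around your cursor in special tokens and asks the model to generate the gap. A minimal sketch, assuming Qwen 2.5 Coder's documented FIM tokens (`<|fim_prefix|>`, `<|fim_suffix|>`, `<|fim_middle|>`) — extensions like Continue assemble this prompt for you:

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a fill-in-the-middle prompt in Qwen 2.5 Coder's token format.

    The model generates the text that belongs between prefix and suffix,
    i.e. everything it emits after <|fim_middle|>.
    """
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

prompt = build_fim_prompt(
    prefix="def is_even(n):\n    return ",
    suffix="\n\nprint(is_even(4))",
)
print(prompt)
```

A chat model given this prompt will just see token soup; a FIM-trained model was optimized to complete exactly this structure.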
Chat needs reasoning and instruction following. When you ask “refactor this function” or “explain this error,” you want Qwen 3.5’s reasoning ability and its 262K context window. Paste a screenshot of a stack trace and Qwen 3.5 reads it — Copilot can’t do that. As of March 2026, Qwen 3.5 9B is the community default for local coding chat — it tops every sub-10B benchmark and fits on 8GB VRAM. The 9B setup guide covers it in detail.
Agentic coding needs architecture-level understanding. Qwen3-Coder-Next was trained on 800,000 real GitHub PRs with executable test environments. It can plan multi-file changes, run tests, and iterate. The 3B active parameters keep it fast; the 80B total give it depth.
Speed matters more for autocomplete. You tolerate a 5-second wait for a chat response. You don’t tolerate it for every tab completion. Use a smaller, faster model for autocomplete (1.5B-7B) and a larger model for chat and agentic work.
The extensions, ranked
1. Continue — best overall
| Detail | Info |
|---|---|
| GitHub Stars | ~31,700 |
| License | Apache 2.0 |
| Ollama Support | Native, first-class |
| Autocomplete | Yes (FIM-based tab completion) |
| Chat | Yes (sidebar + inline) |
| Agent Mode | Yes (file edits, terminal, MCP) |
| Latest Version | v1.3.31 (Feb 2026) |
Continue is the clear winner for local coding in VS Code. Native Ollama integration, separate models for autocomplete and chat, and full agent mode with MCP (Model Context Protocol) support.
Recent updates added MCP server loading from config files with environment variable templating, a PR inbox for reviewing pull requests inside VS Code, and “just-in-time” additional instructions for agent sessions. Continue also migrated from config.json to config.yaml — the new format uses a roles field to assign models to tasks (chat, autocomplete, edit, apply). The YAML configs throughout this guide already use the new format.
The multi-model setup is what makes it work. Run Qwen 2.5 Coder 1.5B for autocomplete (fast, small) and Qwen 3.5 9B for chat (smart, multimodal). Autocomplete stays snappy while complex questions get the big model’s full attention.
Context providers close much of the gap with Copilot’s cloud-powered context: @codebase searches your whole repo via local embeddings, @file references specific files, @terminal pulls recent terminal output.
Agent mode lets the LLM edit files and run terminal commands autonomously. Point it at Qwen3-Coder-Next and you get something closer to Claude Code than traditional autocomplete — running entirely on your hardware.
Who it’s for: Anyone replacing Copilot. It’s the one I’d install first.
2. Tabby — best for teams
| Detail | Info |
|---|---|
| GitHub Stars | ~33,000 |
| License | Apache 2.0 (core) + proprietary EE |
| Ollama Support | Yes (HTTP connector) |
| Autocomplete | Yes (FIM, server-based) |
| Chat | Yes + Answer Engine |
| Self-Hosted | Yes (Rust server) |
Tabby takes a different approach: you run a self-hosted server, and the VS Code extension connects to it. The server handles model inference, codebase indexing, and team management.
That architecture works well for teams. Set up one GPU server and every developer gets completions from it. The Answer Engine indexes your internal docs and codebase for RAG-powered answers. LDAP auth and usage analytics come built in.
Setup cost is higher than Continue. You’re running a Rust server, configuring backends, managing infrastructure. For a solo developer, that’s overkill. For a team of 5-10 developers sharing one beefy GPU server, it pays for itself on day one versus Copilot Business licenses.
Who it’s for: Teams who want shared infrastructure. Solo devs should use Continue.
3. CodeGPT — simplest setup
| Detail | Info |
|---|---|
| VS Code Installs | ~2.29 million |
| License | Freemium (local features free) |
| Ollama Support | Yes, native |
| Autocomplete | Yes |
| Chat | Yes (slash commands) |
CodeGPT has the largest install base of any AI coding extension outside Copilot. Install the extension, select Ollama, pick a model, and it works.
Built-in slash commands (/Fix, /Document, /Refactor, /Unit Testing) are convenient. The downside: you can’t separately assign models for autocomplete versus chat, and the advanced context features aren’t as developed as Continue’s. Some features push toward their cloud platform.
Who it’s for: Developers who want the simplest possible setup and don’t need advanced configuration.
4. Avante.nvim — for Neovim users
| Detail | Info |
|---|---|
| GitHub Stars | ~17,500 |
| License | Apache 2.0 |
| Ollama Support | Yes (custom provider config) |
| Chat | Yes (sidebar panel) |
| Agent Mode | Yes (apply suggestions, edit in place) |
Avante.nvim brings Cursor-style AI chat to Neovim. Select code, open the panel, describe what you want, get a diff you can accept or reject.
Avante doesn’t do tab autocomplete — pair it with cmp-ai pointed at Ollama for that. What it handles well is the “refactor this,” “explain this,” “write a test” workflow.
Who it’s for: Neovim users who want AI-assisted editing without leaving the terminal.
5. Twinny — lightweight and focused
| Detail | Info |
|---|---|
| GitHub Stars | ~5,000 |
| License | MIT |
| Ollama Support | Native |
| Autocomplete | Yes (FIM) |
| Chat | Yes (sidebar) |
| Agent Mode | No |
Twinny is the “just autocomplete and chat, nothing else” option. Install it, point it at Ollama, and it works. No YAML config migration, no MCP framework, no PR inbox. That’s the appeal — it does two things and does them without fuss.
It connects to Ollama, llama.cpp, LM Studio, or any OpenAI-compatible endpoint. There’s also a peer-to-peer inference network called Symmetry that lets you share GPU resources across machines, though most users stick with local Ollama.
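"OpenAI-compatible" means the backend accepts the standard `/v1/chat/completions` request shape, so any of those servers is interchangeable. A hedged sketch of the request body (the live call is commented out; it assumes Ollama's default port and the models from this guide):

```python
import json
import urllib.request

def chat_payload(model: str, prompt: str) -> dict:
    """Minimal OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

payload = chat_payload("qwen2.5-coder:7b-instruct-q4_K_M",
                       "Explain list comprehensions in one sentence.")
print(json.dumps(payload, indent=2))

# Send it to any OpenAI-compatible server, e.g. Ollama's endpoint:
# req = urllib.request.Request(
#     "http://localhost:11434/v1/chat/completions",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# body = json.loads(urllib.request.urlopen(req).read())
# print(body["choices"][0]["message"]["content"])
```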
Who it’s for: Developers who find Continue too heavy and just want FIM completions plus a chat sidebar.
6. llama.vscode — from the llama.cpp team
| Detail | Info |
|---|---|
| License | MIT |
| Ollama Support | No (uses llama.cpp server directly) |
| Autocomplete | Yes (FIM) |
| Chat | Yes |
| Agent Mode | Yes |
| Latest Version | v0.0.42 (Jan 2026) |
llama.vscode is built by the ggml/llama.cpp team. It skips Ollama entirely and connects straight to a llama.cpp server, which means one less layer and slightly lower overhead on resource-limited machines.
Designed for consumer hardware with no internet required. If you’re already running llama.cpp for other things and don’t want the Ollama wrapper, this is the native option.
Who it’s for: llama.cpp users who prefer to skip the Ollama abstraction layer.
7. Kilo Code — agentic alternative to Continue
| Detail | Info |
|---|---|
| License | Apache 2.0 |
| Ollama Support | Yes |
| Autocomplete | No |
| Chat | Yes |
| Agent Mode | Yes (multi-mode: Architect, Coder, Debugger) |
Kilo Code is a newer agentic coding extension with multiple built-in modes — Architect for planning, Coder for implementation, Debugger for troubleshooting. It reads file structure, runs terminal commands, and can launch a browser for UI debugging.
It connects to Ollama and LM Studio for local models. Unlike Continue, it doesn’t do tab autocomplete — it’s purely an agent that reasons about your codebase and makes changes. Pair it with Twinny or llama.vscode for FIM completions.
Who it’s for: Developers who want agentic coding specifically and find Continue’s all-in-one approach too broad.
8. Cody — dead for individuals
Sourcegraph discontinued Cody’s Free and Pro plans in July 2025. New individual signups are blocked. Its Ollama support was experimental and limited — autocomplete context was restricted to the current file only. Skip this unless you’re on an enterprise Sourcegraph contract.
Quick comparison
| Extension | Best For | Ollama | Tab Complete | Chat | Agent | Status |
|---|---|---|---|---|---|---|
| Continue | Solo devs, all-around | Native | Yes (FIM) | Yes | Yes | Active |
| Tabby | Teams, self-hosted | Yes | Yes (FIM) | Yes | No | Active |
| CodeGPT | Simplest setup | Native | Yes | Yes | No | Active |
| Twinny | Lightweight, minimal config | Native | Yes (FIM) | Yes | No | Active |
| llama.vscode | llama.cpp users, no Ollama | Direct llama.cpp | Yes (FIM) | Yes | Yes | Active |
| Kilo Code | Agentic coding | Yes | No | Yes | Yes | Active |
| Avante.nvim | Neovim users | Custom provider | No | Yes | Yes | Active |
| Cody | Enterprise only | Was experimental | — | — | — | Dead for individuals |
Setup: Continue + Ollama (10 minutes)
This is the setup I recommend for most developers. Runs on 8GB VRAM, gives you tab completion and chat, and takes about 10 minutes.
Prerequisites
- VS Code installed
- A GPU with at least 6GB VRAM (RTX 3060, 4060, or equivalent)
- Ollama 0.17.4+ (required for Qwen 3.5 — check with `ollama --version`)
- ~5-10GB free disk space for models
Step 1: Install Ollama
```bash
# Linux
curl -fsSL https://ollama.com/install.sh | sh

# macOS
brew install ollama

# Windows — download from https://ollama.com/download
```
Verify it’s running:
```bash
ollama --version
```
If you hit issues, check our Ollama troubleshooting guide.
Step 2: Pull your models
You need two models: a FIM model for tab autocomplete, and a chat model for conversations.
```bash
# Autocomplete model (FIM-trained, fast)
ollama pull qwen2.5-coder:7b-instruct-q4_K_M  # ~4.5GB, best quality/speed balance

# Chat model
ollama run qwen3.5:9b  # ~5GB, multimodal, 262K context
```
On 8GB VRAM, Ollama swaps between models — a few seconds of delay when switching from tab-complete to chat. On 12GB+, both fit simultaneously.
Step 3: Install Continue in VS Code
- Open VS Code
- Extensions (Ctrl+Shift+X)
- Search “Continue”
- Install “Continue - Codestral, Claude, and more” by Continue
- The Continue sidebar panel appears on the left
Step 4: Configure Continue
Open ~/.continue/config.yaml (Continue sidebar → gear icon):
```yaml
models:
  - name: Qwen 3.5 9B
    provider: ollama
    model: qwen3.5:9b
    roles:
      - chat
      - edit
  - name: Qwen 2.5 Coder 7B
    provider: ollama
    model: qwen2.5-coder:7b-instruct-q4_K_M
    roles:
      - autocomplete
context:
  - provider: codebase
  - provider: file
  - provider: terminal
```
This gives you Qwen 3.5 9B for chat (reasoning, multimodal, 262K context) and Qwen 2.5 Coder 7B for FIM autocomplete.
Step 5: Test it
Open any code file. Start typing a function and pause — ghost text suggestions appear. Press Tab to accept.
Open the Continue sidebar (Ctrl+L) and ask: “Explain this file” or “Write a test for this function.”
What to expect
- Tab completions: 100-300ms on a modern GPU. Nearly instant for the 1.5B model.
- Chat responses: 2-5 seconds for first tokens from the 9B, then streaming at 30-40 tok/s on a 3060 12GB.
- Quality: Handles single-function completions, docstrings, simple refactors well. Struggles with complex multi-file reasoning — that’s what agentic mode is for.
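If nothing happens at all, confirm Ollama is actually serving the models Continue expects before digging into extension settings. Ollama's `/api/tags` endpoint lists installed models; this sketch checks the response offline (the live call is commented out and assumes Ollama's default port; the `REQUIRED` names are this guide's examples):

```python
import json
import urllib.request

REQUIRED = {"qwen2.5-coder:7b-instruct-q4_K_M", "qwen3.5:9b"}

def missing_models(tags_response: dict, required: set) -> set:
    """Return the required model names absent from Ollama's /api/tags payload."""
    installed = {m["name"] for m in tags_response.get("models", [])}
    return required - installed

# Live check:
# resp = json.loads(urllib.request.urlopen("http://localhost:11434/api/tags").read())
# print(missing_models(resp, REQUIRED) or "all models installed")

# Offline example with a sample payload:
sample = {"models": [{"name": "qwen3.5:9b"}]}
print(missing_models(sample, REQUIRED))
```

Any name this prints is a model you still need to `ollama pull`, or a typo in your config.yaml.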
Agentic coding: Qwen3-Coder-Next
Autocomplete and chat are solved problems at this point. The interesting part now is agentic coding — the model reads your codebase, plans changes across multiple files, runs tests, and iterates until the task passes.
What Qwen3-Coder-Next is
Released February 2026, Qwen3-Coder-Next is an 80B-parameter MoE model with only 3B active per token. That means it runs on 24GB VRAM while accessing far more knowledge than a traditional 3B model.
| Spec | Value |
|---|---|
| Total parameters | 80B |
| Active per token | 3B |
| Context window | 262K tokens |
| SWE-Bench Verified | 70.6% (via SWE-Agent) |
| Aider benchmark | 66.2% |
| VRAM (Q4) | ~20GB |
| Training data | 800,000 real GitHub PRs with executable tests |
For comparison, GPT-4o scores ~64% on SWE-Bench Verified. Qwen3-Coder-Next beats it with 3B active parameters running on your desk.
Setup in Continue
```bash
# Pull the model (~20GB download)
ollama pull qwen3-coder-next
```
Add it to your Continue config as an agent model:
```yaml
models:
  - name: Qwen3-Coder-Next
    provider: ollama
    model: qwen3-coder-next
    roles:
      - chat
      - edit
  - name: Qwen 3.5 9B
    provider: ollama
    model: qwen3.5:9b
    roles:
      - chat
  - name: Qwen 2.5 Coder 7B
    provider: ollama
    model: qwen2.5-coder:7b-instruct-q4_K_M
    roles:
      - autocomplete
context:
  - provider: codebase
  - provider: file
  - provider: terminal
```
Now you can switch between Qwen 3.5 9B for quick chat and Qwen3-Coder-Next for agentic tasks that need multi-file reasoning. Continue’s agent mode lets it edit files, run terminal commands, and iterate on test failures.
When to use agentic vs chat
Use chat (Qwen 3.5 9B) for: explaining code, writing a single function, generating docs, answering questions. Fast, fits on 8GB.
Use agentic (Qwen3-Coder-Next) for: implementing a feature across multiple files, refactoring an interface and updating all callers, debugging a test failure that spans several modules. Needs 24GB, slower, but handles the work that 9B models can’t.
OmniCoder-9B: agentic coding on 8GB
If 24GB for Qwen3-Coder-Next is out of reach, look at OmniCoder-9B. It’s a Qwen 3.5 9B fine-tune trained on 425,000+ agentic coding trajectories — the kind of multi-step scaffolding patterns used by Claude Code and similar tools. It scores 23.6% on Terminal-Bench 2.0, a 61% improvement over the base Qwen 3.5 9B. That’s not Qwen3-Coder-Next territory, but it fits on 8GB VRAM and actually tries to recover from errors, use edit diffs instead of full rewrites, and respond to LSP diagnostics. Available on Ollama (ollama pull carstenuhlig/omnicoder-9b) and as GGUF on Hugging Face. Apache 2.0 license.
A note on Qwen 3.5 35B-A3B
You might see people recommending the Qwen 3.5 35B-A3B (also 3B active) for coding. It’s fast — 60-100+ tok/s. But the 3B active parameters are spread across a general-purpose model, not a code-specialized one. For agentic coding tasks requiring architectural reasoning, the dense Qwen 3.5 27B or the code-specialized Qwen3-Coder-Next both perform better. The 35B-A3B is good for fast chat, not deep coding.
24GB setup: the full stack
If you have an RTX 3090 or 4090, here’s the three-model config that covers everything:
```yaml
models:
  - name: Qwen3-Coder-Next (Agent)
    provider: ollama
    model: qwen3-coder-next
    roles:
      - chat
      - edit
  - name: Qwen 3.5 9B (Quick Chat)
    provider: ollama
    model: qwen3.5:9b
    roles:
      - chat
  - name: Qwen 2.5 Coder 7B (Autocomplete)
    provider: ollama
    model: qwen2.5-coder:7b-instruct-q4_K_M
    roles:
      - autocomplete
context:
  - provider: codebase
  - provider: file
  - provider: terminal
  - provider: git
```
Ollama swaps models as you use them. Autocomplete stays on the 7B (fast, always responsive). Switch to Qwen 3.5 9B for quick questions, Qwen3-Coder-Next for heavy lifting. The swap takes a few seconds but you only pay it when switching tasks.
What works well locally
Tab completion is where local shines brightest. FIM-trained models like Qwen 2.5 Coder produce fast, accurate completions for single-line and short multi-line suggestions. Closest experience to Copilot.
Docstrings, comments, and simple refactors work well too. The code is right there in context. Rename variables, extract functions, convert a loop to a comprehension, add type hints. Single-file, well-scoped changes are reliable with 7B+ models.
Explaining unfamiliar code is solid. Paste it in, ask what it does. The full context is in the prompt, so the model doesn’t need to reason across your whole project.
Screenshot reading is where local actually beats Copilot. Qwen 3.5 is natively multimodal. Paste a screenshot of a stack trace or error dialog and the model reads it. Copilot can’t do this at all.
What still struggles locally
Large codebase context is the biggest gap. Copilot’s cloud infrastructure processes context across your workspace. Local models are limited by their context window and VRAM. Continue’s @codebase provider does RAG over your repo, but it’s not the same as having the full project in context.
Multi-file edits with small models are hit or miss. Ask a 7B to refactor an interface and update all implementations, and you’ll get inconsistent results. Qwen3-Coder-Next handles this better, but it needs 24GB.
Speed on complex prompts goes to cloud. A 32B model takes 10-15 seconds for a full response. Copilot returns similar quality in 2-3 seconds. For rapid-fire Q&A, the latency adds up.
Uncommon languages are thin. Qwen 2.5 Coder is strong in Python, TypeScript, Java, C++, Go, and Rust. For Elixir, Haskell, or very new frameworks, quality drops noticeably.
Model recommendations by VRAM
| VRAM | Autocomplete | Chat | Agentic | What you get |
|---|---|---|---|---|
| 6-8GB | Qwen 2.5 Coder 1.5B (Q4) | Qwen 3.5 9B (Q4, ~5GB) | OmniCoder-9B (Q4, ~5GB) | FIM autocomplete + multimodal chat + basic agentic. Models swap. |
| 12GB | Qwen 2.5 Coder 7B (Q4) | Qwen 3.5 9B (Q4) | — | Both fit in VRAM simultaneously. Strong combo. |
| 16GB | Qwen 2.5 Coder 7B (Q4) | Qwen 3.5 9B (Q6-Q8) | — | Higher quality chat. Room for long context. |
| 24GB | Qwen 2.5 Coder 7B (Q4) | Qwen 3.5 9B (Q4) | Qwen3-Coder-Next (Q4, ~20GB) | Full stack. Swap between chat and agent. |
| 24GB (alt) | Qwen 2.5 Coder 7B (Q4) | Qwen 2.5 Coder 32B (Q4) | — | 92.9% HumanEval. Better for pure code, no vision. |
Pull commands by tier
8GB VRAM:
```bash
ollama pull qwen2.5-coder:1.5b-instruct-q4_K_M  # autocomplete
ollama run qwen3.5:9b                           # chat (swaps with autocomplete)
```
12-16GB VRAM:
```bash
ollama pull qwen2.5-coder:7b-instruct-q4_K_M  # autocomplete
ollama run qwen3.5:9b                         # chat (both fit simultaneously on 12GB+)
```
24GB VRAM:
```bash
ollama pull qwen2.5-coder:7b-instruct-q4_K_M  # autocomplete
ollama run qwen3.5:9b                         # quick chat
ollama pull qwen3-coder-next                  # agentic coding (swaps with chat)
```
Performance: local vs Copilot
| Metric | Copilot (Cloud) | Local 8GB | Local 24GB |
|---|---|---|---|
| Tab completion latency | 200-400ms | 300-600ms (Coder 1.5B) | 300-500ms (Coder 7B) |
| Chat first-token | 1-2s | 2-4s (9B) | 2-4s (9B) |
| Chat throughput | 50-80 tok/s | 30-40 tok/s | 30-40 tok/s |
| Agentic (SWE-Bench) | ~64% (GPT-4o) | — | 70.6% (Qwen3-Coder-Next) |
| Multi-file context | Full workspace | RAG-based | RAG-based |
| Multimodal input | No | Yes (Qwen 3.5) | Yes (Qwen 3.5) |
| Monthly cost | $10-19 | $0 | $0 |
| Code privacy | Sent to Microsoft | On your machine | On your machine |
| Works offline | No | Yes | Yes |
| Rate limits | Yes (tightening) | No | No |
Local wins on privacy, cost, multimodal, and agentic benchmarks. Copilot wins on latency and multi-file context. Most developers who switch keep both for a week, then drop Copilot once they realize they’re not missing much.
Troubleshooting
Autocomplete not showing up? Make sure Ollama is running (ollama serve or check the system tray). Verify the model name in config.yaml matches what you pulled. Open the Continue output panel (View → Output → Continue) for error messages.
Completions are slow? Run ollama ps to check memory usage. If the model doesn’t fully fit in VRAM, inference drops to CPU speed. Use a smaller model or more aggressive quantization (Q3_K_M).
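A rough rule of thumb for whether a model fits: weights take about parameters × quantization bits ÷ 8, plus overhead for KV cache and activations. A back-of-the-envelope sketch — the 1.2 overhead factor and the ~4.5 effective bits for Q4_K_M are assumptions, not measured constants:

```python
def est_vram_gb(params_billions: float, quant_bits: float,
                overhead: float = 1.2) -> float:
    """Rough VRAM estimate in GB: quantized weights plus ~20% overhead."""
    weight_gb = params_billions * quant_bits / 8
    return round(weight_gb * overhead, 1)

print(est_vram_gb(7, 4.5))    # ~4.7 GB: Qwen 2.5 Coder 7B at Q4_K_M fits on 8GB
print(est_vram_gb(32, 4.5))   # ~21.6 GB: a 32B at Q4 needs a 24GB-class card
```

If the estimate lands near your card's total VRAM, expect partial CPU offload and a big slowdown; drop a quantization level or pick a smaller model.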
Chat works but autocomplete doesn’t? Your chat model doesn’t support FIM. Make sure tabAutocompleteModel is set to a Qwen 2.5 Coder variant, not a general chat model.
Models swapping constantly? If autocomplete and chat together exceed your VRAM, Ollama swaps between them. Set OLLAMA_KEEP_ALIVE=10m to keep the active model loaded longer, or use smaller models that both fit.
Qwen 3.5 won’t load? You need Ollama 0.17.4+. Run ollama --version to check. Update: curl -fsSL https://ollama.com/install.sh | sh. The Gated DeltaNet architecture in Qwen 3.5 requires the newer runtime.
Qwen 3.5 giving repetitive output? Fixed in Ollama 0.17.5. Update and redownload: ollama rm qwen3.5:9b && ollama run qwen3.5:9b.
The bottom line
Continue + Ollama is still the best free Copilot replacement as of March 2026. If Continue feels too heavy, Twinny does FIM + chat with less config. If you run llama.cpp directly, llama.vscode skips the Ollama layer. The recommended stack:
- Autocomplete: Qwen 2.5 Coder 7B (FIM-trained, fast)
- Chat: Qwen 3.5 9B (multimodal, 262K context, good reasoning)
- Agentic coding: Qwen3-Coder-Next on 24GB (70.6% SWE-Bench, multi-file edits)
You’ll trade some latency for complete privacy and zero recurring cost. For most developers — especially those working on proprietary code or tired of another subscription — that trade is easy to make.
```bash
# The 5-minute setup
curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen2.5-coder:7b-instruct-q4_K_M
ollama run qwen3.5:9b
# Install Continue in VS Code, paste the config from above, code.
```
Next steps:
- Best coding models in detail — deeper benchmarks and comparisons
- llama.cpp vs Ollama vs vLLM — which inference engine for what
- VRAM requirements by model — what fits on your GPU
- Local alternatives to Claude Code — terminal-based agentic coding
- Ollama troubleshooting — fix common setup issues