Replace GitHub Copilot With Local LLMs in VS Code — Free, Private, No Subscription
More on this topic: Best Models for Coding Locally · llama.cpp vs Ollama vs vLLM · Local Alternatives to Claude Code · VRAM Requirements
GitHub Copilot costs $10/month for individuals and $19/month for business. Every keystroke, every prompt, every line of code goes to Microsoft’s servers. Hit rate limits during peak hours? That spinning cursor is Copilot throttling you.
Local LLMs flip all of that. Code stays on your machine. No subscription, no rate limits, no internet required. The quality gap has closed. Qwen 2.5 Coder 32B hits 92.9% on HumanEval, matching GPT-4o. The 7B variant scores 88.4% and runs on an 8GB GPU. And Qwen3-Coder-Next — released February 2026 — scores 70.6% on SWE-Bench Verified with only 3B active parameters, putting agentic coding within reach of a single consumer GPU.
Below: the VS Code extensions that work, a step-by-step Continue + Ollama setup, and which models to run at every VRAM tier.
Why replace Copilot
Your code stays on your machine. Copilot sends code context to GitHub’s servers with every request. Proprietary code, client projects, anything under NDA — that’s a liability. Local inference means nothing leaves your hardware.
No recurring cost. $10/month is $120/year. Business tier at $19/month is $228/year. Multiply across a team. Local models cost nothing to run once you own the GPU — and if you’re a developer with a gaming card, you already do.
No rate limits. Microsoft has tightened Copilot limits repeatedly since 2025. Business tier users report slower completions during peak hours. Your local model runs at the same speed at 3am or 3pm.
Works offline. Planes, air-gapped environments, coffee shops with dead WiFi. Copilot goes silent without internet. Local models don’t care.
The cost math
| | Copilot Individual | Copilot Business | Local (8GB GPU) | Local (24GB GPU) |
|---|---|---|---|---|
| Monthly cost | $10 | $19 | $0 | $0 |
| Year 1 cost | $120 | $228 | $0 (own GPU) or ~$200 (used RTX 3060) | $0 (own GPU) or ~$350 (used RTX 3090) |
| Year 2 cost | $240 | $456 | $0 | $0 |
| Year 3 cost | $360 | $684 | $0 | $0 |
| Privacy | Code sent to Microsoft | Code sent to Microsoft | On your machine | On your machine |
| Rate limits | Yes | Yes | No | No |
| Works offline | No | No | Yes | Yes |
A used RTX 3060 12GB runs $170-200. A used RTX 3090 runs $300-400. Either pays for itself in under two years versus Copilot, and you get a GPU that handles gaming, image generation, and LLM inference for everything else too.
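The break-even point is simple division. A quick sketch using the estimates above (`months_to_break_even` is a hypothetical helper, not from any library):

```python
import math

def months_to_break_even(gpu_cost: float, monthly_fee: float) -> int:
    """Months of subscription fees needed to cover a one-time GPU purchase."""
    return math.ceil(gpu_cost / monthly_fee)

# Used RTX 3060 12GB (~$200) vs Copilot Individual ($10/mo)
print(months_to_break_even(200, 10))   # 20 months
# Used RTX 3090 (~$350) vs Copilot Business ($19/mo)
print(months_to_break_even(350, 19))   # 19 months
```

Either way you're under two years, and the GPU keeps its resale value while a subscription is pure sunk cost.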
The honest tradeoff: Copilot is faster. Cloud GPUs outperform consumer cards, and Copilot’s completions arrive in 200-400ms. Local models on a 3090 take 500ms-2s depending on size. For most developers, that’s fast enough. But if you need instant completions on every keystroke, cloud wins on raw latency.
Three tasks, three models
This is the part most guides get wrong. They recommend one model for everything. Autocomplete, chat, and agentic coding are different tasks with different requirements.
| Task | What it does | Best model | Why |
|---|---|---|---|
| Autocomplete (FIM) | Ghost-text tab completions as you type | Qwen 2.5 Coder 1.5B-7B | FIM-trained, fast, made for fill-in-the-middle |
| Chat / reasoning | Explain code, write tests, refactor, answer questions | Qwen 3.5 9B or 27B | Better reasoning, 262K context, multimodal (reads screenshots) |
| Agentic coding | Autonomous file edits, terminal commands, multi-step tasks | Qwen3-Coder-Next (80B/3B active) | 70.6% SWE-Bench, designed for agent workflows |
Autocomplete needs a FIM-trained model. Fill-in-the-middle means the model sees code before and after your cursor, then predicts the gap. Regular chat models can’t do this. Qwen 2.5 Coder was trained specifically for it at every size from 0.5B to 32B.
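Under the hood, a FIM request wraps the code around your cursor in special tokens and asks the model to generate the gap. A minimal sketch, assuming Qwen 2.5 Coder's documented FIM tokens (`<|fim_prefix|>`, `<|fim_suffix|>`, `<|fim_middle|>`) — extensions like Continue assemble this prompt for you:

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a fill-in-the-middle prompt in Qwen 2.5 Coder's token format.

    The model generates the text that belongs between prefix and suffix,
    i.e. everything it emits after <|fim_middle|>.
    """
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

prompt = build_fim_prompt(
    prefix="def is_even(n):\n    return ",
    suffix="\n\nprint(is_even(4))",
)
print(prompt)
```

A chat model given this prompt will just see token soup; a FIM-trained model was optimized to complete exactly this structure.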
Chat needs reasoning and instruction following. When you ask “refactor this function” or “explain this error,” you want Qwen 3.5’s reasoning ability and its 262K context window. Paste a screenshot of a stack trace and Qwen 3.5 reads it — Copilot can’t do that. As of March 2026, Qwen 3.5 9B is the community default for local coding chat — it tops every sub-10B benchmark and fits on 8GB VRAM. The 9B setup guide covers it in detail.
Agentic coding needs architecture-level understanding. Qwen3-Coder-Next was trained on 800,000 real GitHub PRs with executable test environments. It can plan multi-file changes, run tests, and iterate. The 3B active parameters keep it fast; the 80B total give it depth.
Speed matters more for autocomplete. You tolerate a 5-second wait for a chat response. You don’t tolerate it for every tab completion. Use a smaller, faster model for autocomplete (1.5B-7B) and a larger model for chat and agentic work.
The extensions, ranked
1. Continue — best overall
| Detail | Info |
|---|---|
| GitHub Stars | ~31,700 |
| License | Apache 2.0 |
| Ollama Support | Native, first-class |
| Autocomplete | Yes (FIM-based tab completion) |
| Chat | Yes (sidebar + inline) |
| Agent Mode | Yes (file edits, terminal, MCP) |
| Latest Version | v1.3.31 (Feb 2026) |
Continue is the clear winner for local coding in VS Code. Native Ollama integration, separate models for autocomplete and chat, and full agent mode with MCP (Model Context Protocol) support.
Recent updates added MCP server loading from config files with environment variable templating, a PR inbox for reviewing pull requests inside VS Code, and “just-in-time” additional instructions for agent sessions. Continue also migrated from config.json to config.yaml — the new format uses a roles field to assign models to tasks (chat, autocomplete, edit, apply). The YAML configs throughout this guide already use the new format.
The multi-model setup is what makes it work. Run Qwen 2.5 Coder 1.5B for autocomplete (fast, small) and Qwen 3.5 9B for chat (smart, multimodal). Autocomplete stays snappy while complex questions get the big model’s full attention.
Context providers close much of the gap with Copilot’s cloud-powered context: @codebase searches your whole repo via local embeddings, @file references specific files, @terminal pulls recent terminal output.
Agent mode lets the LLM edit files and run terminal commands autonomously. Point it at Qwen3-Coder-Next and you get something closer to Claude Code than traditional autocomplete — running entirely on your hardware.
Who it’s for: Anyone replacing Copilot. It’s the one I’d install first.
2. Tabby — best for teams
| Detail | Info |
|---|---|
| GitHub Stars | ~33,000 |
| License | Apache 2.0 (core) + proprietary EE |
| Ollama Support | Yes (HTTP connector) |
| Autocomplete | Yes (FIM, server-based) |
| Chat | Yes + Answer Engine |
| Self-Hosted | Yes (Rust server) |
Tabby takes a different approach: you run a self-hosted server, and the VS Code extension connects to it. The server handles model inference, codebase indexing, and team management.
That architecture works well for teams. Set up one GPU server and every developer gets completions from it. The Answer Engine indexes your internal docs and codebase for RAG-powered answers. LDAP auth and usage analytics come built in.
Setup cost is higher than Continue. You’re running a Rust server, configuring backends, managing infrastructure. For a solo developer, that’s overkill. For a team of 5-10 developers sharing one beefy GPU server, it pays for itself on day one versus Copilot Business licenses.
Who it’s for: Teams who want shared infrastructure. Solo devs should use Continue.
3. CodeGPT — simplest setup
| Detail | Info |
|---|---|
| VS Code Installs | ~2.29 million |
| License | Freemium (local features free) |
| Ollama Support | Yes, native |
| Autocomplete | Yes |
| Chat | Yes (slash commands) |
CodeGPT has the largest install base of any AI coding extension outside Copilot. Install the extension, select Ollama, pick a model, and it works.
Built-in slash commands (/Fix, /Document, /Refactor, /Unit Testing) are convenient. The downside: you can’t separately assign models for autocomplete versus chat, and the advanced context features aren’t as developed as Continue’s. Some features push toward their cloud platform.
Who it’s for: Developers who want the simplest possible setup and don’t need advanced configuration.
4. Avante.nvim — for Neovim users
| Detail | Info |
|---|---|
| GitHub Stars | ~17,500 |
| License | Apache 2.0 |
| Ollama Support | Yes (custom provider config) |
| Chat | Yes (sidebar panel) |
| Agent Mode | Yes (apply suggestions, edit in place) |
Avante.nvim brings Cursor-style AI chat to Neovim. Select code, open the panel, describe what you want, get a diff you can accept or reject.
Avante doesn’t do tab autocomplete — pair it with cmp-ai pointed at Ollama for that. What it handles well is the “refactor this,” “explain this,” “write a test” workflow.
Who it’s for: Neovim users who want AI-assisted editing without leaving the terminal.
5. Twinny — lightweight and focused
| Detail | Info |
|---|---|
| GitHub Stars | ~5,000 |
| License | MIT |
| Ollama Support | Native |
| Autocomplete | Yes (FIM) |
| Chat | Yes (sidebar) |
| Agent Mode | No |
Twinny is the “just autocomplete and chat, nothing else” option. Install it, point it at Ollama, and it works. No YAML config migration, no MCP framework, no PR inbox. That’s the appeal — it does two things and does them without fuss.
It connects to Ollama, llama.cpp, LM Studio, or any OpenAI-compatible endpoint. There’s also a peer-to-peer inference network called Symmetry that lets you share GPU resources across machines, though most users stick with local Ollama.
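"OpenAI-compatible" means the backend accepts the standard `/v1/chat/completions` request shape, so any of those servers is interchangeable. A hedged sketch of the request body (the live call is commented out; it assumes Ollama's default port and the models from this guide):

```python
import json
import urllib.request

def chat_payload(model: str, prompt: str) -> dict:
    """Minimal OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

payload = chat_payload("qwen2.5-coder:7b-instruct-q4_K_M",
                       "Explain list comprehensions in one sentence.")
print(json.dumps(payload, indent=2))

# Send it to any OpenAI-compatible server, e.g. Ollama's endpoint:
# req = urllib.request.Request(
#     "http://localhost:11434/v1/chat/completions",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# body = json.loads(urllib.request.urlopen(req).read())
# print(body["choices"][0]["message"]["content"])
```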
Who it’s for: Developers who find Continue too heavy and just want FIM completions plus a chat sidebar.
6. llama.vscode — from the llama.cpp team
| Detail | Info |
|---|---|
| License | MIT |
| Ollama Support | No (uses llama.cpp server directly) |
| Autocomplete | Yes (FIM) |
| Chat | Yes |
| Agent Mode | Yes |
| Latest Version | v0.0.42 (Jan 2026) |
llama.vscode is built by the ggml/llama.cpp team. It skips Ollama entirely and connects straight to a llama.cpp server, which means one less layer and slightly lower overhead on resource-limited machines.
Designed for consumer hardware with no internet required. If you’re already running llama.cpp for other things and don’t want the Ollama wrapper, this is the native option.
Who it’s for: llama.cpp users who prefer to skip the Ollama abstraction layer.
7. Kilo Code — agentic alternative to Continue
| Detail | Info |
|---|---|
| License | Apache 2.0 |
| Ollama Support | Yes |
| Autocomplete | No |
| Chat | Yes |
| Agent Mode | Yes (multi-mode: Architect, Coder, Debugger) |
Kilo Code is a newer agentic coding extension with multiple built-in modes — Architect for planning, Coder for implementation, Debugger for troubleshooting. It reads file structure, runs terminal commands, and can launch a browser for UI debugging.
It connects to Ollama and LM Studio for local models. Unlike Continue, it doesn’t do tab autocomplete — it’s purely an agent that reasons about your codebase and makes changes. Pair it with Twinny or llama.vscode for FIM completions.
Who it’s for: Developers who want agentic coding specifically and find Continue’s all-in-one approach too broad.
8. Cody — dead for individuals
Sourcegraph discontinued Cody’s Free and Pro plans in July 2025. New individual signups are blocked. Its Ollama support was experimental and limited — autocomplete context was restricted to the current file only. Skip this unless you’re on an enterprise Sourcegraph contract.
Quick comparison
| Extension | Best For | Ollama | Tab Complete | Chat | Agent | Status |
|---|---|---|---|---|---|---|
| Continue | Solo devs, all-around | Native | Yes (FIM) | Yes | Yes | Active |
| Tabby | Teams, self-hosted | Yes | Yes (FIM) | Yes | No | Active |
| CodeGPT | Simplest setup | Native | Yes | Yes | No | Active |
| Twinny | Lightweight, minimal config | Native | Yes (FIM) | Yes | No | Active |
| llama.vscode | llama.cpp users, no Ollama | Direct llama.cpp | Yes (FIM) | Yes | Yes | Active |
| Kilo Code | Agentic coding | Yes | No | Yes | Yes | Active |
| Avante.nvim | Neovim users | Custom provider | No | Yes | Yes | Active |
| Cody | Enterprise only | Was experimental | — | — | — | Dead for individuals |
Setup: Continue + Ollama (10 minutes)
This is the setup I recommend for most developers. Runs on 8GB VRAM, gives you tab completion and chat, and takes about 10 minutes.
Prerequisites
- VS Code installed
- A GPU with at least 6GB VRAM (RTX 3060, 4060, or equivalent)
- Ollama 0.17.4+ (required for Qwen 3.5 — check with `ollama --version`)
- ~5-10GB free disk space for models
Step 1: Install Ollama
```bash
# Linux
curl -fsSL https://ollama.com/install.sh | sh

# macOS
brew install ollama

# Windows — download from https://ollama.com/download
```
Verify it’s running:
```bash
ollama --version
```
If you hit issues, check our Ollama troubleshooting guide.
Step 2: Pull your models
You need two models: a FIM model for tab autocomplete, and a chat model for conversations.
```bash
# Autocomplete model (FIM-trained, fast)
ollama pull qwen2.5-coder:7b-instruct-q4_K_M  # ~4.5GB, best quality/speed balance

# Chat model
ollama run qwen3.5:9b  # ~5GB, multimodal, 262K context
```
On 8GB VRAM, Ollama swaps between models — a few seconds of delay when switching from tab-complete to chat. On 12GB+, both fit simultaneously.
Step 3: Install Continue in VS Code
- Open VS Code
- Extensions (Ctrl+Shift+X)
- Search “Continue”
- Install “Continue - Codestral, Claude, and more” by Continue
- The Continue sidebar panel appears on the left
Step 4: Configure Continue
Open ~/.continue/config.yaml (Continue sidebar → gear icon):
```yaml
models:
  - name: Qwen 3.5 9B
    provider: ollama
    model: qwen3.5:9b
    roles:
      - chat
      - edit
  - name: Qwen 2.5 Coder 7B
    provider: ollama
    model: qwen2.5-coder:7b-instruct-q4_K_M
    roles:
      - autocomplete
context:
  - provider: codebase
  - provider: file
  - provider: terminal
```
This gives you Qwen 3.5 9B for chat (reasoning, multimodal, 262K context) and Qwen 2.5 Coder 7B for FIM autocomplete.
Step 5: Test it
Open any code file. Start typing a function and pause — ghost text suggestions appear. Press Tab to accept.
Open the Continue sidebar (Ctrl+L) and ask: “Explain this file” or “Write a test for this function.”
What to expect
- Tab completions: 100-300ms on a modern GPU. Nearly instant for the 1.5B model.
- Chat responses: 2-5 seconds for first tokens from the 9B, then streaming at 30-40 tok/s on a 3060 12GB.
- Quality: Handles single-function completions, docstrings, simple refactors well. Struggles with complex multi-file reasoning — that’s what agentic mode is for.
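If nothing happens at all, confirm Ollama is actually serving the models Continue expects before digging into extension settings. Ollama's `/api/tags` endpoint lists installed models; this sketch checks the response offline (the live call is commented out and assumes Ollama's default port; the `REQUIRED` names are this guide's examples):

```python
import json
import urllib.request

REQUIRED = {"qwen2.5-coder:7b-instruct-q4_K_M", "qwen3.5:9b"}

def missing_models(tags_response: dict, required: set) -> set:
    """Return the required model names absent from Ollama's /api/tags payload."""
    installed = {m["name"] for m in tags_response.get("models", [])}
    return required - installed

# Live check:
# resp = json.loads(urllib.request.urlopen("http://localhost:11434/api/tags").read())
# print(missing_models(resp, REQUIRED) or "all models installed")

# Offline example with a sample payload:
sample = {"models": [{"name": "qwen3.5:9b"}]}
print(missing_models(sample, REQUIRED))
```

Any name this prints is a model you still need to `ollama pull`, or a typo in your config.yaml.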
Agentic coding: Qwen3-Coder-Next
Autocomplete and chat are solved problems at this point. The interesting part now is agentic coding — the model reads your codebase, plans changes across multiple files, runs tests, and iterates until the task passes.
What Qwen3-Coder-Next is
Released February 2026, Qwen3-Coder-Next is an 80B-parameter MoE model with only 3B active per token. That means it runs on 24GB VRAM while accessing far more knowledge than a traditional 3B model.
| Spec | Value |
|---|---|
| Total parameters | 80B |
| Active per token | 3B |
| Context window | 262K tokens |
| SWE-Bench Verified | 70.6% (via SWE-Agent) |
| Aider benchmark | 66.2% |
| VRAM (Q4) | ~20GB |
| Training data | 800,000 real GitHub PRs with executable tests |
For comparison, GPT-4o scores ~64% on SWE-Bench Verified. Qwen3-Coder-Next beats it with 3B active parameters running on your desk.
Setup in Continue
```bash
# Pull the model (~20GB download)
ollama pull qwen3-coder-next
```
Add it to your Continue config as an agent model:
```yaml
models:
  - name: Qwen3-Coder-Next
    provider: ollama
    model: qwen3-coder-next
    roles:
      - chat
      - edit
  - name: Qwen 3.5 9B
    provider: ollama
    model: qwen3.5:9b
    roles:
      - chat
  - name: Qwen 2.5 Coder 7B
    provider: ollama
    model: qwen2.5-coder:7b-instruct-q4_K_M
    roles:
      - autocomplete
context:
  - provider: codebase
  - provider: file
  - provider: terminal
```
Now you can switch between Qwen 3.5 9B for quick chat and Qwen3-Coder-Next for agentic tasks that need multi-file reasoning. Continue’s agent mode lets it edit files, run terminal commands, and iterate on test failures.
When to use agentic vs chat
Use chat (Qwen 3.5 9B) for: explaining code, writing a single function, generating docs, answering questions. Fast, fits on 8GB.
Use agentic (Qwen3-Coder-Next) for: implementing a feature across multiple files, refactoring an interface and updating all callers, debugging a test failure that spans several modules. Needs 24GB, slower, but handles the work that 9B models can’t.
OmniCoder-9B: agentic coding on 8GB
If 24GB for Qwen3-Coder-Next is out of reach, look at OmniCoder-9B. It’s a Qwen 3.5 9B fine-tune trained on 425,000+ agentic coding trajectories — the kind of multi-step scaffolding patterns used by Claude Code and similar tools. It scores 23.6% on Terminal-Bench 2.0, a 61% improvement over the base Qwen 3.5 9B. That’s not Qwen3-Coder-Next territory, but it fits on 8GB VRAM and actually tries to recover from errors, use edit diffs instead of full rewrites, and respond to LSP diagnostics. Available on Ollama (ollama pull carstenuhlig/omnicoder-9b) and as GGUF on Hugging Face. Apache 2.0 license.
A note on Qwen 3.5 35B-A3B
You might see people recommending the Qwen 3.5 35B-A3B (also 3B active) for coding. It’s fast — 60-100+ tok/s. But the 3B active parameters are spread across a general-purpose model, not a code-specialized one. For agentic coding tasks requiring architectural reasoning, the dense Qwen 3.5 27B or the code-specialized Qwen3-Coder-Next both perform better. The 35B-A3B is good for fast chat, not deep coding.
24GB setup: the full stack
If you have an RTX 3090 or 4090, here’s the three-model config that covers everything:
```yaml
models:
  - name: Qwen3-Coder-Next (Agent)
    provider: ollama
    model: qwen3-coder-next
    roles:
      - chat
      - edit
  - name: Qwen 3.5 9B (Quick Chat)
    provider: ollama
    model: qwen3.5:9b
    roles:
      - chat
  - name: Qwen 2.5 Coder 7B (Autocomplete)
    provider: ollama
    model: qwen2.5-coder:7b-instruct-q4_K_M
    roles:
      - autocomplete
context:
  - provider: codebase
  - provider: file
  - provider: terminal
  - provider: git
```
Ollama swaps models as you use them. Autocomplete stays on the 7B (fast, always responsive). Switch to Qwen 3.5 9B for quick questions, Qwen3-Coder-Next for heavy lifting. The swap takes a few seconds but you only pay it when switching tasks.
What works well locally
Tab completion is where local shines brightest. FIM-trained models like Qwen 2.5 Coder produce fast, accurate completions for single-line and short multi-line suggestions. Closest experience to Copilot.
Docstrings, comments, and simple refactors work well too. The code is right there in context. Rename variables, extract functions, convert a loop to a comprehension, add type hints. Single-file, well-scoped changes are reliable with 7B+ models.
Explaining unfamiliar code is solid. Paste it in, ask what it does. The full context is in the prompt, so the model doesn’t need to reason across your whole project.
Screenshot reading is where local actually beats Copilot. Qwen 3.5 is natively multimodal. Paste a screenshot of a stack trace or error dialog and the model reads it. Copilot can’t do this at all.
What still struggles locally
Large codebase context is the biggest gap. Copilot’s cloud infrastructure processes context across your workspace. Local models are limited by their context window and VRAM. Continue’s @codebase provider does RAG over your repo, but it’s not the same as having the full project in context.
Multi-file edits with small models are hit or miss. Ask a 7B to refactor an interface and update all implementations, and you’ll get inconsistent results. Qwen3-Coder-Next handles this better, but it needs 24GB.
Speed on complex prompts goes to cloud. A 32B model takes 10-15 seconds for a full response. Copilot returns similar quality in 2-3 seconds. For rapid-fire Q&A, the latency adds up.
Uncommon languages are thin. Qwen 2.5 Coder is strong in Python, TypeScript, Java, C++, Go, and Rust. For Elixir, Haskell, or very new frameworks, quality drops noticeably.
Model recommendations by VRAM
| VRAM | Autocomplete | Chat | Agentic | What you get |
|---|---|---|---|---|
| 6-8GB | Qwen 2.5 Coder 1.5B (Q4) | Qwen 3.5 9B (Q4, ~5GB) | OmniCoder-9B (Q4, ~5GB) | FIM autocomplete + multimodal chat + basic agentic. Models swap. |
| 12GB | Qwen 2.5 Coder 7B (Q4) | Qwen 3.5 9B (Q4) | — | Both fit in VRAM simultaneously. Strong combo. |
| 16GB | Qwen 2.5 Coder 7B (Q4) | Qwen 3.5 9B (Q6-Q8) | — | Higher quality chat. Room for long context. |
| 24GB | Qwen 2.5 Coder 7B (Q4) | Qwen 3.5 9B (Q4) | Qwen3-Coder-Next (Q4, ~20GB) | Full stack. Swap between chat and agent. |
| 24GB (alt) | Qwen 2.5 Coder 7B (Q4) | Qwen 2.5 Coder 32B (Q4) | — | 92.9% HumanEval. Better for pure code, no vision. |
Pull commands by tier
8GB VRAM:
```bash
ollama pull qwen2.5-coder:1.5b-instruct-q4_K_M  # autocomplete
ollama run qwen3.5:9b                           # chat (swaps with autocomplete)
```
12-16GB VRAM:
```bash
ollama pull qwen2.5-coder:7b-instruct-q4_K_M  # autocomplete
ollama run qwen3.5:9b                         # chat (both fit simultaneously on 12GB+)
```
24GB VRAM:
```bash
ollama pull qwen2.5-coder:7b-instruct-q4_K_M  # autocomplete
ollama run qwen3.5:9b                         # quick chat
ollama pull qwen3-coder-next                  # agentic coding (swaps with chat)
```
Performance: local vs Copilot
| Metric | Copilot (Cloud) | Local 8GB | Local 24GB |
|---|---|---|---|
| Tab completion latency | 200-400ms | 300-600ms (Coder 1.5B) | 300-500ms (Coder 7B) |
| Chat first-token | 1-2s | 2-4s (9B) | 2-4s (9B) |
| Chat throughput | 50-80 tok/s | 30-40 tok/s | 30-40 tok/s |
| Agentic (SWE-Bench) | ~64% (GPT-4o) | — | 70.6% (Qwen3-Coder-Next) |
| Multi-file context | Full workspace | RAG-based | RAG-based |
| Multimodal input | No | Yes (Qwen 3.5) | Yes (Qwen 3.5) |
| Monthly cost | $10-19 | $0 | $0 |
| Code privacy | Sent to Microsoft | On your machine | On your machine |
| Works offline | No | Yes | Yes |
| Rate limits | Yes (tightening) | No | No |
Local wins on privacy, cost, multimodal, and agentic benchmarks. Copilot wins on latency and multi-file context. Most developers who switch keep both for a week, then drop Copilot once they realize they’re not missing much.
Troubleshooting
Autocomplete not showing up? Make sure Ollama is running (ollama serve or check the system tray). Verify the model name in config.yaml matches what you pulled. Open the Continue output panel (View → Output → Continue) for error messages.
Completions are slow? Run ollama ps to check memory usage. If the model doesn’t fully fit in VRAM, inference drops to CPU speed. Use a smaller model or more aggressive quantization (Q3_K_M).
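A rough rule of thumb for whether a model fits: weights take about parameters × quantization bits ÷ 8, plus overhead for KV cache and activations. A back-of-the-envelope sketch — the 1.2 overhead factor and the ~4.5 effective bits for Q4_K_M are assumptions, not measured constants:

```python
def est_vram_gb(params_billions: float, quant_bits: float,
                overhead: float = 1.2) -> float:
    """Rough VRAM estimate in GB: quantized weights plus ~20% overhead."""
    weight_gb = params_billions * quant_bits / 8
    return round(weight_gb * overhead, 1)

print(est_vram_gb(7, 4.5))    # ~4.7 GB: Qwen 2.5 Coder 7B at Q4_K_M fits on 8GB
print(est_vram_gb(32, 4.5))   # ~21.6 GB: a 32B at Q4 needs a 24GB-class card
```

If the estimate lands near your card's total VRAM, expect partial CPU offload and a big slowdown; drop a quantization level or pick a smaller model.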
Chat works but autocomplete doesn’t? Your chat model doesn’t support FIM. Make sure tabAutocompleteModel is set to a Qwen 2.5 Coder variant, not a general chat model.
Models swapping constantly? If autocomplete and chat together exceed your VRAM, Ollama swaps between them. Set OLLAMA_KEEP_ALIVE=10m to keep the active model loaded longer, or use smaller models that both fit.
Qwen 3.5 won’t load? You need Ollama 0.17.4+. Run ollama --version to check. Update: curl -fsSL https://ollama.com/install.sh | sh. The Gated DeltaNet architecture in Qwen 3.5 requires the newer runtime.
Qwen 3.5 giving repetitive output? Fixed in Ollama 0.17.5. Update and redownload: ollama rm qwen3.5:9b && ollama run qwen3.5:9b.
The bottom line
Continue + Ollama is still the best free Copilot replacement as of March 2026. If Continue feels too heavy, Twinny does FIM + chat with less config. If you run llama.cpp directly, llama.vscode skips the Ollama layer. The recommended stack:
- Autocomplete: Qwen 2.5 Coder 7B (FIM-trained, fast)
- Chat: Qwen 3.5 9B (multimodal, 262K context, good reasoning)
- Agentic coding: Qwen3-Coder-Next on 24GB (70.6% SWE-Bench, multi-file edits)
You’ll trade some latency for complete privacy and zero recurring cost. For most developers — especially those working on proprietary code or tired of another subscription — that trade is easy to make.
```bash
# The 5-minute setup
curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen2.5-coder:7b-instruct-q4_K_M
ollama run qwen3.5:9b
# Install Continue in VS Code, paste the config from above, code.
```
Next steps:
- Best coding models in detail — deeper benchmarks and comparisons
- llama.cpp vs Ollama vs vLLM — which inference engine for what
- VRAM requirements by model — what fits on your GPU
- Local alternatives to Claude Code — terminal-based agentic coding
- Ollama troubleshooting — fix common setup issues