
GitHub Copilot costs $10-19/month. ChatGPT Plus is $20. Claude Pro is $20. That’s $120-240 per year for coding assistance that sends every line of your code to someone else’s servers.

Local coding models flip that equation. You run them on your own hardware, your code never leaves your machine, and the total cost after setup is zero. The tradeoff used to be quality — local models couldn’t compete with cloud. That changed in 2025. Open-source coding models now match GPT-4o on standard benchmarks, and they run on hardware most developers already own.

This guide covers which models to use at every VRAM tier, how they compare on real benchmarks, and exactly how to set them up in your editor.


Why Code Locally?

Four reasons developers are switching:

Your code stays private. Every prompt you send to Copilot or ChatGPT passes through corporate servers. If you’re working on proprietary code, client projects, or anything sensitive, that’s a risk. Local models process everything on your machine. Nothing leaves.

No recurring costs. $19/month for Copilot Business doesn’t sound like much until you multiply it across a team or count years. Local models are free after the hardware investment — which you’ve probably already made if you have a GPU.

Works offline. Planes, coffee shops with bad WiFi, air-gapped environments, or just your ISP having a bad day. Local models don’t care. No internet required.

No rate limits or surprise changes. No throttling during peak hours, no model swap without notice, no features removed from your tier. You control the model, the version, and the configuration.


What Makes a Good Coding Model

Not all LLMs are equal at code. The best coding models share four traits:

Code completion (FIM). Fill-in-the-middle support means the model can complete code given both the text before and after the cursor. This is what powers inline autocomplete in your editor. Not all models support FIM — coding-specific ones do.

Instruction following. “Refactor this function,” “explain this regex,” “write tests for this module.” The model needs to follow natural-language instructions about code precisely, not just generate code from scratch.

Multi-language support. Most developers work across at least 2-3 languages. The model should handle Python, JavaScript/TypeScript, and your stack’s other languages without falling apart.

Sufficient context window. Code context matters. A model that can only see 2K tokens is useless when your function depends on types defined 500 lines up. You want at least 8K, ideally 32K+.

The Benchmarks That Matter

| Benchmark | What It Tests | Why It Matters |
| --- | --- | --- |
| HumanEval | Generate correct Python functions from docstrings (164 tasks) | The standard for code generation quality |
| HumanEval+ | Same tasks, 80x more test cases | Catches models that pass easy tests but fail edge cases |
| MBPP | 974 programming problems | Broader than HumanEval, tests practical coding |
| MultiPL-E | HumanEval translated to 18+ languages | Shows if the model only knows Python or actually handles JS, Rust, etc. |
| LiveCodeBench | 600+ real coding contest problems | Tests harder, more realistic tasks |

HumanEval pass@1 is the most commonly reported number. Higher is better. For reference: GPT-4o scores ~90%, GPT-3.5 scored ~48% when it launched.
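For context on where these percentages come from: pass@k is estimated from n sampled completions per task, of which c pass the tests. A minimal sketch of the standard unbiased estimator introduced with HumanEval (the function name is ours, not from any benchmark library):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: the probability that at least one
    of k completions drawn from n samples (c of them correct) passes."""
    if n - c < k:
        return 1.0  # not enough failing samples to fill k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 samples per task, 4 passed: pass@1 reduces to the raw pass rate c/n
print(pass_at_k(10, 4, 1))  # 0.4
```

With a single sample per task (n = 1, k = 1), this is just the fraction of tasks solved, which is what most model cards report as pass@1.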


Best Models by VRAM Tier

8GB VRAM (RTX 4060, 3070, 3060 Ti)

This is the most common GPU tier for developers. The good news: the best 7B coding model now outperforms much larger models from a year ago.

| Model | HumanEval | FIM | Context | VRAM (Q4) | Best For |
| --- | --- | --- | --- | --- | --- |
| Qwen 2.5 Coder 7B | 88.4% | Yes | 128K | ~5 GB | Best for autocomplete/FIM |
| Qwen 3.5 9B ⭐ | —† | No | 262K | 6.6 GB | Best for chat coding, multimodal |
| DeepSeek Coder V2 Lite | 81.1% | Yes | 128K | ~5 GB | Reasoning-heavy tasks |
| DeepSeek Coder 6.7B | ~65% | Yes | 16K | ~4.5 GB | Lightweight, fast |
| CodeLlama 7B | ~30% | Yes | 16K | ~4.5 GB | Legacy. Skip for new setups. |

†Qwen 3.5 9B isn’t a coding-specific model so HumanEval isn’t reported. Its LiveCodeBench v6 score is 65.6, and it beats Qwen3-30B on GPQA Diamond (81.7 vs 77.2) and IFEval (91.5 vs 88.9).

For autocomplete: Qwen 2.5 Coder 7B. Still the FIM king. 88.4% HumanEval at 7B parameters, beats CodeStral-22B and DeepSeek Coder 33B V1. FIM support, 128K context, 92+ languages. This is what powers your tab-complete.

ollama pull qwen2.5-coder:7b

For chat-based coding: Qwen 3.5 9B. This dropped March 2, 2026 and it’s a different animal. Not a coding-specific model, but a general-purpose model with native multimodal and 262K context that happens to be excellent at code. What makes it worth running alongside your autocomplete model:

  • Reads images natively. Paste a code screenshot, whiteboard photo, or error dialog and it understands them. No separate vision model. Same weights handle text and images.
  • 262K context. Enough to fit large files or multiple files in one prompt. The Qwen 2.5 Coder 7B tops out at 128K and eats more VRAM doing it.
  • Thinking mode. Enable it for harder problems: ollama run qwen3.5:9b with /set parameter num_predict 81920 and the model generates <think> blocks before answering. This substantially improves results on complex debugging and refactoring tasks.
  • 6.6GB on Ollama at Q4_K_M. Fits 8GB VRAM with room to spare.
ollama run qwen3.5:9b

The play at 8GB: run Qwen 2.5 Coder 7B for autocomplete, swap to Qwen 3.5 9B for chat, debugging, and code review. They don’t need to run simultaneously.

DeepSeek Coder V2 Lite is still solid for reasoning-heavy tasks. 16B MoE with only 2.4B active parameters, so it’s fast and memory-efficient.

CodeLlama 7B is showing its age. At ~30% HumanEval, it’s been lapped by models half its size from newer families. Only use it if you have a specific reason (e.g., Meta’s Llama license requirements).

Laptops and Integrated Graphics: Qwen 3.5 4B

If you’re on a laptop with no discrete GPU, or you need a coding assistant on integrated graphics, the Qwen 3.5 4B is the new minimum viable coding model. It dropped alongside the 9B on March 2, 2026.

At 3.4GB on Ollama, it runs on basically anything with 4GB of RAM to spare. LiveCodeBench v6 score of 55.8, GPQA Diamond of 76.2, and it’s natively multimodal with the same 262K context as its bigger sibling. It won’t replace a proper coding model for complex tasks, but for quick code explanations, simple refactoring, and reading error screenshots on a thin laptop, nothing else at this size comes close.

ollama run qwen3.5:4b

This replaces anything sub-4B you might have been running for lightweight coding. The full breakdown is in our Qwen 3.5 small-models guide.

12-16GB VRAM (RTX 3060 12GB, 5060 Ti 16GB, 4060 Ti 16GB)

The mid-range tier opens up significantly better models. If you’ve got 16GB of VRAM, you can run 14B models at high quality or squeeze in a quantized 33B.

| Model | HumanEval | FIM | Context | VRAM (Q4) | Best For |
| --- | --- | --- | --- | --- | --- |
| Qwen 2.5 Coder 14B | ~89% | Yes | 128K | ~9 GB | Best at this tier |
| DeepSeek Coder 33B (Q3) | 70% | Yes | 16K | ~16 GB | 24GB model squeezed down |
| CodeLlama 13B | ~36% | Yes | 16K | ~8.5 GB | Legacy. Outclassed. |

The winner: Qwen 2.5 Coder 14B. It surpasses CodeStral-22B and DeepSeek Coder 33B on benchmarks despite being smaller. At Q4 quantization it needs only ~9GB, leaving room for long context on a 16GB card. State-of-the-art on over 10 code evaluation benchmarks.

ollama pull qwen2.5-coder:14b

DeepSeek Coder 33B at Q3 quantization technically fits in 16GB, but it’s a tight squeeze with degraded quality. If you have exactly 16GB, the Qwen 14B at higher quantization (Q5 or Q6) will outperform it in practice.

24GB VRAM (RTX 3090, 4090)

This is where local coding gets seriously competitive with cloud models. A used RTX 3090 at $700-850 gives you access to models that match GPT-4o.

| Model | HumanEval | FIM | Context | VRAM (Q4) | Best For |
| --- | --- | --- | --- | --- | --- |
| Qwen 2.5 Coder 32B | 92.7% | Yes | 128K | ~20 GB | Best FIM/autocomplete at 24GB |
| Qwen 3.5 27B | — | No | 262K | ~16 GB | Dense, ties GPT-5 mini on SWE-bench |
| Qwen 3.5 35B-A3B | — | No | 262K | ~20 GB | MoE, fast agentic coding |
| DeepSeek Coder 33B | 70% (78%*) | Yes | 16K | ~20 GB | Older but solid |
| CodeLlama 34B | 53.7% | Yes | 16K | ~20 GB | Legacy. Outclassed. |

*78% with CodeFuse fine-tuning

For autocomplete: Qwen 2.5 Coder 32B. Still the FIM king at 24GB. 92.7% HumanEval, 73.7 on Aider (code repair), 128K context. At Q4_K_M (~20GB), it fits on a single 24GB card.

ollama pull qwen2.5-coder:32b

For chat coding and reasoning: Qwen 3.5 27B. A dense 27B model that ties GPT-5 mini on SWE-bench Verified (72.4%) and fits at ~16GB Q4, leaving 8GB for context on a 24GB card. No FIM support, but the 262K context window and native multimodal make it a strong pair with the Coder 32B. Use Coder for tab-complete, 27B for “refactor this module” conversations.

The March benchmarks filled in the picture: LiveCodeBench v6 of 80.7 and Terminal-Bench 2 of 41.6 (GPT-5 mini scores 31.9 on the same test). It also has native tool calling via the qwen3_coder parser — VS Code extensions like Continue can now dispatch tool calls directly to the 27B for file edits, shell commands, and web lookups without external wrappers.

ollama run qwen3.5:27b

For fast agentic coding: Qwen 3.5 35B-A3B. This MoE model holds 35B total parameters but only fires 3B per token. On an RTX 3090, one user measured 112 tok/s, and another scaffolded 10 files and 3,483 lines from a single spec. It beats the previous-gen Qwen3-235B-A22B on GPQA Diamond (84.2 vs 81.1) with 1/7th the parameters. The catch: it needs ~20GB to load despite only 3B being active, so you won’t fit it alongside another model on 24GB.

ollama run qwen3.5:35b-a3b

The 24GB tier now has three distinct coding strategies: Coder 32B for FIM, 27B dense for chat reasoning, 35B-A3B MoE for fast agentic work. If you have a 3090 or 4090, pick two and swap between them.

48GB+ / Mac: Qwen3-Coder-Next (Agentic Coding)

Released February 3, 2026, Qwen3-Coder-Next is a different beast. It’s an 80B MoE that activates only 3B parameters per token, built specifically for agentic coding workflows — the kind where the model reads your repo, plans changes across multiple files, and executes them. It’s instruct-mode, not thinking — no <think> blocks, just direct tool-use and code generation, which keeps latency low for agentic loops.

| Metric | Score |
| --- | --- |
| SWE-bench Verified | 70.6% |
| SWE-bench Pro | 44.3% |
| SWE-rebench Pass@5 | 64.6% (#1 overall) |
| Context | 256K (extendable to 1M with YaRN) |
| VRAM (Q4) | ~35-40 GB |
| License | Apache 2.0 |

That SWE-rebench Pass@5 of 64.6% is the headline number: it’s #1 on the entire SWE-rebench leaderboard, beating every closed model including Claude Opus 4.6 (58.3%), GPT-5.2-medium (60.4%), and Gemini 3 Pro (58.3%). SWE-bench Verified at 70.6% puts it ahead of DeepSeek V3.2 (70.2%) on real-world GitHub issue resolution. The 3B active parameter count means throughput is high for its capability level — 20-40 tok/s on consumer hardware, 10-15 tok/s on a Mac Mini M4 Pro 64GB.

ollama pull qwen3-coder-next

The catch: VRAM. At Q4, it needs ~35-40GB. A single RTX 3090/4090 can’t hold it without CPU offload (which drops you to ~12 tok/s). This model shines on:

  • Mac with 48GB+ unified memory — runs natively, no offloading
  • Dual 24GB GPUs — if you’ve got two 3090s
  • CPU offload on 24GB GPU + 32GB RAM — usable but slower

Unsloth requantized GGUFs (March update): Unsloth’s initial Qwen3-Coder-Next GGUFs had a bug that applied MXFP4 quantization to attention tensors instead of just routed expert weights, causing measurable quality degradation. This was fixed on February 27 — all quants were regenerated. If you downloaded before that date, re-download. Use Q4_K_M or MXFP4_MOE variants for the best quality-to-size ratio.

This isn’t a replacement for Qwen 2.5 Coder 32B on a single 24GB card. It’s what you graduate to when you want agentic workflows — tools like Aider or Claude Code pointed at a local model that can reason across an entire codebase. If you have the memory, it’s the strongest open-source coding agent model available right now.


The Master Comparison

Every model side by side:

| Model | Params | HumanEval | SWE-bench | VRAM (Q4) | FIM | Context | License |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3-Coder-Next ⭐ | 80B (3B active) | — | 70.6% (rebench P@5: 64.6%) | ~38 GB | No | 256K | Apache 2.0 |
| Qwen 3.5 35B-A3B | 35B (3B active) | — | 69.2% | ~20 GB | No | 262K | Apache 2.0 |
| Qwen 3.5 27B | 27B | — | 72.4% | ~16 GB | No | 262K | Apache 2.0 |
| Qwen 2.5 Coder 32B | 32B | 92.7% | — | ~20 GB | Yes | 128K | Apache 2.0 |
| Qwen 2.5 Coder 14B | 14B | ~89% | — | ~9 GB | Yes | 128K | Apache 2.0 |
| Qwen 2.5 Coder 7B | 7B | 88.4% | — | ~5 GB | Yes | 128K | Apache 2.0 |
| Qwen 3.5 9B | 9B | —† | — | 6.6 GB | No | 262K | Apache 2.0 |
| Qwen 3.5 4B | 4B | —† | — | 3.4 GB | No | 262K | Apache 2.0 |
| DS Coder V2 Lite | 16B (2.4B active) | 81.1% | — | ~5 GB | Yes | 128K | MIT |
| DS Coder 33B (V1) | 33B | 70% | — | ~20 GB | Yes | 16K | Permissive |
| CodeLlama 34B | 34B | 53.7% | — | ~20 GB | Yes | 16K | Llama |
| CodeLlama 7B | 7B | ~30% | — | ~4.5 GB | Yes | 16K | Llama |

†Qwen 3.5 models are general-purpose multimodal, not coding-specific. LiveCodeBench v6: 65.6 (9B) / 55.8 (4B). They excel at reasoning, instruction-following, and vision tasks — see full benchmarks.

There are now four tiers of coding model, each best at something different. Qwen 2.5 Coder owns FIM autocomplete — nothing touches it for tab-complete at any size. Qwen 3.5 9B/4B handle chat-based coding with native vision and 262K context. At 24GB, the Qwen 3.5 27B (72.4% SWE-bench, dense) and 35B-A3B (112 tok/s on a 3090, MoE) give you two strong options for chat coding and agentic workflows. And Qwen3-Coder-Next is the 48GB+ agentic option — it resolves real GitHub issues at 70.6% on SWE-bench, frontier-model territory on hardware you own.

→ Check what fits your hardware with our Planning Tool.


Best Model by Language

All models above are multi-language, but some have particular strengths.

| Language | Best Local Model | Notes |
| --- | --- | --- |
| Python | Qwen 2.5 Coder (any size) | Best benchmarked language across all models |
| JavaScript/TypeScript | Qwen 2.5 Coder 14B+ | Strong JS/TS support; 7B handles it well too |
| Rust | Qwen 2.5 Coder 32B | Smaller models struggle with Rust’s borrow checker; 32B handles it |
| Go | Qwen 2.5 Coder 14B+ | Clean Go output from 14B up |
| C/C++ | DeepSeek Coder 33B | Slightly better at low-level memory management patterns |
| Java | Qwen 2.5 Coder 14B+ | Good boilerplate generation, understands frameworks |
| SQL | Qwen 2.5 Coder (any size) | 82% on Spider benchmark — well ahead of competitors |

The honest caveat: For Python and JavaScript, the 7B Qwen Coder is genuinely excellent. For Rust, C++, and other complex compiled languages, bigger models produce noticeably better results. If Rust is your primary language and you only have 8GB, expect some friction — the model will get syntax right but occasionally misunderstand lifetime annotations or trait bounds.


How to Set Up Local Coding in Your Editor

Option 1: Continue + Ollama (Recommended)

This is the free, open-source Copilot replacement. Continue is a VS Code extension that connects to Ollama for both chat and autocomplete.

Step 1: Install Ollama

If you haven’t already, follow our Ollama setup guide. One command on any OS.

Step 2: Pull your coding model

# Pick your tier:
ollama pull qwen2.5-coder:7b     # 8GB VRAM
ollama pull qwen2.5-coder:14b    # 16GB VRAM
ollama pull qwen2.5-coder:32b    # 24GB VRAM

Step 3: Install Continue extension

Open VS Code β†’ Extensions (Ctrl+Shift+X) β†’ Search “Continue” β†’ Install.

Step 4: Configure Continue

Create or edit ~/.continue/config.yaml:

name: Local Coding
version: 0.0.1
schema: v1
models:
  - uses: ollama/qwen2.5-coder-7b

For autocomplete (tab completion), Continue uses a separate, smaller model by default. You can also point it at your main coding model.
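One way to wire that up is to give each model an explicit role. A sketch of the explicit (non-hub) config form, assuming both models are already pulled in Ollama; the field names follow Continue's config.yaml schema, but verify against the docs for your extension version:

```yaml
name: Local Coding
version: 0.0.1
schema: v1
models:
  - name: Qwen2.5 Coder 7B        # main model for chat and edits
    provider: ollama
    model: qwen2.5-coder:7b
    roles:
      - chat
      - edit
  - name: Qwen2.5 Coder 1.5B      # small, fast model for tab-complete
    provider: ollama
    model: qwen2.5-coder:1.5b
    roles:
      - autocomplete
```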

Step 5: Start coding

  • Chat: Click the Continue icon in the sidebar, ask questions about your code
  • Autocomplete: Start typing and suggestions appear inline
  • Edit: Select code, press Ctrl+I, describe the change you want

Everything runs locally. No API keys, no accounts, no internet.

Option 2: LM Studio as Backend

If you prefer LM Studio’s visual interface, you can use it as a backend for Continue too. Start LM Studio’s local server, then point Continue at http://localhost:1234/v1.
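In Continue terms that means swapping the provider. A sketch, assuming LM Studio's server is running on its default port; the model identifier is a placeholder, so use whatever ID LM Studio displays for your loaded model:

```yaml
models:
  - name: LM Studio Coder
    provider: lmstudio                # OpenAI-compatible local server
    model: qwen2.5-coder-7b-instruct  # replace with the ID LM Studio shows
    apiBase: http://localhost:1234/v1
```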

This is useful if you like browsing and switching between models visually.

Option 3: Tabby (Self-Hosted Copilot)

Tabby is a self-hosted AI coding assistant with its own VS Code extension, JetBrains plugin, and Vim support. It’s more opinionated than Continue — closer to a full Copilot replacement with built-in code indexing.

docker run -d --gpus all -p 8080:8080 \
  -v $HOME/.tabby:/data \
  registry.tabbyml.com/tabbyml/tabby serve \
  --model Qwen2.5-Coder-7B \
  --device cuda

Tabby works well for teams who want a shared local coding server. For solo developers, Continue + Ollama is simpler.

Option 4: Aider (Terminal)

Aider is a terminal-based coding assistant that edits files directly. Point it at your local Ollama instance:

pip install aider-chat
aider --model ollama/qwen2.5-coder:7b

It understands your git repo, makes edits across files, and creates commits. Best for developers who live in the terminal.


Practical Tips

FIM vs. Chat: Know the Difference

FIM (Fill-in-the-Middle) powers inline autocomplete — the cursor is in the middle of your code and the model predicts what goes there. This is what makes “tab complete” work. Qwen 2.5 Coder and DeepSeek Coder both support FIM.

Chat mode is for conversations: “explain this function,” “refactor this class,” “write tests.” It’s a different inference mode. Most editors let you use both simultaneously — FIM for autocomplete, chat for dialogue.
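Under the hood, FIM is just a prompt format. A minimal sketch using Qwen 2.5 Coder's FIM special tokens; the helper only assembles the string, and the commented request shows how it could be sent to Ollama's generate endpoint in raw mode (so the server applies no chat template):

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a Qwen 2.5 Coder fill-in-the-middle prompt.
    The model generates the code that belongs between the two parts."""
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

prompt = build_fim_prompt(
    "def add(a, b):\n    return ",  # code before the cursor
    "\n\nprint(add(2, 3))",         # code after the cursor
)

# To get a completion, POST to a local Ollama server (not done here):
#   POST http://localhost:11434/api/generate
#   {"model": "qwen2.5-coder:7b", "prompt": prompt, "raw": true}
print(prompt)
```

Editor plugins like Continue build exactly this kind of prompt on every keystroke, which is why FIM support in the model matters for autocomplete.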

Keep a Small Model for Autocomplete

Autocomplete needs to be fast — under 200ms ideally. On 8GB, your main 7B coding model handles both chat and FIM fine. On 16-24GB, consider running a smaller model (Qwen 2.5 Coder 1.5B or 3B) for autocomplete and your bigger model for chat. This keeps tab-complete snappy while giving you full power for longer tasks.

Context Window vs. VRAM

Bigger context windows eat more VRAM. If you’re working on a large codebase and want the model to “see” more files, you’ll burn through your VRAM headroom fast. On 8GB, stick to 4-8K context for coding. On 16GB, you can push to 16K. On 24GB, 32K is comfortable.

For navigating large codebases, quantization at Q4_K_S instead of Q4_K_M saves a few hundred MB that can go toward context.
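Where does the context VRAM go? Mostly into the KV cache, which grows linearly with context length. A back-of-envelope sketch; the layer and head counts below are illustrative defaults, not any specific Qwen architecture, so check the model card for real values:

```python
def kv_cache_gib(context_len: int, layers: int = 28, kv_heads: int = 4,
                 head_dim: int = 128, bytes_per_val: int = 2) -> float:
    """Approximate KV cache size: 2 tensors (K and V) per layer, each
    holding context_len x kv_heads x head_dim values (fp16 = 2 bytes)."""
    total_bytes = 2 * layers * context_len * kv_heads * head_dim * bytes_per_val
    return total_bytes / 1024**3

# The same hypothetical model at two context sizes:
print(round(kv_cache_gib(8_192), 2))   # ~0.44 GiB
print(round(kv_cache_gib(32_768), 2))  # ~1.75 GiB
```

Models with grouped-query attention keep kv_heads small, which is why long contexts are feasible at all; older full-attention architectures pay several times more per token of context.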

Tool Calling Fixes (March 2026)

If you’re using Qwen models for agentic coding (tool calls, function calling), two recent updates matter:

llama.cpp merged fixes for Qwen’s XML-tagged tool call format. Earlier builds would choke on the non-standard XML tags Qwen3-Coder uses for function names and parameters, or hit a key_gdiff calculation bug that caused looping output. The Qwen 3.5 chat template also had a tool-calling bug that’s now patched. If you built llama.cpp before early March 2026, rebuild from source.

exllamav3 added full Qwen 3.5 support — both the dense models (27B) and MoE variants (35B-A3B), including multimodal. If you’re running ExLlama for faster inference, Qwen 3.5 models now work out of the box.

Close Your Browser

Same advice as for any 8GB VRAM workload: Chrome’s hardware acceleration eats GPU memory. Close it or disable GPU acceleration when running local models. On 24GB this matters less, but on 8-16GB it can be the difference between smooth inference and OOM errors.


The Bottom Line

Local coding models now cover three distinct jobs:

  1. Autocomplete/FIM: Qwen 2.5 Coder (7B/14B/32B) — still the best at every tier, nothing has displaced it.
  2. Chat coding: Qwen 3.5 9B — native vision, 262K context, thinking mode. The full family runs from 0.8B to 397B, all Apache 2.0.
  3. Agentic coding: Qwen3-Coder-Next — #1 on SWE-rebench Pass@5 (64.6%), 70.6% SWE-bench Verified, needs 48GB+ but only activates 3B params so it’s fast. Instruct-mode, not thinking.

The 2-model setup for 8GB VRAM:

# 1. Install Ollama (if you haven't)
curl -fsSL https://ollama.com/install.sh | sh

# 2. Autocomplete model (FIM, tab-complete)
ollama pull qwen2.5-coder:7b

# 3. Chat/reasoning model (multimodal, 262K context)
ollama pull qwen3.5:9b

# 4. Install Continue extension in VS Code, configure, and code

No subscriptions. No data leaving your machine. No rate limits. Just you, your code, and two models that cover different parts of the workflow.



Sources: Qwen2.5-Coder Technical Report, Qwen3-Coder-Next Model Card, Qwen3.5-27B Model Card, Qwen3.5-9B Model Card, Qwen 3.5 on Ollama, SWE-rebench Leaderboard, Unsloth Qwen3-Coder-Next GGUF, ExLlamaV3 GitHub, DeepSeek Coder GitHub, Continue.dev Ollama Guide, EvalPlus Leaderboard, Tabby