Best Local Coding Models Ranked: Every VRAM Tier, Every Benchmark (2026)
๐ More on this topic: Qwen 3.6 Complete Guide ยท DeepSeek V4 Flash vs Pro ยท Local Claude Code Alternatives ยท llama.cpp vs Ollama vs vLLM ยท VRAM Requirements
GitHub Copilot costs $10-19/month. ChatGPT Plus is $20. Claude Pro is $20, and as of April 4, 2026, it no longer covers third-party agent harnesses like OpenClaw โ Anthropic pushed those users to per-token API billing (context here). That change pushed a wave of “switch to local” posts on r/LocalLLaMA, right as the local options got a lot better.
Two releases inside 48 hours reset this list. Qwen3.6-27B dense landed April 22 with Qwen claiming “flagship-level agentic coding” in a 27B model. DeepSeek V4-Flash landed the evening of April 23 โ 284B total / 13B active, MIT-licensed, priced at Haiku-tier rates through the DeepSeek API. Add the mid-April Qwen3.6-35B-A3B MoE that runs on a 16GB card with RAM offload and the picture looks different than it did three weeks ago.
This guide covers which model to run at every VRAM tier, what the benchmarks say vs what community users report, and how to wire a local coding stack up in your editor.
What’s New (May 2026)
The April 24 update brought Qwen 3.6 and DeepSeek V4 into this list. Two more weeks of release cadence have not slowed down. Here’s what changed since.
Gemma 4 26B-A4B MoE. Google’s first MoE that actually competes for the local-coding seat. 26B total, 4B active, 1441 LMArena. An r/LocalLLaMA thread today clocked a 5090 user at 600 tok/s on this model with vLLM and aggressive batching. The hardware fit is what makes it interesting: 4B active means the inference math is cheap, and decoupled-attention runtimes like LARQL run it on hardware that should not handle 26B weights. The Gemma 4 local guide covers VRAM and quants.
DeepSeek V4 cost economics got real. V4 Pro hit FoodTruck Bench on May 5 at 17x cheaper than GPT-5.2 for the same evaluation pass. V4 Flash remains the low-latency pick for tool-calling agent loops. The V4 Flash vs Pro guide has the breakdown.
Speculative decoding moved from research to production. I built and benched DFlash on my own RTX 3090 at 2.56x mean speedup on Qwen 3.6-27B Q4_K_M. Mainline llama.cpp added MTP for Qwen 3.6 via PR #22673 on May 4. The DFlash vs MTP head-to-head covers which fits which workload.
The community verdict. A May 8 r/ollama thread directly compared Qwen 3.6, Qwen3-Coder, and DeepSeek-Coder. The Qwen 3.6 family won across the board on local hardware. The 27B dense for tool-using coding agents, the 35B-A3B MoE for general-purpose work on tighter VRAM. If you’re not running Qwen, DeepSeek, or Gemma 4 in 2026, you’re behind the current generation.
The body of this article still maps out the per-tier picks. The Qwen 3.6 + DeepSeek V4 + Gemma 4 trio is the headline answer.
The Qwen 3.6-27B Moment
Qwen dropped Qwen3.6-27B on April 22, 2026, and labeled it, in their own words, “flagship-level agentic coding performance, surpassing the previous-generation open-source flagship.” The previous flagship was the 397B-A17B MoE. A 27B dense model claiming to beat a 397B MoE on coding is either marketing or a real step change. The early returns suggest a real step change, with caveats.
Qwen’s own benchmarks (per the Qwen3.6-27B model card):
| Benchmark | Qwen3.6-27B |
|---|---|
| SWE-bench Verified | 77.2 |
| SWE-bench Pro | 53.5 |
| SWE-bench Multilingual | 71.3 |
| Terminal-Bench 2.0 | 59.3 |
| LiveCodeBench v6 | 83.9 |
| MMLU-Pro | 86.2 |
SWE-bench Verified at 77.2 is the headline. That puts the 27B in the same range as Claude Sonnet 4.6 on real-world GitHub-issue resolution, per Qwen’s own comparison. Independent AA Agentic Index placement is still settling โ the Artificial Analysis Intelligence Index currently shows Claude Opus 4.7 at 57 and DeepSeek V4 Pro at 52 in the top 10, with Qwen 3.6-27B not yet ranked there. Treat Qwen’s own numbers as the model card, not as independent verification.
What the community actually saw. Simon Willison’s Qwen 3.6-27B post benchmarked the Q4_K_M GGUF (16.8GB) on llama-server at 65K context: 25.57 tok/s generating a pelican-on-a-bicycle SVG, which he called “an outstanding result for a 16.8GB local model.” r/LocalLLaMA posts from the week of April 22 report about 50 tok/s on an RTX 5090 at Q6_K with 200K context loaded. An RTX 5080 at 16GB with aggressive quantization comes in closer to 6 tok/s โ usable, not fast.
Claude Code / OpenCode compatibility. Multiple r/LocalLLaMA threads in the week after release report pointing Claude Code and OpenCode at a local Qwen3.6-27B endpoint and getting genuinely useful agentic runs โ “vibe codes perfectly fine” in the phrasing one commenter used. That’s week-one harness feedback, not a verdict, but the pattern is consistent across posts.
Honest caveats. The r/LocalLLaMA reaction split. Skeptics pointed at the pelican benchmark and asked whether Qwen trained on the test. Hacker News commenters raised Goodhart’s Law concerns and flagged tool-use failures where the model repeats failed actions without reading back context. Dense architecture means higher VRAM cost per unit capability than an MoE at the same capability class. And the sampling parameters matter โ a lot. Qwen’s card recommends temperature 0.6, top_p 0.95, top_k 20, presence_penalty 0.0 for “precise coding” in thinking mode. Ignore those and you get a different model.
Bottom line on 3.6-27B. Strong at what it was trained for. That covers most of what people actually use a coding model for. Not a drop-in frontier-cloud replacement for the hardest cases. For a 24GB GPU, it’s the new default.
ollama pull qwen3.6:27b
Qwen 3.6-35B-A3B: Runs on 16GB
The MoE sibling shipped mid-April. 35B total parameters, 3B active per token, 256 experts with 8 routed + 1 shared active per token. Apache 2.0, 262K context, same hybrid DeltaNet + Attention stack as the 27B. The story here is the VRAM math: the model loads 35B of weights into memory but only fires 3B on each token, so throughput is fast and the memory footprint is manageable.
Qwen’s own benchmarks (per the Qwen3.6-35B-A3B model card):
| Benchmark | Qwen3.6-35B-A3B |
|---|---|
| SWE-bench Verified | 73.4 |
| SWE-bench Pro | 49.5 |
| Terminal-Bench 2.0 | 51.5 |
| LiveCodeBench v6 | 80.4 |
Same caveat as above: Qwen’s numbers. Directionally matches the early community reports.
What it runs on. Per the Unsloth Qwen3.6-35B-A3B GGUF card:
| Quant | File size |
|---|---|
| UD-IQ1_M | 10 GB |
| UD-Q3_K_M | 16.6 GB |
| UD-Q4_K_M | 22.1 GB |
| Q8_0 | 36.9 GB |
A 24GB card runs UD-Q4_K_M clean. A 16GB card runs UD-Q3_K_M clean or UD-Q4_K_M with KV-cache offload to RAM. An 8GB card plus 64GB of system RAM runs it via llama.cpp hybrid offload โ slower, but usable for agent loops per Unsloth’s docs.
Real throughput. Amine Raji’s benchmark on an RTX 3090 at UD-Q4_K_XL: 101.7 tok/s short-prompt, 80.9 tok/s long-prompt. That’s about 30% slower than 3.5-35B-A3B on the same card โ a Gated DeltaNet gap in current llama.cpp, not a model regression. Expect it to close in the next few llama.cpp releases.
Where it disappoints. Community reports have been direct: on long multi-step agent runs, the MoE variant has been described as “getting lost as the task requires more steps.” Cross-turn consistency under tight system prompts has drifted in some harnesses. The structural explanation: tokens route to different experts, each handles its slice well, but the whole-task coherence softens. For dense-model strictness on multi-step agentic work, the 27B is the better pick if you have the VRAM.
ollama pull qwen3.6:35b-a3b
DeepSeek V4-Flash for Coding
DeepSeek V4 preview dropped the evening of April 23, 2026 as two MoE checkpoints. V4-Pro at 1.6T / 49B active is workstation-or-server hardware territory. V4-Flash at 284B / 13B active is the story for everyone else. Both are MIT-licensed. Both carry a 1M-token context window. Both use FP4 for MoE experts and FP8 elsewhere.
DeepSeek API pricing, per DeepSeek’s official pricing page:
| Model | Input (cache miss) | Input (cache hit) | Output |
|---|---|---|---|
| V4-Flash | $0.14/M | $0.028/M | $0.28/M |
| V4-Pro | $1.74/M | $0.145/M | $3.48/M |
$0.14 per million input tokens is Haiku-tier. For agents and tool-calling pipelines that burn tokens fast, this is basically free by frontier standards.
Early community reports. An r/LocalLLaMA thread titled “Tested Deepseek v4 flash with some large code change evals” (posted within 24 hours of release) reports Flash handles multi-file refactors in the same ballpark as Claude Haiku with the advantage of a 1M context window for whole-repo tasks. Vals AI’s Vibe Code Benchmark showed V4 “overwhelmingly” topping open-source models with roughly a 10x jump from V3.2 โ Vals AI’s framing, not an independent re-run. LM Arena and Aider Polyglot will catch up over the following week. Treat everything as early-stage.
Hardware reality for local. V4-Pro at 1.6T is not a homelab story โ ~800GB at Q4, plus KV cache, plus activation memory. V4-Flash at 284B/13B active is ~150GB of weights. That’s still not laptop territory, but it’s in range for:
- Two RTX 6000 Ada at 48GB each
- Mac Studio M3 Ultra with 512GB unified memory
- A ThreadRipper box with enough DDR5 channels for llama.cpp hybrid offload
No community GGUFs or Unsloth Dynamic 2.0 quants as of April 24. vLLM supports the native FP4/FP8 checkpoints โ see the full breakdown in the DeepSeek V4 Flash vs Pro guide.
How to use it. Test Flash through the DeepSeek API first. $0.14/M input is low enough to run a full week of coding agent work for a few dollars. If it outperforms your current Haiku setup, and you have the hardware to run it locally for privacy, then download weights. For most readers, API-first is the honest answer. Local V4-Flash is a homelab story, not a consumer-GPU story.
Why Code Locally?
Four reasons developers are switching, and they got stronger in April 2026:
Your code stays private. Copilot, ChatGPT, and Claude all route through corporate servers. Local models don’t. If you’re working on proprietary code, client projects, or anything under an NDA, local is the only correct answer.
No recurring costs, no policy shifts. Anthropic just proved the point. April 4, 2026: Claude Pro subscriptions no longer cover OpenClaw and other third-party harnesses. If you were running agents on a flat-rate plan, your costs jumped overnight. Local models are free after the hardware, and no email is going to tell you next month that your workflow is no longer covered.
Works offline. Planes, coffee shops with bad WiFi, air-gapped environments, ISP outages. Local models don’t care.
No rate limits. No throttling during peak hours, no tier downgrades, no feature removals without notice. You control the version and the config.
What Makes a Good Coding Model
Four traits matter:
Code completion (FIM). Fill-in-the-middle support means the model can complete code given both the text before and after the cursor. This is what powers inline autocomplete. Qwen 2.5-Coder and DeepSeek-Coder support FIM. Qwen 3.5 and 3.6 general-purpose models do not โ they’re for chat and agentic coding, not tab-complete.
Instruction following. “Refactor this function,” “write tests for this class.” The model needs to follow natural-language instructions about code precisely.
Multi-language support. Python, JavaScript/TypeScript, Rust, Go, C++, SQL. Most developers touch at least three.
Context window that matters. 8K is a floor, 32K+ is what you want, 262K (Qwen 3.6) or 1M (DeepSeek V4) stops being a spec sheet number and becomes useful when you want to hand a model an entire module.
The Benchmarks That Matter
| Benchmark | What It Tests | Why It Matters |
|---|---|---|
| SWE-bench Verified | Real GitHub issues resolved end-to-end | The agentic coding benchmark that matters now |
| HumanEval / HumanEval+ | Generate correct Python functions from docstrings | Standard for code generation; HumanEval+ catches edge-case failures |
| LiveCodeBench v6 | Coding-contest problems with code-repair | Tests harder, realistic tasks |
| Terminal-Bench 2.0 | End-to-end terminal task completion | Tests tool use, not just code gen |
| Aider benchmark | Code editing across a real repo | Matches how people actually use coding agents |
SWE-bench Verified is the number that has replaced HumanEval in 2026 discussions. HumanEval saturated. SWE-bench didn’t.
Best Models by VRAM Tier
8GB VRAM (RTX 4060, 3070, 3060 Ti)
This is the most common GPU tier. The 7B Qwen 2.5-Coder still owns FIM autocomplete here. For chat-based coding, Qwen 3.5-9B is the best fit at this size โ the 27B and 35B-A3B are out of reach without heavy offload.
| Model | HumanEval | FIM | Context | VRAM (Q4) | Best For |
|---|---|---|---|---|---|
| Qwen 2.5 Coder 7B | 88.4% | Yes | 128K | ~5 GB | Best for autocomplete/FIM |
| Qwen 3.5 9B โญ | โ | No | 262K | 6.6 GB | Best for chat coding, multimodal |
| DeepSeek Coder V2 Lite | 81.1% | Yes | 128K | ~5 GB | Reasoning-heavy tasks |
| DeepSeek Coder 6.7B | ~65% | Yes | 16K | ~4.5 GB | Lightweight, fast |
| CodeLlama 7B | ~30% | Yes | 16K | ~4.5 GB | Legacy. Skip. |
For autocomplete: Qwen 2.5-Coder 7B. Still the FIM king. 88.4% HumanEval at 7B. FIM support, 128K context, 92+ languages. Nothing at 7B has displaced it.
ollama pull qwen2.5-coder:7b
For chat coding: Qwen 3.5-9B. Native multimodal, 262K context, thinking mode. 6.6GB at Q4 on Ollama. Run Qwen 2.5-Coder 7B for tab-complete and swap to 3.5-9B for “refactor this module” conversations. They don’t need to run simultaneously.
ollama run qwen3.5:9b
If you want Qwen 3.6-35B-A3B at this tier, you need 64GB+ of system RAM and patience for llama.cpp hybrid offload. Usable for agent loops, not for tab-complete.
16GB VRAM (RTX 3060 12GB, 5060 Ti 16GB, 4060 Ti 16GB)
This is where Qwen 3.6-35B-A3B becomes the new default. Before 3.6, the 16GB answer was Qwen 2.5-Coder 14B for autocomplete and maybe a 27B squeezed with heavy quantization. Now you can run a 35B MoE at UD-Q3_K_M (16.6GB) or UD-Q4_K_M (22.1GB with KV offload).
| Model | SWE-bench / HumanEval | FIM | Context | VRAM (Q4) | Best For |
|---|---|---|---|---|---|
| Qwen 3.6 35B-A3B โญ | 73.4% SWE-bench V | No | 262K | ~16-22 GB | New default at this tier |
| Qwen 2.5 Coder 14B | 89% HumanEval | Yes | 128K | ~9 GB | Best FIM at 16GB |
| Qwen 3.5 Coder Next (if tool-calling) | โ | Yes | 256K | ~16 GB | When tool calls matter |
| DeepSeek Coder 33B (Q3) | 70% HumanEval | Yes | 16K | ~16 GB | Legacy squeeze |
The play at 16GB: Qwen 2.5-Coder 14B for tab-complete, Qwen 3.6-35B-A3B for chat and agentic coding. The Coder 14B needs only ~9GB so both can coexist if you’re careful with context.
ollama pull qwen2.5-coder:14b
ollama pull qwen3.6:35b-a3b
24GB VRAM (RTX 3090, 4090, 5090)
The new default here is Qwen 3.6-27B dense. A used RTX 3090 at $700-850 now runs a model that Qwen claims matches Sonnet 4.6 on SWE-bench Verified.
| Model | SWE-bench / HumanEval | FIM | Context | VRAM (Q4) | Best For |
|---|---|---|---|---|---|
| Qwen 3.6 27B โญ | 77.2% SWE-bench V | No | 262K | ~17 GB | New default; agentic coding |
| Qwen 3.6 35B-A3B | 73.4% SWE-bench V | No | 262K | ~22 GB | Faster; general-purpose |
| Qwen 2.5 Coder 32B | 92.7% HumanEval | Yes | 128K | ~20 GB | Best FIM at 24GB |
| Qwen 3.5 27B | 72.4% SWE-bench V | No | 262K | ~16 GB | Previous-gen, still solid |
| DeepSeek Coder 33B | 70% HumanEval | Yes | 16K | ~20 GB | Older but workable |
For agentic coding: Qwen 3.6-27B. Per the model card, SWE-bench Verified 77.2, Terminal-Bench 2.0 59.3, LiveCodeBench v6 83.9. Fits at ~17GB Q4, leaves room for big context on a 24GB card. No FIM โ pair it with the 2.5-Coder 32B for that.
ollama pull qwen3.6:27b
For autocomplete: Qwen 2.5-Coder 32B. Still the FIM king at 24GB. 92.7% HumanEval, 73.7 on Aider, 128K context, ~20GB at Q4.
ollama pull qwen2.5-coder:32b
For speed and general-purpose: Qwen 3.6-35B-A3B. 101 tok/s on a 3090 per Amine Raji’s benchmark. If you’re running mixed chat + coding + general queries and don’t want to swap models, this is the one-model setup. Note: ~22GB at Q4 means you won’t fit it alongside another model on a 24GB card.
The practical 24GB layout: Qwen 2.5-Coder 32B for FIM, Qwen 3.6-27B for chat and agentic work. Swap between them. They don’t need to run simultaneously.
32GB+ / 48GB+ / Multi-GPU
At 32GB, everything opens up. Qwen 3.6-27B at Q6 or Q8 for higher quality, plus room for a FIM model alongside. Gemma 4-31B holds up as an alternative for specific language strengths. GLM 4.7 Flash is worth testing if you have access.
At 48GB+ (or a Mac with 64GB+ unified memory), Qwen3-Coder-Next is the agentic play โ 80B MoE, 3B active, SWE-rebench Pass@5 at 64.6% (#1 overall as of release). 256K context extendable to 1M via YaRN. ~35-40GB at Q4.
ollama pull qwen3-coder-next
Multi-GPU (two RTX 6000 Ada at 48GB, dual 3090s on NVLink, Mac Studio M3 Ultra 512GB): DeepSeek V4-Flash becomes a real option. 284B/13B active, 1M context, MIT, Haiku-tier API pricing for testing first. Also the tier where vLLM with Qwen 3.6-27B starts making sense for multi-user serving.
Apple Silicon
Mac gets its own guide: Best Local LLMs for Mac in 2026. Short version: M4 Pro 64GB runs Qwen 3.6-27B comfortably on MLX. M3 Ultra 512GB is the only consumer machine that runs DeepSeek V4-Flash locally without offload. For coding specifically, the M-series answer tracks the NVIDIA answer โ 3.6-27B for agentic work, 2.5-Coder for FIM.
The Master Comparison
Every model side by side:
| Model | Params | SWE-bench V | HumanEval | VRAM (Q4) | FIM | Context | License |
|---|---|---|---|---|---|---|---|
| DeepSeek V4-Flash | 284B (13B active) | โ (claimed strong) | โ | ~150 GB | No | 1M | MIT |
| Qwen3-Coder-Next | 80B (3B active) | 70.6% (rebench P@5: 64.6%) | โ | ~38 GB | No | 256K | Apache 2.0 |
| Qwen 3.6 35B-A3B โญ | 35B (3B active) | 73.4% | โ | ~22 GB | No | 262K | Apache 2.0 |
| Qwen 3.6 27B โญ | 27B | 77.2% | โ | ~17 GB | No | 262K | Apache 2.0 |
| Qwen 3.5 35B-A3B | 35B (3B active) | 69.2% | โ | ~20 GB | No | 262K | Apache 2.0 |
| Qwen 3.5 27B | 27B | 72.4% | โ | ~16 GB | No | 262K | Apache 2.0 |
| Qwen 2.5 Coder 32B | 32B | โ | 92.7% | ~20 GB | Yes | 128K | Apache 2.0 |
| Qwen 2.5 Coder 14B | 14B | โ | ~89% | ~9 GB | Yes | 128K | Apache 2.0 |
| Qwen 2.5 Coder 7B | 7B | โ | 88.4% | ~5 GB | Yes | 128K | Apache 2.0 |
| Qwen 3.5 9B | 9B | โ | โ | 6.6 GB | No | 262K | Apache 2.0 |
| Qwen 3.5 4B | 4B | โ | โ | 3.4 GB | No | 262K | Apache 2.0 |
| DS Coder V2 Lite | 16B (2.4B active) | โ | 81.1% | ~5 GB | Yes | 128K | MIT |
| DS Coder 33B (V1) | 33B | โ | 70% | ~20 GB | Yes | 16K | Permissive |
| CodeLlama 34B | 34B | โ | 53.7% | ~20 GB | Yes | 16K | Llama |
SWE-bench Verified numbers for Qwen 3.6 and 3.5 models are from the respective model cards on Hugging Face. HumanEval numbers are from the Qwen 2.5-Coder and DeepSeek-Coder announcements. Treat model-card numbers as “what the authors claim” until you see independent runs.
โ Check what fits your hardware with our Planning Tool.
Which Should You Actually Use?
The decision tree for April 2026:
Daily coding on a 24GB GPU: Qwen 3.6-27B for chat, agentic work, and refactors. Qwen 2.5-Coder 32B for tab-complete if you want FIM. Swap between them.
Constrained hardware (16GB VRAM, or 8GB + 64GB RAM): Qwen 3.6-35B-A3B. Runs on the MoE activation pattern at usable speeds even with RAM offload. Pair with Qwen 2.5-Coder 14B for FIM if you have VRAM headroom.
Agentic Claude Code replacement: Community consensus from the week of April 22 is Qwen 3.6-27B dense if you have 24GB+, Qwen 3.6-35B-A3B if you don’t. Point Claude Code, OpenCode, or Aider at a local endpoint and go.
Tool-calling-heavy workflows: Test DeepSeek V4-Flash via the DeepSeek API first ($0.14/M input makes it cost-free to evaluate). If it works for your pipeline and you have the hardware, run weights locally. Otherwise stay on API.
Maximum privacy, fully local, nothing leaving the machine: Qwen 3.6-27B on 24GB, Qwen 3.6-35B-A3B on 16GB. Apache 2.0 both, weights on disk, no cloud calls, no policy changes.
FIM tab-complete specifically: Qwen 2.5-Coder at whatever size fits your card. Nothing has displaced it. Qwen 3.x is not a FIM family.
Switching from Claude Code to Local
April 4, 2026 was the breaking point for a lot of people. Anthropic’s policy change on third-party harnesses pushed OpenClaw users off flat-rate Claude Pro and onto per-token billing. Claude Code itself remains covered by the subscription, but the broader effect was a wave of “what’s the local story now?” posts on r/LocalLLaMA.
Practical setup:
Harness: Claude Code (if still accessible to you) or OpenCode pointed at a local OpenAI-compatible endpoint. Aider works too โ the Claude Code alternatives guide covers the tradeoffs.
Backend: Ollama for simplicity, llama.cpp for control, vLLM for multi-user or long-context throughput. See llama.cpp vs Ollama vs vLLM for which fits your setup.
Model: Qwen 3.6-27B if you have 24GB+. Qwen 3.6-35B-A3B if you don’t. Both are Apache 2.0, both support tool-calling out of the box through the qwen3_coder tool-call parser in SGLang and vLLM.
Honest framing: Local won’t match frontier cloud for the hardest multi-file cross-repo refactors. It will handle the 80%+ of day-to-day coding work โ single-file edits, bug fixes, test generation, boilerplate, explanation, refactoring โ based on the community reports from the week of April 22. That’s the realistic bar. If your bar is “replaces Claude Opus for the worst case,” you’re not there yet on consumer hardware.
Best Model by Language
All models above are multi-language, but some have particular strengths.
| Language | Best Local Model | Notes |
|---|---|---|
| Python | Qwen 2.5-Coder (any size) or Qwen 3.6-27B | Python is the best-benchmarked language across every coding model |
| JavaScript/TypeScript | Qwen 2.5-Coder 14B+ or Qwen 3.6-27B | Strong JS/TS; 3.6-27B handles modern TS generics well |
| Rust | Qwen 3.6-27B or Qwen 2.5-Coder 32B | Smaller models struggle with borrow checker; dense 27B handles it |
| Go | Qwen 2.5-Coder 14B+ | Clean Go output from 14B up |
| C/C++ | DeepSeek Coder 33B or Qwen 3.6-27B | Strong low-level memory patterns |
| Java | Qwen 2.5-Coder 14B+ | Good boilerplate generation |
| SQL | Qwen 2.5-Coder (any size) | 82% on Spider โ well ahead of competitors |
The honest caveat: For Python and JavaScript, the 7B Qwen 2.5-Coder is genuinely excellent. For Rust, C++, and other complex compiled languages, bigger models produce noticeably better results. If Rust is your primary language and you only have 8GB, expect some friction.
How to Set Up Local Coding in Your Editor
Option 1: VS Code + Ollama + Continue
This is the free, open-source Copilot replacement.
Step 1: Install Ollama
Follow the Ollama setup guide. One command on any OS.
Step 2: Pull your coding model
# Pick your tier:
ollama pull qwen2.5-coder:7b # 8GB VRAM (FIM)
ollama pull qwen2.5-coder:14b # 16GB VRAM (FIM)
ollama pull qwen3.6:35b-a3b # 16GB+ (chat/agentic)
ollama pull qwen3.6:27b # 24GB+ (chat/agentic)
Step 3: Install Continue extension
Open VS Code โ Extensions (Ctrl+Shift+X) โ Search “Continue” โ Install.
Step 4: Configure Continue
Edit ~/.continue/config.yaml:
name: Local Coding
version: 0.0.1
schema: v1
models:
- uses: ollama/qwen3.6-27b
- uses: ollama/qwen2.5-coder-14b
role: autocomplete
Step 5: Code
- Chat: Click the Continue icon, ask questions about your code
- Autocomplete: Start typing
- Edit: Select code, press Ctrl+I, describe the change
Option 2: Aider (Terminal)
Aider lives in your terminal, edits files in your git repo, and auto-commits.
pip install aider-chat
aider --model ollama/qwen3.6:27b
Best for developers who live in the terminal. Mature, well-documented, and handles repo-wide context through a repo map.
Option 3: OpenCode
OpenCode is the Go-based terminal agent that picked up a lot of ex-Claude Code users after April 4. Multi-editor support. Point it at a local OpenAI-compatible endpoint.
Option 4: Claude Code Against a Local Endpoint
Several r/LocalLLaMA posts from late April 2026 report running Claude Code against a local model via an OpenAI-compatible proxy. Quality varies by harness version. Works best with Qwen 3.6-27B at Q4_K_M or better.
Practical Tips
FIM vs. Chat: Know the Difference
FIM (Fill-in-the-Middle) powers inline autocomplete โ the cursor sits in the middle of your code and the model predicts what goes there. Qwen 2.5-Coder and DeepSeek-Coder support FIM. Qwen 3.5 and 3.6 do not โ they’re for chat and agentic coding.
Chat mode is for conversations, refactors, and multi-step agentic work. Most editors let you use both simultaneously.
Sampling Parameters Matter
Qwen 3.6’s card recommends specific settings for coding in thinking mode: temperature 0.6, top_p 0.95, top_k 20, presence_penalty 0.0. For general thinking-mode tasks: temperature 1.0, top_p 0.95, presence_penalty 0.0. For instruct (non-thinking) mode: temperature 0.7, top_p 0.80, presence_penalty 1.5. Ignore these and you get worse output. Full guidance on the Qwen 3.6 model card.
Keep a Small Model for Autocomplete
Autocomplete needs to be fast โ under 200ms ideally. On 8GB, Qwen 2.5-Coder 7B handles both chat and FIM fine. On 16-24GB, run a smaller model (Qwen 2.5-Coder 1.5B or 3B) for autocomplete and a bigger one for chat.
Context Window vs. VRAM
Bigger context eats VRAM. On 8GB, stick to 4-8K for coding. On 16GB, push to 16-32K. On 24GB, 64K+ is comfortable. Qwen 3.6’s 262K context is a ceiling, not a target โ loading it all consumes a lot of KV cache.
For navigating large codebases, quantization at Q4_K_S instead of Q4_K_M saves a few hundred MB that can go toward context.
Close Your Browser
Chrome’s hardware acceleration eats GPU memory. Close it or disable GPU acceleration when running local models on 8-16GB cards.
The Bottom Line
Local coding now covers three distinct jobs, and the answers shifted hard in late April 2026:
- Autocomplete/FIM: Qwen 2.5-Coder (7B/14B/32B) โ still unmatched at every tier for tab-complete. The 3.6 family is not a FIM family.
- Chat and agentic coding: Qwen 3.6-27B dense if you have 24GB+, Qwen 3.6-35B-A3B if you have 16GB or want RAM offload. Both Apache 2.0, both 262K context, both native tool-calling.
- Frontier-adjacent cloud testing: DeepSeek V4-Flash via API at $0.14/M input. Local only if you have serious homelab hardware.
The 24GB setup for April 2026:
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# FIM autocomplete
ollama pull qwen2.5-coder:32b
# Chat + agentic coding
ollama pull qwen3.6:27b
# Install Continue in VS Code, point both models at it, and code
No subscriptions. No policy changes that cut you off next week. No tokens leaving your machine. Just you, your code, and models that genuinely match what cloud assistants did a year ago.
Related Guides
- Qwen 3.6 Complete Guide
- DeepSeek V4 Flash vs Pro
- Local Claude Code Alternatives
- Qwen 3.5 Small Models: 9B Beats 30B
- llama.cpp vs Ollama vs vLLM
- Best Local LLMs for Mac in 2026
- VRAM Requirements for Local LLMs
- GPU Buying Guide for Local AI
Sources: Qwen3.6-27B Model Card, Qwen3.6-35B-A3B Model Card, Unsloth Qwen3.6-27B GGUF, Unsloth Qwen3.6-35B-A3B GGUF, DeepSeek V4-Pro Model Card, DeepSeek API Pricing, Artificial Analysis Intelligence Index, Amine Raji llama.cpp Qwen 3.6 benchmark, Simon Willison, Qwen2.5-Coder Technical Report, Qwen3-Coder-Next Model Card, SWE-rebench Leaderboard, Continue.dev Ollama Guide, EvalPlus Leaderboard
Get notified when we publish new guides.
Subscribe โ free, no spam