๐Ÿ“š More on this topic: Qwen 3.6 Complete Guide ยท DeepSeek V4 Flash vs Pro ยท Local Claude Code Alternatives ยท llama.cpp vs Ollama vs vLLM ยท VRAM Requirements

GitHub Copilot costs $10-19/month. ChatGPT Plus is $20. Claude Pro is $20, and as of April 4, 2026, it no longer covers third-party agent harnesses like OpenClaw โ€” Anthropic pushed those users to per-token API billing (context here). That change pushed a wave of “switch to local” posts on r/LocalLLaMA, right as the local options got a lot better.

Two releases inside 48 hours reset this list. Qwen3.6-27B dense landed April 22 with Qwen claiming “flagship-level agentic coding” in a 27B model. DeepSeek V4-Flash landed the evening of April 23 โ€” 284B total / 13B active, MIT-licensed, priced at Haiku-tier rates through the DeepSeek API. Add the mid-April Qwen3.6-35B-A3B MoE that runs on a 16GB card with RAM offload and the picture looks different than it did three weeks ago.

This guide covers which model to run at every VRAM tier, what the benchmarks say vs what community users report, and how to wire a local coding stack up in your editor.


What’s New (May 2026)

The April 24 update brought Qwen 3.6 and DeepSeek V4 into this list. Two more weeks of release cadence have not slowed down. Here’s what changed since.

Gemma 4 26B-A4B MoE. Google’s first MoE that actually competes for the local-coding seat. 26B total, 4B active, 1441 LMArena. An r/LocalLLaMA thread today clocked a 5090 user at 600 tok/s on this model with vLLM and aggressive batching. The hardware fit is what makes it interesting: 4B active means the inference math is cheap, and decoupled-attention runtimes like LARQL run it on hardware that should not handle 26B weights. The Gemma 4 local guide covers VRAM and quants.

DeepSeek V4 cost economics got real. V4 Pro hit FoodTruck Bench on May 5 at 17x cheaper than GPT-5.2 for the same evaluation pass. V4 Flash remains the low-latency pick for tool-calling agent loops. The V4 Flash vs Pro guide has the breakdown.

Speculative decoding moved from research to production. I built and benched DFlash on my own RTX 3090 at 2.56x mean speedup on Qwen 3.6-27B Q4_K_M. Mainline llama.cpp added MTP for Qwen 3.6 via PR #22673 on May 4. The DFlash vs MTP head-to-head covers which fits which workload.

The community verdict. A May 8 r/ollama thread directly compared Qwen 3.6, Qwen3-Coder, and DeepSeek-Coder. The Qwen 3.6 family won across the board on local hardware. The 27B dense for tool-using coding agents, the 35B-A3B MoE for general-purpose work on tighter VRAM. If you’re not running Qwen, DeepSeek, or Gemma 4 in 2026, you’re behind the current generation.

The body of this article still maps out the per-tier picks. The Qwen 3.6 + DeepSeek V4 + Gemma 4 trio is the headline answer.


The Qwen 3.6-27B Moment

Qwen dropped Qwen3.6-27B on April 22, 2026, and labeled it, in their own words, “flagship-level agentic coding performance, surpassing the previous-generation open-source flagship.” The previous flagship was the 397B-A17B MoE. A 27B dense model claiming to beat a 397B MoE on coding is either marketing or a real step change. The early returns suggest a real step change, with caveats.

Qwen’s own benchmarks (per the Qwen3.6-27B model card):

BenchmarkQwen3.6-27B
SWE-bench Verified77.2
SWE-bench Pro53.5
SWE-bench Multilingual71.3
Terminal-Bench 2.059.3
LiveCodeBench v683.9
MMLU-Pro86.2

SWE-bench Verified at 77.2 is the headline. That puts the 27B in the same range as Claude Sonnet 4.6 on real-world GitHub-issue resolution, per Qwen’s own comparison. Independent AA Agentic Index placement is still settling โ€” the Artificial Analysis Intelligence Index currently shows Claude Opus 4.7 at 57 and DeepSeek V4 Pro at 52 in the top 10, with Qwen 3.6-27B not yet ranked there. Treat Qwen’s own numbers as the model card, not as independent verification.

What the community actually saw. Simon Willison’s Qwen 3.6-27B post benchmarked the Q4_K_M GGUF (16.8GB) on llama-server at 65K context: 25.57 tok/s generating a pelican-on-a-bicycle SVG, which he called “an outstanding result for a 16.8GB local model.” r/LocalLLaMA posts from the week of April 22 report about 50 tok/s on an RTX 5090 at Q6_K with 200K context loaded. An RTX 5080 at 16GB with aggressive quantization comes in closer to 6 tok/s โ€” usable, not fast.

Claude Code / OpenCode compatibility. Multiple r/LocalLLaMA threads in the week after release report pointing Claude Code and OpenCode at a local Qwen3.6-27B endpoint and getting genuinely useful agentic runs โ€” “vibe codes perfectly fine” in the phrasing one commenter used. That’s week-one harness feedback, not a verdict, but the pattern is consistent across posts.

Honest caveats. The r/LocalLLaMA reaction split. Skeptics pointed at the pelican benchmark and asked whether Qwen trained on the test. Hacker News commenters raised Goodhart’s Law concerns and flagged tool-use failures where the model repeats failed actions without reading back context. Dense architecture means higher VRAM cost per unit capability than an MoE at the same capability class. And the sampling parameters matter โ€” a lot. Qwen’s card recommends temperature 0.6, top_p 0.95, top_k 20, presence_penalty 0.0 for “precise coding” in thinking mode. Ignore those and you get a different model.

Bottom line on 3.6-27B. Strong at what it was trained for. That covers most of what people actually use a coding model for. Not a drop-in frontier-cloud replacement for the hardest cases. For a 24GB GPU, it’s the new default.

ollama pull qwen3.6:27b

Qwen 3.6-35B-A3B: Runs on 16GB

The MoE sibling shipped mid-April. 35B total parameters, 3B active per token, 256 experts with 8 routed + 1 shared active per token. Apache 2.0, 262K context, same hybrid DeltaNet + Attention stack as the 27B. The story here is the VRAM math: the model loads 35B of weights into memory but only fires 3B on each token, so throughput is fast and the memory footprint is manageable.

Qwen’s own benchmarks (per the Qwen3.6-35B-A3B model card):

BenchmarkQwen3.6-35B-A3B
SWE-bench Verified73.4
SWE-bench Pro49.5
Terminal-Bench 2.051.5
LiveCodeBench v680.4

Same caveat as above: Qwen’s numbers. Directionally matches the early community reports.

What it runs on. Per the Unsloth Qwen3.6-35B-A3B GGUF card:

QuantFile size
UD-IQ1_M10 GB
UD-Q3_K_M16.6 GB
UD-Q4_K_M22.1 GB
Q8_036.9 GB

A 24GB card runs UD-Q4_K_M clean. A 16GB card runs UD-Q3_K_M clean or UD-Q4_K_M with KV-cache offload to RAM. An 8GB card plus 64GB of system RAM runs it via llama.cpp hybrid offload โ€” slower, but usable for agent loops per Unsloth’s docs.

Real throughput. Amine Raji’s benchmark on an RTX 3090 at UD-Q4_K_XL: 101.7 tok/s short-prompt, 80.9 tok/s long-prompt. That’s about 30% slower than 3.5-35B-A3B on the same card โ€” a Gated DeltaNet gap in current llama.cpp, not a model regression. Expect it to close in the next few llama.cpp releases.

Where it disappoints. Community reports have been direct: on long multi-step agent runs, the MoE variant has been described as “getting lost as the task requires more steps.” Cross-turn consistency under tight system prompts has drifted in some harnesses. The structural explanation: tokens route to different experts, each handles its slice well, but the whole-task coherence softens. For dense-model strictness on multi-step agentic work, the 27B is the better pick if you have the VRAM.

ollama pull qwen3.6:35b-a3b

DeepSeek V4-Flash for Coding

DeepSeek V4 preview dropped the evening of April 23, 2026 as two MoE checkpoints. V4-Pro at 1.6T / 49B active is workstation-or-server hardware territory. V4-Flash at 284B / 13B active is the story for everyone else. Both are MIT-licensed. Both carry a 1M-token context window. Both use FP4 for MoE experts and FP8 elsewhere.

DeepSeek API pricing, per DeepSeek’s official pricing page:

ModelInput (cache miss)Input (cache hit)Output
V4-Flash$0.14/M$0.028/M$0.28/M
V4-Pro$1.74/M$0.145/M$3.48/M

$0.14 per million input tokens is Haiku-tier. For agents and tool-calling pipelines that burn tokens fast, this is basically free by frontier standards.

Early community reports. An r/LocalLLaMA thread titled “Tested Deepseek v4 flash with some large code change evals” (posted within 24 hours of release) reports Flash handles multi-file refactors in the same ballpark as Claude Haiku with the advantage of a 1M context window for whole-repo tasks. Vals AI’s Vibe Code Benchmark showed V4 “overwhelmingly” topping open-source models with roughly a 10x jump from V3.2 โ€” Vals AI’s framing, not an independent re-run. LM Arena and Aider Polyglot will catch up over the following week. Treat everything as early-stage.

Hardware reality for local. V4-Pro at 1.6T is not a homelab story โ€” ~800GB at Q4, plus KV cache, plus activation memory. V4-Flash at 284B/13B active is ~150GB of weights. That’s still not laptop territory, but it’s in range for:

  • Two RTX 6000 Ada at 48GB each
  • Mac Studio M3 Ultra with 512GB unified memory
  • A ThreadRipper box with enough DDR5 channels for llama.cpp hybrid offload

No community GGUFs or Unsloth Dynamic 2.0 quants as of April 24. vLLM supports the native FP4/FP8 checkpoints โ€” see the full breakdown in the DeepSeek V4 Flash vs Pro guide.

How to use it. Test Flash through the DeepSeek API first. $0.14/M input is low enough to run a full week of coding agent work for a few dollars. If it outperforms your current Haiku setup, and you have the hardware to run it locally for privacy, then download weights. For most readers, API-first is the honest answer. Local V4-Flash is a homelab story, not a consumer-GPU story.


Why Code Locally?

Four reasons developers are switching, and they got stronger in April 2026:

Your code stays private. Copilot, ChatGPT, and Claude all route through corporate servers. Local models don’t. If you’re working on proprietary code, client projects, or anything under an NDA, local is the only correct answer.

No recurring costs, no policy shifts. Anthropic just proved the point. April 4, 2026: Claude Pro subscriptions no longer cover OpenClaw and other third-party harnesses. If you were running agents on a flat-rate plan, your costs jumped overnight. Local models are free after the hardware, and no email is going to tell you next month that your workflow is no longer covered.

Works offline. Planes, coffee shops with bad WiFi, air-gapped environments, ISP outages. Local models don’t care.

No rate limits. No throttling during peak hours, no tier downgrades, no feature removals without notice. You control the version and the config.


What Makes a Good Coding Model

Four traits matter:

Code completion (FIM). Fill-in-the-middle support means the model can complete code given both the text before and after the cursor. This is what powers inline autocomplete. Qwen 2.5-Coder and DeepSeek-Coder support FIM. Qwen 3.5 and 3.6 general-purpose models do not โ€” they’re for chat and agentic coding, not tab-complete.

Instruction following. “Refactor this function,” “write tests for this class.” The model needs to follow natural-language instructions about code precisely.

Multi-language support. Python, JavaScript/TypeScript, Rust, Go, C++, SQL. Most developers touch at least three.

Context window that matters. 8K is a floor, 32K+ is what you want, 262K (Qwen 3.6) or 1M (DeepSeek V4) stops being a spec sheet number and becomes useful when you want to hand a model an entire module.

The Benchmarks That Matter

BenchmarkWhat It TestsWhy It Matters
SWE-bench VerifiedReal GitHub issues resolved end-to-endThe agentic coding benchmark that matters now
HumanEval / HumanEval+Generate correct Python functions from docstringsStandard for code generation; HumanEval+ catches edge-case failures
LiveCodeBench v6Coding-contest problems with code-repairTests harder, realistic tasks
Terminal-Bench 2.0End-to-end terminal task completionTests tool use, not just code gen
Aider benchmarkCode editing across a real repoMatches how people actually use coding agents

SWE-bench Verified is the number that has replaced HumanEval in 2026 discussions. HumanEval saturated. SWE-bench didn’t.


Best Models by VRAM Tier

8GB VRAM (RTX 4060, 3070, 3060 Ti)

This is the most common GPU tier. The 7B Qwen 2.5-Coder still owns FIM autocomplete here. For chat-based coding, Qwen 3.5-9B is the best fit at this size โ€” the 27B and 35B-A3B are out of reach without heavy offload.

ModelHumanEvalFIMContextVRAM (Q4)Best For
Qwen 2.5 Coder 7B88.4%Yes128K~5 GBBest for autocomplete/FIM
Qwen 3.5 9B โญโ€”No262K6.6 GBBest for chat coding, multimodal
DeepSeek Coder V2 Lite81.1%Yes128K~5 GBReasoning-heavy tasks
DeepSeek Coder 6.7B~65%Yes16K~4.5 GBLightweight, fast
CodeLlama 7B~30%Yes16K~4.5 GBLegacy. Skip.

For autocomplete: Qwen 2.5-Coder 7B. Still the FIM king. 88.4% HumanEval at 7B. FIM support, 128K context, 92+ languages. Nothing at 7B has displaced it.

ollama pull qwen2.5-coder:7b

For chat coding: Qwen 3.5-9B. Native multimodal, 262K context, thinking mode. 6.6GB at Q4 on Ollama. Run Qwen 2.5-Coder 7B for tab-complete and swap to 3.5-9B for “refactor this module” conversations. They don’t need to run simultaneously.

ollama run qwen3.5:9b

If you want Qwen 3.6-35B-A3B at this tier, you need 64GB+ of system RAM and patience for llama.cpp hybrid offload. Usable for agent loops, not for tab-complete.

16GB VRAM (RTX 3060 12GB, 5060 Ti 16GB, 4060 Ti 16GB)

This is where Qwen 3.6-35B-A3B becomes the new default. Before 3.6, the 16GB answer was Qwen 2.5-Coder 14B for autocomplete and maybe a 27B squeezed with heavy quantization. Now you can run a 35B MoE at UD-Q3_K_M (16.6GB) or UD-Q4_K_M (22.1GB with KV offload).

ModelSWE-bench / HumanEvalFIMContextVRAM (Q4)Best For
Qwen 3.6 35B-A3B โญ73.4% SWE-bench VNo262K~16-22 GBNew default at this tier
Qwen 2.5 Coder 14B89% HumanEvalYes128K~9 GBBest FIM at 16GB
Qwen 3.5 Coder Next (if tool-calling)โ€”Yes256K~16 GBWhen tool calls matter
DeepSeek Coder 33B (Q3)70% HumanEvalYes16K~16 GBLegacy squeeze

The play at 16GB: Qwen 2.5-Coder 14B for tab-complete, Qwen 3.6-35B-A3B for chat and agentic coding. The Coder 14B needs only ~9GB so both can coexist if you’re careful with context.

ollama pull qwen2.5-coder:14b
ollama pull qwen3.6:35b-a3b

24GB VRAM (RTX 3090, 4090, 5090)

The new default here is Qwen 3.6-27B dense. A used RTX 3090 at $700-850 now runs a model that Qwen claims matches Sonnet 4.6 on SWE-bench Verified.

ModelSWE-bench / HumanEvalFIMContextVRAM (Q4)Best For
Qwen 3.6 27B โญ77.2% SWE-bench VNo262K~17 GBNew default; agentic coding
Qwen 3.6 35B-A3B73.4% SWE-bench VNo262K~22 GBFaster; general-purpose
Qwen 2.5 Coder 32B92.7% HumanEvalYes128K~20 GBBest FIM at 24GB
Qwen 3.5 27B72.4% SWE-bench VNo262K~16 GBPrevious-gen, still solid
DeepSeek Coder 33B70% HumanEvalYes16K~20 GBOlder but workable

For agentic coding: Qwen 3.6-27B. Per the model card, SWE-bench Verified 77.2, Terminal-Bench 2.0 59.3, LiveCodeBench v6 83.9. Fits at ~17GB Q4, leaves room for big context on a 24GB card. No FIM โ€” pair it with the 2.5-Coder 32B for that.

ollama pull qwen3.6:27b

For autocomplete: Qwen 2.5-Coder 32B. Still the FIM king at 24GB. 92.7% HumanEval, 73.7 on Aider, 128K context, ~20GB at Q4.

ollama pull qwen2.5-coder:32b

For speed and general-purpose: Qwen 3.6-35B-A3B. 101 tok/s on a 3090 per Amine Raji’s benchmark. If you’re running mixed chat + coding + general queries and don’t want to swap models, this is the one-model setup. Note: ~22GB at Q4 means you won’t fit it alongside another model on a 24GB card.

The practical 24GB layout: Qwen 2.5-Coder 32B for FIM, Qwen 3.6-27B for chat and agentic work. Swap between them. They don’t need to run simultaneously.

32GB+ / 48GB+ / Multi-GPU

At 32GB, everything opens up. Qwen 3.6-27B at Q6 or Q8 for higher quality, plus room for a FIM model alongside. Gemma 4-31B holds up as an alternative for specific language strengths. GLM 4.7 Flash is worth testing if you have access.

At 48GB+ (or a Mac with 64GB+ unified memory), Qwen3-Coder-Next is the agentic play โ€” 80B MoE, 3B active, SWE-rebench Pass@5 at 64.6% (#1 overall as of release). 256K context extendable to 1M via YaRN. ~35-40GB at Q4.

ollama pull qwen3-coder-next

Multi-GPU (two RTX 6000 Ada at 48GB, dual 3090s on NVLink, Mac Studio M3 Ultra 512GB): DeepSeek V4-Flash becomes a real option. 284B/13B active, 1M context, MIT, Haiku-tier API pricing for testing first. Also the tier where vLLM with Qwen 3.6-27B starts making sense for multi-user serving.

Apple Silicon

Mac gets its own guide: Best Local LLMs for Mac in 2026. Short version: M4 Pro 64GB runs Qwen 3.6-27B comfortably on MLX. M3 Ultra 512GB is the only consumer machine that runs DeepSeek V4-Flash locally without offload. For coding specifically, the M-series answer tracks the NVIDIA answer โ€” 3.6-27B for agentic work, 2.5-Coder for FIM.


The Master Comparison

Every model side by side:

ModelParamsSWE-bench VHumanEvalVRAM (Q4)FIMContextLicense
DeepSeek V4-Flash284B (13B active)โ€” (claimed strong)โ€”~150 GBNo1MMIT
Qwen3-Coder-Next80B (3B active)70.6% (rebench P@5: 64.6%)โ€”~38 GBNo256KApache 2.0
Qwen 3.6 35B-A3B โญ35B (3B active)73.4%โ€”~22 GBNo262KApache 2.0
Qwen 3.6 27B โญ27B77.2%โ€”~17 GBNo262KApache 2.0
Qwen 3.5 35B-A3B35B (3B active)69.2%โ€”~20 GBNo262KApache 2.0
Qwen 3.5 27B27B72.4%โ€”~16 GBNo262KApache 2.0
Qwen 2.5 Coder 32B32Bโ€”92.7%~20 GBYes128KApache 2.0
Qwen 2.5 Coder 14B14Bโ€”~89%~9 GBYes128KApache 2.0
Qwen 2.5 Coder 7B7Bโ€”88.4%~5 GBYes128KApache 2.0
Qwen 3.5 9B9Bโ€”โ€”6.6 GBNo262KApache 2.0
Qwen 3.5 4B4Bโ€”โ€”3.4 GBNo262KApache 2.0
DS Coder V2 Lite16B (2.4B active)โ€”81.1%~5 GBYes128KMIT
DS Coder 33B (V1)33Bโ€”70%~20 GBYes16KPermissive
CodeLlama 34B34Bโ€”53.7%~20 GBYes16KLlama

SWE-bench Verified numbers for Qwen 3.6 and 3.5 models are from the respective model cards on Hugging Face. HumanEval numbers are from the Qwen 2.5-Coder and DeepSeek-Coder announcements. Treat model-card numbers as “what the authors claim” until you see independent runs.

โ†’ Check what fits your hardware with our Planning Tool.


Which Should You Actually Use?

The decision tree for April 2026:

Daily coding on a 24GB GPU: Qwen 3.6-27B for chat, agentic work, and refactors. Qwen 2.5-Coder 32B for tab-complete if you want FIM. Swap between them.

Constrained hardware (16GB VRAM, or 8GB + 64GB RAM): Qwen 3.6-35B-A3B. Runs on the MoE activation pattern at usable speeds even with RAM offload. Pair with Qwen 2.5-Coder 14B for FIM if you have VRAM headroom.

Agentic Claude Code replacement: Community consensus from the week of April 22 is Qwen 3.6-27B dense if you have 24GB+, Qwen 3.6-35B-A3B if you don’t. Point Claude Code, OpenCode, or Aider at a local endpoint and go.

Tool-calling-heavy workflows: Test DeepSeek V4-Flash via the DeepSeek API first ($0.14/M input makes it cost-free to evaluate). If it works for your pipeline and you have the hardware, run weights locally. Otherwise stay on API.

Maximum privacy, fully local, nothing leaving the machine: Qwen 3.6-27B on 24GB, Qwen 3.6-35B-A3B on 16GB. Apache 2.0 both, weights on disk, no cloud calls, no policy changes.

FIM tab-complete specifically: Qwen 2.5-Coder at whatever size fits your card. Nothing has displaced it. Qwen 3.x is not a FIM family.


Switching from Claude Code to Local

April 4, 2026 was the breaking point for a lot of people. Anthropic’s policy change on third-party harnesses pushed OpenClaw users off flat-rate Claude Pro and onto per-token billing. Claude Code itself remains covered by the subscription, but the broader effect was a wave of “what’s the local story now?” posts on r/LocalLLaMA.

Practical setup:

Harness: Claude Code (if still accessible to you) or OpenCode pointed at a local OpenAI-compatible endpoint. Aider works too โ€” the Claude Code alternatives guide covers the tradeoffs.

Backend: Ollama for simplicity, llama.cpp for control, vLLM for multi-user or long-context throughput. See llama.cpp vs Ollama vs vLLM for which fits your setup.

Model: Qwen 3.6-27B if you have 24GB+. Qwen 3.6-35B-A3B if you don’t. Both are Apache 2.0, both support tool-calling out of the box through the qwen3_coder tool-call parser in SGLang and vLLM.

Honest framing: Local won’t match frontier cloud for the hardest multi-file cross-repo refactors. It will handle the 80%+ of day-to-day coding work โ€” single-file edits, bug fixes, test generation, boilerplate, explanation, refactoring โ€” based on the community reports from the week of April 22. That’s the realistic bar. If your bar is “replaces Claude Opus for the worst case,” you’re not there yet on consumer hardware.


Best Model by Language

All models above are multi-language, but some have particular strengths.

LanguageBest Local ModelNotes
PythonQwen 2.5-Coder (any size) or Qwen 3.6-27BPython is the best-benchmarked language across every coding model
JavaScript/TypeScriptQwen 2.5-Coder 14B+ or Qwen 3.6-27BStrong JS/TS; 3.6-27B handles modern TS generics well
RustQwen 3.6-27B or Qwen 2.5-Coder 32BSmaller models struggle with borrow checker; dense 27B handles it
GoQwen 2.5-Coder 14B+Clean Go output from 14B up
C/C++DeepSeek Coder 33B or Qwen 3.6-27BStrong low-level memory patterns
JavaQwen 2.5-Coder 14B+Good boilerplate generation
SQLQwen 2.5-Coder (any size)82% on Spider โ€” well ahead of competitors

The honest caveat: For Python and JavaScript, the 7B Qwen 2.5-Coder is genuinely excellent. For Rust, C++, and other complex compiled languages, bigger models produce noticeably better results. If Rust is your primary language and you only have 8GB, expect some friction.


How to Set Up Local Coding in Your Editor

Option 1: VS Code + Ollama + Continue

This is the free, open-source Copilot replacement.

Step 1: Install Ollama

Follow the Ollama setup guide. One command on any OS.

Step 2: Pull your coding model

# Pick your tier:
ollama pull qwen2.5-coder:7b       # 8GB VRAM (FIM)
ollama pull qwen2.5-coder:14b      # 16GB VRAM (FIM)
ollama pull qwen3.6:35b-a3b        # 16GB+ (chat/agentic)
ollama pull qwen3.6:27b            # 24GB+ (chat/agentic)

Step 3: Install Continue extension

Open VS Code โ†’ Extensions (Ctrl+Shift+X) โ†’ Search “Continue” โ†’ Install.

Step 4: Configure Continue

Edit ~/.continue/config.yaml:

name: Local Coding
version: 0.0.1
schema: v1
models:
  - uses: ollama/qwen3.6-27b
  - uses: ollama/qwen2.5-coder-14b
    role: autocomplete

Step 5: Code

  • Chat: Click the Continue icon, ask questions about your code
  • Autocomplete: Start typing
  • Edit: Select code, press Ctrl+I, describe the change

Option 2: Aider (Terminal)

Aider lives in your terminal, edits files in your git repo, and auto-commits.

pip install aider-chat
aider --model ollama/qwen3.6:27b

Best for developers who live in the terminal. Mature, well-documented, and handles repo-wide context through a repo map.

Option 3: OpenCode

OpenCode is the Go-based terminal agent that picked up a lot of ex-Claude Code users after April 4. Multi-editor support. Point it at a local OpenAI-compatible endpoint.

Option 4: Claude Code Against a Local Endpoint

Several r/LocalLLaMA posts from late April 2026 report running Claude Code against a local model via an OpenAI-compatible proxy. Quality varies by harness version. Works best with Qwen 3.6-27B at Q4_K_M or better.


Practical Tips

FIM vs. Chat: Know the Difference

FIM (Fill-in-the-Middle) powers inline autocomplete โ€” the cursor sits in the middle of your code and the model predicts what goes there. Qwen 2.5-Coder and DeepSeek-Coder support FIM. Qwen 3.5 and 3.6 do not โ€” they’re for chat and agentic coding.

Chat mode is for conversations, refactors, and multi-step agentic work. Most editors let you use both simultaneously.

Sampling Parameters Matter

Qwen 3.6’s card recommends specific settings for coding in thinking mode: temperature 0.6, top_p 0.95, top_k 20, presence_penalty 0.0. For general thinking-mode tasks: temperature 1.0, top_p 0.95, presence_penalty 0.0. For instruct (non-thinking) mode: temperature 0.7, top_p 0.80, presence_penalty 1.5. Ignore these and you get worse output. Full guidance on the Qwen 3.6 model card.

Keep a Small Model for Autocomplete

Autocomplete needs to be fast โ€” under 200ms ideally. On 8GB, Qwen 2.5-Coder 7B handles both chat and FIM fine. On 16-24GB, run a smaller model (Qwen 2.5-Coder 1.5B or 3B) for autocomplete and a bigger one for chat.

Context Window vs. VRAM

Bigger context eats VRAM. On 8GB, stick to 4-8K for coding. On 16GB, push to 16-32K. On 24GB, 64K+ is comfortable. Qwen 3.6’s 262K context is a ceiling, not a target โ€” loading it all consumes a lot of KV cache.

For navigating large codebases, quantization at Q4_K_S instead of Q4_K_M saves a few hundred MB that can go toward context.

Close Your Browser

Chrome’s hardware acceleration eats GPU memory. Close it or disable GPU acceleration when running local models on 8-16GB cards.


The Bottom Line

Local coding now covers three distinct jobs, and the answers shifted hard in late April 2026:

  1. Autocomplete/FIM: Qwen 2.5-Coder (7B/14B/32B) โ€” still unmatched at every tier for tab-complete. The 3.6 family is not a FIM family.
  2. Chat and agentic coding: Qwen 3.6-27B dense if you have 24GB+, Qwen 3.6-35B-A3B if you have 16GB or want RAM offload. Both Apache 2.0, both 262K context, both native tool-calling.
  3. Frontier-adjacent cloud testing: DeepSeek V4-Flash via API at $0.14/M input. Local only if you have serious homelab hardware.

The 24GB setup for April 2026:

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# FIM autocomplete
ollama pull qwen2.5-coder:32b

# Chat + agentic coding
ollama pull qwen3.6:27b

# Install Continue in VS Code, point both models at it, and code

No subscriptions. No policy changes that cut you off next week. No tokens leaving your machine. Just you, your code, and models that genuinely match what cloud assistants did a year ago.



Sources: Qwen3.6-27B Model Card, Qwen3.6-35B-A3B Model Card, Unsloth Qwen3.6-27B GGUF, Unsloth Qwen3.6-35B-A3B GGUF, DeepSeek V4-Pro Model Card, DeepSeek API Pricing, Artificial Analysis Intelligence Index, Amine Raji llama.cpp Qwen 3.6 benchmark, Simon Willison, Qwen2.5-Coder Technical Report, Qwen3-Coder-Next Model Card, SWE-rebench Leaderboard, Continue.dev Ollama Guide, EvalPlus Leaderboard