Best Local Models for PI Agent: Qwen 3.6, Gemma 4 (2026 Setup)

Quick Answer: PI Agent is Mario Zechner’s MIT-licensed terminal coding agent. Point it at any local Ollama model and you’ve got a private coding assistant with zero API costs. This guide covers the current install path, models.json + settings.json configuration, model recommendations across VRAM tiers from 8GB through 64GB+, and the per-task model-switching workflow that makes a small-GPU setup feel responsive. May 2026 picks come from the Qwen 3.6 and Gemma 4 families. Two model-specific tool-calling gotchas have known workarounds, covered below.


Most coding agents tie you to a hosted provider. Claude Code is built around Anthropic’s models. GitHub Copilot runs on whatever hosted models GitHub offers. You pay per token or per seat, your code goes through someone else’s servers, and you have little control over what model runs underneath.

PI Agent is different. Built by Mario Zechner, PI is a minimal, MIT-licensed terminal coding agent that works with any model from any provider, including local ones running on your own hardware via Ollama. No subscription, no API costs, no code leaving your machine.

This guide covers how to set it up, which local models work best for agentic coding, and what to expect from the experience.

What’s New (May 2026)

PI Agent is still actively used: in May 2026, r/LocalLLaMA has PI + Qwen 3.6 setup threads, a PI + Cline-Kanban integration project, and PSA-style writeups on model selection. The setup steps below remain accurate. What’s changed is the model lineup.

May 2026 model picks for PI Agent:

VRAM	Pick	Why
8 GB	Qwen 2.5 Coder 7B / Qwen 3.5 9B	Budget floor; covered in the body below
12 GB	Qwen 2.5 14B / Qwen 2.5 Coder 14B	Dense step-up
16 GB	Qwen 3.6-35B-A3B with `--cpu-moe`	MoE with ~3B active parameters, so expert weights can sit in system RAM; see Run Qwen 3.6-35B MoE Locally
24 GB	Qwen 3.6-27B or Gemma 4 26B-A4B	New dense / MoE defaults; 27B leads on agentic coding

The realistic minimum for PI Agent + a usable local model: 8 GB if you can live with Qwen 2.5 Coder 7B, 16 GB if you want Qwen 3.6 35B-A3B with --cpu-moe offload.
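The tier table above can be read as a simple lookup. The sketch below mirrors the four rows as data and returns the pick for the largest tier a card satisfies. The function name `pick_for_vram` is made up for illustration and is not part of PI or Ollama.

```python
from typing import Optional

# (minimum VRAM in GB, pick) pairs mirroring the May 2026 table above.
TIERS = [
    (8, "Qwen 2.5 Coder 7B / Qwen 3.5 9B"),
    (12, "Qwen 2.5 14B / Qwen 2.5 Coder 14B"),
    (16, "Qwen 3.6-35B-A3B with --cpu-moe"),
    (24, "Qwen 3.6-27B or Gemma 4 26B-A4B"),
]


def pick_for_vram(vram_gb: float) -> Optional[str]:
    """Return the pick for the largest tier that fits, or None below 8 GB."""
    best = None
    for min_gb, pick in TIERS:
        if vram_gb >= min_gb:
            best = pick  # tiers are sorted ascending, so the last match wins
    return best


print(pick_for_vram(16))
```

Cards that fall between tiers (a 20 GB card, say) get the tier below, which is the conservative reading of the table.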

Two tool-calling gotchas worth knowing:

Qwen 3.6 whitespace. Write --chat-template-kwargs '{"enable_thinking":false}' (no space after the colon) or Qwen 3.6 silently rejects it and routes tool calls to the reasoning channel. See function calling.
Gemma 4 thinking-mode default. Launch llama-server with --jinja --chat-template-kwargs '{"enable_thinking":false}' or PI Agent sees empty content on every call (output goes to reasoning_content).
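Both gotchas are easy to guard against in a launcher script. The sketch below builds the llama-server argument list with a compact JSON string (no space after the colon) and includes a small check for the symptom described above, where the answer lands in reasoning_content and content stays empty. The helper names and the `model.gguf` path are placeholders, and the message shape assumes an OpenAI-style chat response.

```python
import json


def thinking_off_args(model_path: str) -> list:
    """Build a llama-server argv with thinking disabled via the chat template."""
    # separators=(",", ":") drops the space json.dumps puts after ":" by default
    kwargs = json.dumps({"enable_thinking": False}, separators=(",", ":"))
    return ["llama-server", "-m", model_path,
            "--jinja", "--chat-template-kwargs", kwargs]


def went_to_reasoning(message: dict) -> bool:
    """True when content is empty but reasoning_content carries text."""
    content = (message.get("content") or "").strip()
    reasoning = (message.get("reasoning_content") or "").strip()
    return not content and bool(reasoning)


args = thinking_off_args("model.gguf")  # placeholder path
print(args[-1])
print(went_to_reasoning({"content": "", "reasoning_content": "Let me think"}))
```

Passing the list straight to `subprocess.run(args)` avoids shell quoting entirely, which is where stray whitespace usually creeps in.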

Watch num_ctx. PI Agent context files balloon fast. Going above the VRAM ceiling triggers silent CPU fallback and tanks tool-call reliability. See num_ctx VRAM overflow.
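To see why a large num_ctx overflows VRAM, it helps to estimate the KV cache. The sketch below uses the standard formula for a full-attention transformer (keys plus values, per layer, per KV head, per token). The layer and head counts are hypothetical placeholders, not the real Qwen 3.6 or Gemma 4 values, and models with hybrid or sliding-window attention cache less than this, so treat the result as a rough upper bound.

```python
def kv_cache_gib(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV-cache size in GiB for a standard full-attention transformer."""
    # 2x for keys and values; fp16 cache entries are 2 bytes each
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem
    return total_bytes / 2**30


def max_ctx_that_fits(vram_gib, weights_gib, n_layers, n_kv_heads, head_dim,
                      bytes_per_elem=2, headroom_gib=1.0):
    """Largest context whose KV cache fits beside the weights, with headroom."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    free_bytes = (vram_gib - weights_gib - headroom_gib) * 2**30
    return max(0, int(free_bytes // per_token))


# Hypothetical architecture, chosen only to make the arithmetic easy to follow.
ARCH = dict(n_layers=32, n_kv_heads=8, head_dim=128)

print(kv_cache_gib(32_768, **ARCH))      # 4.0 GiB
print(kv_cache_gib(262_144, **ARCH))     # 32.0 GiB
print(max_ctx_that_fits(24, 17, **ARCH))  # 49152 tokens
```

With these placeholder numbers, a 262,144-token window would need more cache than a 24 GB card has in total, which is the situation that triggers the CPU fallback.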

For the broader local-agent shortlist, Best Local Alternatives to Claude Code covers the harness landscape.

What PI Agent actually is

PI is a terminal-based coding harness, not a full IDE plugin. You run it in your terminal, it reads and writes files, runs commands, and iterates on code. Think Claude Code’s terminal mode, but open source and model-agnostic.

The design philosophy is aggressive minimalism:

Feature	PI Agent	Claude Code
System prompt	~200 tokens	~10,000 tokens
Default tools	4 (read, write, edit, bash)	10+ (read, write, edit, bash, glob, grep, web search, sub-agents…)
Model support	324 models across 20+ providers	Claude family only
Local model support	Ollama, vLLM, LM Studio, llama.cpp	None
Permission model	YOLO by default	5 safety modes, deny-first sandbox
Extension system	TypeScript in-process, 25+ event hooks	Shell-command hooks, 14 event types
Price	Free (MIT license)	$20–200/month or API keys
Sub-agents	Via extensions (spawn PI recursively)	Built-in (7 parallel)
MCP support	No (uses CLI tools instead)	Yes

The 200-token system prompt is a deliberate choice. Mario’s argument: modern models have been RL-trained so heavily on coding agent tasks that they already know what to do. A 10,000-token prompt mostly tells the model things it already knows, while burning context that could go toward your actual codebase. Whether you agree with that philosophy depends on how much you trust your model.
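The context cost is easy to put in numbers. As a rough illustration, the sketch below computes what share of a 32,768-token window (the contextWindow used for Qwen 2.5 Coder 7B later in this guide) each prompt size would occupy. Pairing a 10,000-token prompt with a small local window is a thought experiment, not a measured configuration.

```python
def prompt_share(prompt_tokens: int, context_window: int) -> float:
    """Fraction of the context window consumed by the system prompt."""
    return prompt_tokens / context_window


WINDOW = 32_768  # tokens

for name, tokens in [("PI (~200 tokens)", 200),
                     ("Claude Code-sized (~10,000 tokens)", 10_000)]:
    print(f"{name}: {prompt_share(tokens, WINDOW):.1%} of the window")
```

On a small local window the gap is roughly 0.6% versus 30.5% of the available context, before a single file has been read.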

The YOLO mode default is the other big philosophical difference. PI gives the agent full filesystem access and unrestricted command execution. No permission popups, no “allow this tool?” dialogs. The rationale: once an agent can write and execute code, any safety gate is theater. If you want permission gates, you build them yourself as an extension.

Setup: PI + Ollama in 5 minutes

Prerequisites:

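Node.js installed (a current LTS release; Node 14 is end-of-life, so check the package’s engines field for the exact minimum)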
Ollama installed and running with at least one coding model pulled
A GPU with enough VRAM for your chosen model (see model table below)

Step 1: Install PI

npm install -g --ignore-scripts @earendil-works/pi-coding-agent

Or use the official installer:

curl -fsSL https://pi.dev/install.sh | sh

The package moved from @mariozechner/pi-coding-agent to @earendil-works/pi-coding-agent in May 2026 when the repo transferred to a new organization. Same authors (Mario Zechner and Armin Ronacher), same binary (pi), same config paths. The old package name is frozen and no longer receives updates.

Step 2: Pull a coding model in Ollama

# Pick one based on your hardware — see model table below
ollama pull qwen3.6:35b             # 24GB clean / 16GB via llama.cpp --cpu-moe offload
ollama pull qwen3-coder-next        # 64GB+ VRAM/unified memory
ollama pull qwen2.5-coder:7b        # 8GB VRAM minimum
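Before moving on, it can be worth confirming that the pulls actually landed. The sketch below queries Ollama’s `/api/tags` endpoint, which lists locally available models, and reports anything missing. The helper names are illustrative, and the default port 11434 matches the baseUrl used in this guide.

```python
import json
import urllib.request


def normalize(name: str) -> str:
    # An untagged pull such as "qwen3-coder-next" is stored under ":latest"
    return name if ":" in name else f"{name}:latest"


def missing_models(tags_response: dict, wanted: list) -> list:
    """Return the wanted model names that do not appear in /api/tags output."""
    have = {m["name"] for m in tags_response.get("models", [])}
    return [w for w in wanted if normalize(w) not in have]


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """GET /api/tags from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


if __name__ == "__main__":
    wanted = ["qwen3.6:35b", "qwen2.5-coder:7b"]
    try:
        print("missing:", missing_models(fetch_tags(), wanted))
    except OSError as err:  # connection refused, timeout, DNS failure
        print("Ollama not reachable:", err)
```

An empty `missing` list means every model you plan to reference in models.json is already on disk.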

Step 3: Configure PI for Ollama

Create ~/.pi/agent/models.json:

{
  "providers": {
    "ollama": {
      "baseUrl": "http://localhost:11434/v1",
      "api": "openai-completions",
      "apiKey": "ollama",
      "models": [
        { "id": "qwen3.6:35b", "name": "Qwen 3.6 35B-A3B", "contextWindow": 262144 },
        { "id": "qwen3-coder-next", "name": "Qwen3 Coder Next", "contextWindow": 256000 },
        { "id": "qwen2.5-coder:7b", "name": "Qwen 2.5 Coder 7B", "contextWindow": 32768 }
      ]
    }
  }
}

Create ~/.pi/agent/settings.json:

{
  "defaultProvider": "ollama",
  "defaultModel": "qwen3.6:35b"
}
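A common slip is a defaultModel that doesn’t match any id in models.json. The sketch below is a small consistency check across the two files, using only the fields shown above. It is not PI’s own validation, just a quick sanity pass you could run after editing either file.

```python
import json
from pathlib import Path


def check_config(models: dict, settings: dict) -> list:
    """Return a list of human-readable problems; empty means consistent."""
    problems = []
    provider = settings.get("defaultProvider")
    model = settings.get("defaultModel")
    providers = models.get("providers", {})
    if provider not in providers:
        problems.append(f"defaultProvider {provider!r} is not defined in models.json")
        return problems
    ids = [m.get("id") for m in providers[provider].get("models", [])]
    if model not in ids:
        problems.append(f"defaultModel {model!r} is not listed under {provider!r}")
    return problems


if __name__ == "__main__":
    agent_dir = Path.home() / ".pi" / "agent"
    try:
        models = json.loads((agent_dir / "models.json").read_text())
        settings = json.loads((agent_dir / "settings.json").read_text())
    except (OSError, json.JSONDecodeError) as err:
        print("could not load config:", err)
    else:
        print(check_config(models, settings) or "config looks consistent")
```

The JSONDecodeError branch also catches the trailing-comma and missing-quote mistakes that are easy to make when hand-editing these files.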

Step 4: Run it

cd your-project
pi

PI opens an interactive terminal session. Start typing what you want done. It’ll read your files, write code, run commands, and iterate.

Switch models mid-session with /model if you want to try a different one without losing context.

Which local models work for agentic coding

Not every model that’s “good at code” works well as a coding agent. Agent tasks require tool calling, multi-step reasoning, and the ability to recover from errors. A model that generates clean code in a single shot might fall apart when asked to read a file, edit a function, run tests, and fix failures in a loop.

Here’s what actually works:

Model	Params (active)	VRAM (Q4)	Agent Quality	Speed	Best For
Qwen3-Coder-Next	80B (3B)	~52GB	Excellent	~25–37 tok/s (dual 3090 Q3 / Strix Halo, 32K ctx)	Serious agent work on high-VRAM systems
Qwen 3.6 35B-A3B	35B (3B)	~24GB	Excellent	101 / 81 tok/s (RTX 3090, short/long)	24GB cards clean; 16GB with `--cpu-moe`
Qwen 3.6-27B dense	27B	~17GB	Excellent	~25 tok/s (Q4_K_M, 65K ctx)	Dense alternative on 24GB cards
Qwen 2.5 Coder 7B	7B	~5GB	Decent	~90 tok/s	Quick tasks on 8GB cards
Qwen 3.5 9B	9B	~6GB	Decent	~70 tok/s	8GB general-purpose alternative
DeepSeek-Coder-V2 16B	16B (2.4B)	~10GB	Good	~50 tok/s	MoE alternative on 12GB+ cards

The pick: Qwen 3.6 35B-A3B

For most people running local coding agents on a 24GB card, Qwen 3.6 35B-A3B is the model to use. It’s a 35B parameter MoE with 3B active per token, ~24GB at Q4 on a single RTX 3090. Amine Raji benched it at 101.7 tok/s on short prompts and 80.9 tok/s on long prompts via native llama.cpp on an RTX 3090. That’s ~30% slower than the Qwen 3.5 35B-A3B it replaces — the hybrid attention adds per-token compute — but agent quality goes up enough to justify the tradeoff.

On SWE-bench Verified, Qwen 3.6-27B dense hits 77.2 per Qwen’s model card. The 35B-A3B MoE trades a small quality margin for lower hardware requirements and faster generation. Both are in the same tier as frontier cloud models on real-world GitHub issue resolution.

For 16GB cards, run 3.6 35B-A3B through llama.cpp’s llama-server with --cpu-moe, which keeps the MoE expert weights in system RAM. You’ll lose throughput compared to the 24GB clean path, but the agent loop stays usable. The Run Qwen 3.6-35B MoE Locally guide covers the offload setup.

For agent loops specifically, speed matters as much as quality. A coding agent might make 20–50 tool calls per task, each generating hundreds of tokens. At 80+ tok/s, Qwen 3.6 35B-A3B keeps the loop responsive. At 10 tok/s on a dense 32B squeezed into 24GB, a 300-token step means waiting 30+ seconds between steps. That’s the difference between a tool you’ll use daily and one that collects dust.
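The waiting-time claim is simple division, sketched below. The 300 tokens per call and 30 calls per task are assumed stand-ins for “hundreds of tokens” and “20–50 tool calls”, and the estimate covers generation only, ignoring prompt processing and the time the tools themselves take.

```python
def wait_seconds(tokens_per_call: int, tok_per_s: float, calls: int = 1) -> float:
    """Generation time in seconds for a number of equally sized tool calls."""
    return calls * tokens_per_call / tok_per_s


TOKENS_PER_CALL = 300  # assumed
CALLS_PER_TASK = 30    # assumed, inside the 20-50 range quoted above

for label, speed in [("dense model at 10 tok/s", 10),
                     ("Qwen 3.6 35B-A3B at 80 tok/s", 80)]:
    per_step = wait_seconds(TOKENS_PER_CALL, speed)
    per_task = wait_seconds(TOKENS_PER_CALL, speed, calls=CALLS_PER_TASK)
    print(f"{label}: {per_step:.2f}s per step, {per_task / 60:.1f} min per task")
```

Under these assumptions the slow setup spends about 15 minutes per task just generating tokens, against under 2 minutes at 80 tok/s.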

If you have more VRAM: Qwen3-Coder-Next

If you have 64GB+ of VRAM or unified memory (Mac Studio with 96GB unified, a workstation GPU, or dual RTX 3090s with a smaller quant and offload), Qwen3-Coder-Next is the model to run. It’s an 80B MoE with 3B active, 256K context, ~52GB at Q4. It scores 70%+ on SWE-bench Verified and 36.2 on Terminal-Bench 2.0, putting it in the same tier as frontier cloud models for agentic coding.

On an AMD Strix Halo with 64GB unified memory, community testing shows ~37 tok/s at 32K context. On dual 3090s with Q3 quantization, expect ~25 tok/s generation and ~1,000 tok/s prefill at 32K context. That’s fast enough for productive agent work.

If you only have 8GB: Qwen 2.5 Coder 7B

It works. The model can follow basic tool-calling patterns, read files, and make edits. But it struggles with multi-step reasoning, frequently loses track of what it’s doing in longer agent sessions, and generates more errors that need correction. Fine for “edit this function” or “write a test for this class.” Not reliable for “refactor this module and update all the tests.”

Qwen 3.5 9B is the general-purpose alternative at this tier — newer architecture, better reasoning, slightly more VRAM (~6GB at Q4). For agentic coding specifically, the code-specialized 2.5 Coder 7B still tends to win on tool-calling reliability at low VRAM. If you’re mixing coding with general Q&A, swap to 3.5 9B.

What the experience is actually like

Running PI with a local model is not the same as running Claude Code with Sonnet.

What works well:

File reading, editing, and creation. The core loop is solid.
Running tests and iterating on failures. This is where agents earn their keep, and a good local model handles it.
Grep/search through a codebase to find relevant code before making changes.
Simple refactors: rename a variable across files, extract a function, add error handling.
Writing boilerplate: test files, config files, CI pipelines.

What gets shaky:

Multi-file architectural changes. Frontier models handle these better because they can hold more context and reason about distant dependencies.
Long sessions (50+ tool calls). Local models lose coherence faster than Claude Sonnet on extended tasks. PI’s tree-based session structure helps — you can fork back to a good state when the model goes off track.
Unfamiliar frameworks or libraries. Local models have seen less training data on niche tools than frontier models have.

The harness compensates for model intelligence. Mario makes this point in his blog post, and he’s right. PI’s demos show Haiku (a bottom-tier cloud model) doing useful coding work, because the harness structure, hooks, and task management add determinism that the model itself lacks. The same applies to local models. A well-structured PI setup with AGENTS.md files, extension-based guardrails, and a strong local model handles most day-to-day coding tasks.

Basic customization

PI’s extension system uses TypeScript. You don’t need it for basic usage, but a few simple customizations make the local model experience better.

Project-level instructions (AGENTS.md)

Create an AGENTS.md file in your project root. PI automatically reads it as context:

# Project: my-app

## Stack
- TypeScript, Node.js 20, Express
- PostgreSQL with Drizzle ORM
- Jest for testing

## Rules
- Always run `npm test` after making changes
- Use existing patterns from src/ — don't invent new abstractions
- Import from @/lib, not relative paths

This helps more than anything else you can configure. It grounds the local model in your project’s conventions without burning system prompt tokens.

Model switching for different tasks

You can switch models mid-session with /model. A practical workflow:

Start with Qwen 2.5 Coder 7B for quick lookups and small edits (fast, cheap on VRAM)
Switch to Qwen 3.6 35B-A3B for complex reasoning and multi-file changes
Switch back to 7B for running tests and fixing lint errors

PI tracks token usage and cost (even for local models, you can set notional costs to track context consumption). The /model command preserves your session history.

Slash commands via skills

PI supports reusable prompt templates as “skills.” Create ~/.pi/agent/skills/review.md:

---
description: Code review the staged changes
---
Run `git diff --cached` and review the changes.

Check for:
- Bugs or logic errors
- Missing error handling
- Naming that doesn't match the codebase
- Tests that should exist but don't

Be specific. Reference line numbers.

Then use /review in any PI session.

Limitations you should know about

PI makes tradeoffs. Some are deliberate design choices, some are gaps.

No MCP support. PI doesn’t implement the Model Context Protocol. Mario’s position is that MCP adds 7–9% context overhead for what’s effectively a wrapper around CLI tools. PI’s approach: install the CLI tool, give it a README, and the agent figures it out. This works fine in practice, but it means you can’t reuse MCP servers you’ve already configured for other tools.

No built-in sub-agents. Claude Code can spin up 7 parallel sub-agents to handle tasks concurrently. PI doesn’t have this out of the box. You can build it by spawning PI recursively via bash (pi --print), but it’s manual work compared to Claude Code’s Task tool.
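One way to approximate sub-agents is a small wrapper that launches several non-interactive PI runs side by side. The sketch below assumes that `pi --print <prompt>` runs a single turn and writes the result to stdout. Only the --print flag comes from this guide; the prompt-as-argument form and the helper names are assumptions. The `run` parameter exists so the wrapper can be exercised without PI installed.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(prompt: str) -> list:
    # Assumption: the prompt is passed as a positional argument after --print
    return ["pi", "--print", prompt]


def run_one(prompt: str, run=subprocess.run) -> str:
    """Run one non-interactive PI turn and return its stdout."""
    result = run(build_command(prompt), capture_output=True, text=True)
    return result.stdout


def run_parallel(prompts: list, max_workers: int = 2, run=subprocess.run) -> list:
    """Run several prompts concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda p: run_one(p, run=run), prompts))


# Example call (requires pi on PATH):
# outputs = run_parallel(["Summarize src/api", "List failing tests"])
```

With a single local GPU the workers all queue on the same model, so a low `max_workers` is usually the sensible default. Parallelism pays off more when the sub-agents point at different providers.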

TypeScript SDK for deep customization. If you want to write extensions that modify tool behavior, add custom UI elements, or intercept agent actions, you need TypeScript. This is a strength (in-process, 25+ event hooks, access to full session state) but also a barrier if TypeScript isn’t your stack.

No IDE integration. PI is terminal-only. No VS Code panel, no JetBrains plugin. You work in your terminal alongside your editor. Some people prefer this; others find it disruptive.

No enterprise features. No SSO, no audit logs, no managed deployment. PI is for individual developers and small teams who want control over their tools.

Model quality ceiling. Even the best local models don’t match Claude Sonnet on hard coding tasks. Qwen3-Coder-Next gets close (70%+ SWE-bench vs. Claude’s ~72%), but Qwen 3.6 35B-A3B on a 24GB GPU (or 16GB with --cpu-moe) is a step below that. You’re trading some capability for privacy and zero cost. For most day-to-day coding tasks, the tradeoff is worth it. For hard debugging and architectural decisions, you might still want a frontier model.

PI vs. Claude Code vs. Aider: when to use each

Factor	PI + Local Model	Claude Code	Aider
Cost	$0 (local)	$20–200/mo or API	API costs only
Privacy	Full (nothing leaves your machine)	Code goes to Anthropic	Code goes to provider
Model flexibility	Any model, any provider	Claude only	Most models
Best model quality	Good (local) to Excellent (local + cloud mix)	Excellent	Excellent (uses frontier)
Setup complexity	Medium (Ollama + PI config)	Low (npm install, API key)	Low (pip install, API key)
Extension system	TypeScript, very flexible	Hooks + plugins	Limited
Git integration	Manual (via bash)	Built-in	Excellent (auto-commits)
IDE integration	None (terminal only)	VS Code, JetBrains	VS Code watch mode

Use PI + local models when: privacy matters, you want zero ongoing costs, you enjoy customizing your tools, or you’re running coding agents on infrastructure you control.

Use Claude Code when: you need the best possible model quality, you want batteries-included with minimal setup, or you’re working on hard problems where frontier model intelligence is the bottleneck.

Use Aider when: you want tight git integration with automatic commits, you prefer a Python ecosystem, or you want to mix cloud models without building your own configuration.

Getting started

Install Ollama and pull qwen3.6:35b (or whatever fits your GPU)
npm install -g --ignore-scripts @earendil-works/pi-coding-agent
Create ~/.pi/agent/models.json and settings.json as shown above
Add an AGENTS.md to your project with stack and conventions
Run pi in your project directory

You’ll know within 30 minutes whether the local model experience works for your workflow. If it does, you just saved yourself $20–200/month and gained full privacy over your codebase.