Run Your Coding Agent on Local Models with PI Agent + Ollama
Quick Answer: PI Agent is a free, open-source coding agent that runs on any model, including local ones via Ollama. Install PI, point it at your local Qwen 3.5 or Qwen3-Coder-Next, and you’ve got a private coding agent with zero API costs. It won’t match Claude Code on frontier-model tasks, but a well-configured PI with a strong local model handles real coding work.
📚 More on this topic: Best Local Coding Models · llama.cpp vs Ollama vs vLLM · Best Local Models for OpenClaw · Local Alternatives to Claude Code
Most coding agents lock you into a single model provider. Claude Code requires Anthropic. GitHub Copilot requires OpenAI. You pay per token, your code goes through someone else’s servers, and you have no control over what model runs underneath.
PI Agent is different. Built by Mario Zechner (the same developer behind OpenClaw), PI is a minimal, MIT-licensed terminal coding agent that works with any model from any provider, including local ones running on your own hardware via Ollama. No subscription, no API costs, no code leaving your machine.
This guide covers how to set it up, which local models work best for agentic coding, and what to expect from the experience.
What PI Agent actually is
PI is a terminal-based coding harness, not a full IDE plugin. You run it in your terminal, it reads and writes files, runs commands, and iterates on code. Think Claude Code’s terminal mode, but open source and model-agnostic.
The design philosophy is aggressive minimalism:
| Feature | PI Agent | Claude Code |
|---|---|---|
| System prompt | ~200 tokens | ~10,000 tokens |
| Default tools | 4 (read, write, edit, bash) | 10+ (read, write, edit, bash, glob, grep, web search, sub-agents…) |
| Model support | 324 models across 20+ providers | Claude family only |
| Local model support | Ollama, vLLM, LM Studio, llama.cpp | None |
| Permission model | YOLO by default | 5 safety modes, deny-first sandbox |
| Extension system | TypeScript in-process, 25+ event hooks | Shell-command hooks, 14 event types |
| Price | Free (MIT license) | $20–200/month or API keys |
| Sub-agents | Via extensions (spawn PI recursively) | Built-in (7 parallel) |
| MCP support | No (uses CLI tools instead) | Yes |
The 200-token system prompt is a deliberate choice. Mario’s argument: modern models have been RL-trained so heavily on coding agent tasks that they already know what to do. A 10,000-token prompt mostly tells the model things it already knows, while burning context that could go toward your actual codebase. Whether you agree with that philosophy depends on how much you trust your model.
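As a rough back-of-envelope (assuming a 32K-token context window, the smallest in the model table below), the overhead difference looks like this:

```python
# Back-of-envelope: fraction of a context window a system prompt consumes
# before any of your code is loaded. Assumes a 32K-token window.
CONTEXT_WINDOW = 32_768

def prompt_overhead(prompt_tokens: int, context: int = CONTEXT_WINDOW) -> float:
    """Share of the context window spent on the system prompt alone."""
    return prompt_tokens / context

print(f"~200-token prompt:    {prompt_overhead(200):.1%} of context")     # ~0.6%
print(f"~10,000-token prompt: {prompt_overhead(10_000):.1%} of context")  # ~30.5%
```

On a 131K window both shares shrink, but either way the saved tokens go toward your codebase instead of instructions.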
The YOLO mode default is the other big philosophical difference. PI gives the agent full filesystem access and unrestricted command execution. No permission popups, no “allow this tool?” dialogs. The rationale: once an agent can write and execute code, any safety gate is theater. If you want permission gates, you build them yourself as an extension.
Setup: PI + Ollama in 5 minutes
Prerequisites:
- Node.js 14+ installed
- Ollama installed and running with at least one coding model pulled
- A GPU with enough VRAM for your chosen model (see model table below)
Step 1: Install PI
```shell
npm install -g @mariozechner/pi-coding-agent
```
Step 2: Pull a coding model in Ollama
```shell
# Pick one based on your hardware — see model table below
ollama pull qwen3.5:35b-a3b    # 16GB+ VRAM
ollama pull qwen3-coder-next   # 48GB+ VRAM/unified memory
ollama pull qwen2.5-coder:7b   # 8GB VRAM minimum
```
Step 3: Configure PI for Ollama
Create `~/.pi/agent/models.json`:
```json
{
  "providers": {
    "ollama": {
      "baseUrl": "http://localhost:11434/v1",
      "api": "openai-completions",
      "apiKey": "ollama",
      "models": [
        { "id": "qwen3.5:35b-a3b", "name": "Qwen 3.5 35B-A3B", "contextWindow": 131072 },
        { "id": "qwen3-coder-next", "name": "Qwen3 Coder Next", "contextWindow": 256000 },
        { "id": "qwen2.5-coder:7b", "name": "Qwen 2.5 Coder 7B", "contextWindow": 32768 }
      ]
    }
  }
}
```
Create `~/.pi/agent/settings.json`:
```json
{
  "defaultProvider": "ollama",
  "defaultModel": "qwen3.5:35b-a3b"
}
```
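If you'd rather script the setup than hand-edit JSON, the two files above can be generated with a few lines of Python. This is just a sketch; it assumes the default `~/.pi/agent/` config location shown above, and you should adjust the model list to your hardware:

```python
import json
from pathlib import Path

# Sketch: generate the two PI config files shown above programmatically.
# Assumes the default ~/.pi/agent/ location.
config_dir = Path.home() / ".pi" / "agent"
config_dir.mkdir(parents=True, exist_ok=True)

models = {
    "providers": {
        "ollama": {
            "baseUrl": "http://localhost:11434/v1",
            "api": "openai-completions",
            "apiKey": "ollama",  # placeholder: Ollama's endpoint doesn't require a real key
            "models": [
                {"id": "qwen3.5:35b-a3b", "name": "Qwen 3.5 35B-A3B", "contextWindow": 131072},
            ],
        }
    }
}
settings = {"defaultProvider": "ollama", "defaultModel": "qwen3.5:35b-a3b"}

(config_dir / "models.json").write_text(json.dumps(models, indent=2) + "\n")
(config_dir / "settings.json").write_text(json.dumps(settings, indent=2) + "\n")
```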
Step 4: Run it
```shell
cd your-project
pi
```
PI opens an interactive terminal session. Start typing what you want done. It’ll read your files, write code, run commands, and iterate.
Switch models mid-session with /model if you want to try a different one without losing context.
Which local models work for agentic coding
Not every model that’s “good at code” works well as a coding agent. Agent tasks require tool calling, multi-step reasoning, and the ability to recover from errors. A model that generates clean code in a single shot might fall apart when asked to read a file, edit a function, run tests, and fix failures in a loop.
Here’s what actually works:
| Model | Params (active) | VRAM (Q4) | Agent Quality | Speed (16GB GPU) | Best For |
|---|---|---|---|---|---|
| Qwen3-Coder-Next | 80B (3B) | ~47GB | Excellent | N/A (needs 48GB+) | Serious agent work on high-VRAM systems |
| Qwen 3.5 35B-A3B | 35B (3B) | ~8GB | Very good | ~44 tok/s | Best balance for 16GB GPUs |
| Qwen3.5-27B dense | 27B | ~16GB | Very good | ~15 tok/s | Dense model, needs 24GB for real context |
| Qwen 2.5 Coder 32B | 32B | ~20GB | Good | ~10 tok/s (24GB) | Older but solid, needs 24GB |
| Qwen 2.5 Coder 7B | 7B | ~5GB | Decent | ~90 tok/s | Quick tasks on 8GB cards |
| DeepSeek-Coder-V2 16B | 16B (2.4B) | ~10GB | Good | ~50 tok/s | Decent MoE alternative |
The pick: Qwen 3.5 35B-A3B
For most people running local coding agents, Qwen 3.5 35B-A3B is the model to use. It’s a 35-billion-parameter MoE model with only 3B active parameters per token. That means it fits on a 16GB GPU at Q4 with 100K context, generates tokens at ~44 tok/s, and still has the reasoning depth of a much larger model.
On SWE-bench Verified, the Qwen 3.5 series matches or exceeds models with 10x the active parameter count. The 27B dense variant hits 72.4, competitive with GPT-5 mini. The 35B MoE trades a small quality margin for much lower hardware requirements.
For agent loops specifically, speed matters as much as quality. A coding agent might make 20–50 tool calls per task, each burning hundreds of tokens. At 44 tok/s, Qwen 3.5 35B-A3B keeps the loop responsive. At 10 tok/s (Qwen 32B dense on 24GB), you’re waiting 30+ seconds between each step. That’s the difference between a tool you’ll use daily and one that collects dust.
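That claim is quick arithmetic, assuming ~300 generated tokens per tool call and a 30-call task (both within the ranges given above):

```python
# Back-of-envelope: time a 30-call agent task spends waiting on generation,
# assuming ~300 generated tokens per tool call.
TOKENS_PER_CALL = 300
CALLS_PER_TASK = 30

def seconds_per_call(tok_per_s: float) -> float:
    return TOKENS_PER_CALL / tok_per_s

def minutes_per_task(tok_per_s: float) -> float:
    return CALLS_PER_TASK * seconds_per_call(tok_per_s) / 60

print(f"Qwen 3.5 35B-A3B @ 44 tok/s: {seconds_per_call(44):.1f}s per step, "
      f"{minutes_per_task(44):.1f} min per task")   # ~6.8s, ~3.4 min
print(f"Qwen 32B dense @ 10 tok/s:   {seconds_per_call(10):.1f}s per step, "
      f"{minutes_per_task(10):.1f} min per task")   # 30.0s, 15.0 min
```

Prefill time and your own review time come on top, but the generation gap alone separates an interactive tool from a batch job.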
If you have more VRAM: Qwen3-Coder-Next
If you have 48GB+ of VRAM (Mac Studio with 96GB unified, dual RTX 3090s, or a workstation GPU), Qwen3-Coder-Next is the model to run. It scores 70%+ on SWE-bench Verified and 36.2 on Terminal-Bench 2.0, putting it in the same tier as frontier cloud models for agentic coding.
On an AMD Strix Halo with 64GB unified memory, community testing shows ~37 tok/s at 32K context. On dual 3090s with Q3 quantization, expect ~25 tok/s generation and ~1,000 tok/s prefill at 32K context. That’s fast enough for productive agent work.
If you only have 8GB: Qwen 2.5 Coder 7B
It works. The model can follow basic tool-calling patterns, read files, and make edits. But it struggles with multi-step reasoning, frequently loses track of what it’s doing in longer agent sessions, and generates more errors that need correction. Fine for “edit this function” or “write a test for this class.” Not reliable for “refactor this module and update all the tests.”
What the experience is actually like
Running PI with a local model is not the same as running Claude Code with Sonnet.
What works well:
- File reading, editing, and creation. The core loop is solid.
- Running tests and iterating on failures. This is where agents earn their keep, and a good local model handles it.
- Grep/search through a codebase to find relevant code before making changes.
- Simple refactors: rename a variable across files, extract a function, add error handling.
- Writing boilerplate: test files, config files, CI pipelines.
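The test-and-iterate pattern from the list above is worth seeing in miniature. This is not PI’s actual implementation, just a sketch of the loop shape, with a stand-in function where the model would be:

```python
# Minimal sketch of the core agent loop: run tests, feed failures back to the
# model, apply its edit, repeat. The "model" here is a stand-in callable --
# in a real harness the model decides which tool to call at each step.
from typing import Callable

def agent_loop(run_tests: Callable[[], tuple],
               propose_fix: Callable[[str], Callable[[], None]],
               max_steps: int = 10) -> bool:
    """Iterate until tests pass or the step budget is exhausted."""
    for _ in range(max_steps):
        passed, output = run_tests()
        if passed:
            return True
        apply_edit = propose_fix(output)  # model turns failure output into an edit
        apply_edit()                      # the harness applies it to the workspace
    return False

# Toy demonstration with a fake workspace: a "bug" that one edit fixes.
state = {"bug": True}
ok = agent_loop(
    run_tests=lambda: (not state["bug"], "AssertionError: bug present"),
    propose_fix=lambda output: (lambda: state.update(bug=False)),
)
print(ok)  # True: the loop converged after one fix
```

The step budget is the part that matters for local models: a weaker model burns more iterations per fix, so the same loop that converges quickly with a frontier model can exhaust its budget locally.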
What gets shaky:
- Multi-file architectural changes. Frontier models handle these better because they can hold more context and reason about distant dependencies.
- Long sessions (50+ tool calls). Local models lose coherence faster than Claude Sonnet on extended tasks. PI’s tree-based session structure helps — you can fork back to a good state when the model goes off track.
- Unfamiliar frameworks or libraries. Local models have less training data coverage for niche tools compared to frontier models trained on more data.
The harness compensates for model intelligence. Mario makes this point in his blog post, and he’s right. PI’s demos show Haiku (a bottom-tier cloud model) doing useful coding work, because the harness structure, hooks, and task management add determinism that the model itself lacks. The same applies to local models. A well-structured PI setup with AGENTS.md files, extension-based guardrails, and a strong local model handles most day-to-day coding tasks.
Basic customization
PI’s extension system uses TypeScript. You don’t need it for basic usage, but a few simple customizations make the local model experience better.
Project-level instructions (AGENTS.md)
Create an AGENTS.md file in your project root. PI automatically reads it as context:
```markdown
# Project: my-app

## Stack
- TypeScript, Node.js 20, Express
- PostgreSQL with Drizzle ORM
- Jest for testing

## Rules
- Always run `npm test` after making changes
- Use existing patterns from src/ — don't invent new abstractions
- Import from @/lib, not relative paths
```
This helps more than anything else you can configure. It grounds the local model in your project’s conventions without burning system prompt tokens.
Model switching for different tasks
You can switch models mid-session with /model. A practical workflow:
- Start with Qwen 2.5 Coder 7B for quick lookups and small edits (fast, cheap on VRAM)
- Switch to Qwen 3.5 35B-A3B for complex reasoning and multi-file changes
- Switch back to 7B for running tests and fixing lint errors
PI tracks token usage and cost (even for local models, you can set notional costs to track context consumption). The /model command preserves your session history.
Slash commands via skills
PI supports reusable prompt templates as “skills.” Create `~/.pi/agent/skills/review.md`:

```markdown
---
description: Code review the staged changes
---
Run `git diff --cached` and review the changes.
Check for:
- Bugs or logic errors
- Missing error handling
- Naming that doesn't match the codebase
- Tests that should exist but don't
Be specific. Reference line numbers.
```
Then use /review in any PI session.
Limitations you should know about
PI makes tradeoffs. Some are deliberate design choices, some are gaps.
No MCP support. PI doesn’t implement the Model Context Protocol. Mario’s position is that MCP adds 7–9% context overhead for what’s effectively a wrapper around CLI tools. PI’s approach: install the CLI tool, give it a README, and the agent figures it out. This works fine in practice, but it means you can’t reuse MCP servers you’ve already configured for other tools.
No built-in sub-agents. Claude Code can spin up 7 parallel sub-agents to handle tasks concurrently. PI doesn’t have this out of the box. You can build it by spawning PI recursively via bash (pi --print), but it’s manual work compared to Claude Code’s Task tool.
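A DIY version can be sketched in a few lines of Python around `pi --print`. The exact flag behavior is assumed here (check `pi --help` on your install); the demonstration swaps in `echo` so the sketch runs without PI:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Sketch of DIY parallel sub-agents: run several non-interactive PI
# invocations at once. Assumes `pi --print <prompt>` runs a single task
# and exits -- verify the flags against your PI version.
def run_subagents(prompts, base_cmd=("pi", "--print")):
    """Run one sub-agent per prompt in parallel; return each one's stdout."""
    def run_one(prompt):
        result = subprocess.run(list(base_cmd) + [prompt],
                                capture_output=True, text=True)
        return result.stdout.strip()

    with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
        return list(pool.map(run_one, prompts))  # map preserves prompt order

# Demonstration with `echo` standing in for `pi`:
print(run_subagents(["refactor auth module", "write tests"], base_cmd=("echo",)))
```

Unlike Claude Code’s Task tool, nothing coordinates the sub-agents: if two of them edit the same file, the last write wins, so you have to partition the work yourself.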
Deep customization requires TypeScript. If you want to write extensions that modify tool behavior, add custom UI elements, or intercept agent actions, you need TypeScript. This is a strength (in-process, 25+ event hooks, access to full session state) but also a barrier if TypeScript isn’t your stack.
No IDE integration. PI is terminal-only. No VS Code panel, no JetBrains plugin. You work in your terminal alongside your editor. Some people prefer this; others find it disruptive.
No enterprise features. No SSO, no audit logs, no managed deployment. PI is for individual developers and small teams who want control over their tools.
Model quality ceiling. Even the best local models don’t match Claude Sonnet on hard coding tasks. Qwen3-Coder-Next gets close (70%+ SWE-bench vs. Claude’s ~72%), but Qwen 3.5 35B-A3B on a 16GB GPU is a step below that. You’re trading some capability for privacy and zero cost. For most day-to-day coding tasks, the tradeoff is worth it. For hard debugging and architectural decisions, you might still want a frontier model.
PI vs. Claude Code vs. Aider: when to use each
| Factor | PI + Local Model | Claude Code | Aider |
|---|---|---|---|
| Cost | $0 (local) | $20–200/mo or API | API costs only |
| Privacy | Full (nothing leaves your machine) | Code goes to Anthropic | Code goes to provider |
| Model flexibility | Any model, any provider | Claude only | Most models |
| Best model quality | Good (local) to Excellent (local + cloud mix) | Excellent | Excellent (uses frontier) |
| Setup complexity | Medium (Ollama + PI config) | Low (npm install, API key) | Low (pip install, API key) |
| Extension system | TypeScript, very flexible | Hooks + plugins | Limited |
| Git integration | Manual (via bash) | Built-in | Excellent (auto-commits) |
| IDE integration | None (terminal only) | VS Code, JetBrains | VS Code watch mode |
Use PI + local models when: privacy matters, you want zero ongoing costs, you enjoy customizing your tools, or you’re running coding agents on infrastructure you control.
Use Claude Code when: you need the best possible model quality, you want batteries-included with minimal setup, or you’re working on hard problems where frontier model intelligence is the bottleneck.
Use Aider when: you want tight git integration with automatic commits, you prefer a Python ecosystem, or you want to mix cloud models without building your own configuration.
Getting started
- Install Ollama and pull `qwen3.5:35b-a3b` (or whatever fits your GPU)
- `npm install -g @mariozechner/pi-coding-agent`
- Create `~/.pi/agent/models.json` and `settings.json` as shown above
- Add an `AGENTS.md` to your project with stack and conventions
- Run `pi` in your project directory
You’ll know within 30 minutes whether the local model experience works for your workflow. If it does, you just saved yourself $20–200/month and gained full privacy over your codebase.
📚 Related guides: Best Local Coding Models 2026 · Qwen 3.5 Local Guide · llama.cpp vs Ollama vs vLLM · Local Alternatives to Claude Code · Best Local Models for OpenClaw