Running OpenClaw 100% Local — Zero API Costs
📚 More on this topic: Best Models for OpenClaw · OpenClaw Token Optimization · OpenClaw Setup Guide · OpenClaw Security Guide · Run Your First Local LLM
Most OpenClaw guides assume you’re running Claude or GPT-4 behind the scenes. That means API keys, monthly bills, and the nagging anxiety of watching your Anthropic balance drain while the agent runs overnight.
There’s another path. OpenClaw’s architecture doesn’t care where the intelligence comes from. It speaks the OpenAI API format, and Ollama speaks it too. Point the config at localhost, pull a capable model, and the entire system runs on your hardware. No API keys, no cloud calls, no monthly bills.
The cost is capability. A local 32B model isn’t Claude Opus. But for a surprising amount of agent work (file management, code generation, research compilation, scheduled automations), it’s enough.
What You’re Replacing (And What It Costs)
Before configuring anything, understand what running OpenClaw on cloud APIs actually costs. These numbers come from real user reports and our token optimization guide.
The Default API Bill
| Usage Pattern | Monthly Cost (Sonnet) | Monthly Cost (Opus) |
|---|---|---|
| Idle only (heartbeats) | $60-150 | $150-450 |
| Light daily use (1-2 tasks) | $90-200 | $250-600 |
| Active daily use (5-10 tasks) | $200-400 | $500-1,500 |
| Overnight batch jobs | $6-150 per run | $50-500 per run |
Even optimized setups spend $15-30/month. A fully local setup costs $0/month in recurring fees — you pay once for the hardware and electricity.
The Real Math
One OpenClaw user loaded $25 onto Anthropic and watched it drain to $5 in a single day with the agent sitting idle. Another woke up to $500 gone overnight from a runaway task. These aren’t edge cases. OpenClaw’s default config loads your full context, session history, and memory on every API call — including the heartbeats that fire every 30 minutes.
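A rough back-of-envelope shows why (the per-call context size here is an illustrative assumption; real payloads vary by setup): at roughly 50,000 tokens of loaded context per heartbeat and Opus-class input pricing of about $15 per million tokens:
48 heartbeats/day × 50,000 tokens × $15 / 1,000,000 tokens ≈ $36/day
That's about $36 a day in input costs alone, before the agent does any real work.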
Running locally eliminates this problem. The agent can heartbeat all day and your electricity bill goes up by pennies.
What You Need
Hardware Requirements
Local OpenClaw needs a GPU. The model running your agent determines what tasks succeed and what tasks fail, and bigger models succeed more often.
| VRAM | Best Model | Agent Capability | Success Rate (Routine Tasks) |
|---|---|---|---|
| 8GB | Qwen 3 8B (Q4) | Basic — single-step tasks only | ~40-50% |
| 12GB | Qwen 3 14B (Q4) or DeepSeek-R1-Distill 14B | Moderate — simple chains work | ~50-60% |
| 16GB | DeepSeek-R1-Distill 14B (Q8) | Better quality on moderate tasks | ~60-70% |
| 24GB | Qwen 3 32B (Q4_K_M) | Practical — most routine tasks succeed | ~80-90% |
| 48GB+ | Llama 3.3 70B (Q4) | Strong — approaches API quality | ~90%+ |
24GB is the realistic minimum for useful local agent work. That means an RTX 3090 ($700-900 used on eBay or Amazon) or an RTX 4090. For a deeper breakdown of which models work at each tier, see our model selection guide.
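Not sure what your card has? A quick check on any NVIDIA GPU:
# Report GPU name and total VRAM
nvidia-smi --query-gpu=name,memory.total --format=csv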
Software Requirements
- Ollama — Manages model downloads, inference, and the OpenAI-compatible API
- OpenClaw — The agent platform itself
- A model — Downloaded through Ollama (3-40GB depending on choice)
Step 1: Install and Configure Ollama
If you already have Ollama running, skip to Step 2.
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Verify it's running
ollama --version
Pull Your Agent Model
Choose based on your VRAM. Pull the model before configuring OpenClaw — the download takes time and you want to verify it works first.
24GB VRAM (recommended):
ollama pull qwen3:32b
12-16GB VRAM:
ollama pull deepseek-r1:14b
8GB VRAM (limited):
ollama pull qwen3:8b
Test the Model
Run a quick sanity check:
ollama run qwen3:32b "List three files in a typical Linux home directory and explain what each contains. Be concise."
If you get a coherent response, Ollama is working. If it errors out or runs painfully slow, you probably have a VRAM issue. Check nvidia-smi to confirm GPU memory usage.
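Another quick check is ollama ps, which shows the loaded model's size and whether it's running fully on the GPU or partially offloaded to system RAM (a CPU/GPU split is the usual cause of painfully slow responses):
# Show loaded models, their size, and GPU/CPU placement
ollama ps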
Verify the API Endpoint
OpenClaw connects through Ollama’s OpenAI-compatible API, not the chat interface:
curl http://localhost:11434/v1/models
You should see your pulled model listed. This is the endpoint OpenClaw will use.
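You can also exercise the exact route OpenClaw will call, the chat completions endpoint, with a minimal OpenAI-format request (swap in whichever model ID you pulled):
# Minimal chat completion against Ollama's OpenAI-compatible API
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:32b",
    "messages": [{"role": "user", "content": "Reply with the single word: ready"}]
  }'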
Step 2: Configure OpenClaw for Local-Only
The Config File
Edit ~/.openclaw/openclaw.json. If you’re starting fresh, this is the entire config needed for local-only operation:
{
"models": {
"providers": {
"ollama": {
"baseUrl": "http://127.0.0.1:11434/v1",
"apiKey": "ollama-local",
"api": "openai-completions",
"models": [
{ "id": "qwen3:32b" }
]
}
}
},
"agents": {
"defaults": {
"model": {
"primary": "ollama/qwen3:32b",
"heartbeat": "ollama/qwen3:32b"
}
}
}
}
Replace qwen3:32b with whatever model you pulled in Step 1.
What Each Field Does
| Field | Purpose |
|---|---|
| baseUrl | Points to Ollama's local API. Always http://127.0.0.1:11434/v1 unless you changed Ollama's port. |
| apiKey | Ollama doesn't need authentication. Use any non-empty string — "ollama-local" is convention. |
| api | Must be "openai-completions" for Ollama's compatibility layer. |
| models | List the model IDs exactly as they appear in ollama list. |
| primary | The model that handles all agent tasks. |
| heartbeat | The model that handles keep-alive pings. Same model is fine for local — it's free either way. |
No API Keys Needed
That’s the point. No Anthropic key, no OpenAI key, no environment variables to set. The apiKey field exists because the OpenAI API format requires it, but Ollama ignores it.
Step 3: Optimize for Agent Workloads
Default Ollama settings are tuned for chat, not agent work. Agents need larger context windows and more predictable output.
Create a Custom Modelfile
cat << 'EOF' > ~/openclaw-agent.Modelfile
FROM qwen3:32b
PARAMETER temperature 0.7
PARAMETER num_ctx 16384
PARAMETER num_predict 4096
PARAMETER repeat_penalty 1.1
EOF
ollama create openclaw-agent -f ~/openclaw-agent.Modelfile
Then update your openclaw.json to use openclaw-agent instead of qwen3:32b.
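For reference, only the model IDs change; the relevant fragments of openclaw.json end up looking like this:
"models": [
  { "id": "openclaw-agent" }
]

"model": {
  "primary": "ollama/openclaw-agent",
  "heartbeat": "ollama/openclaw-agent"
}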
Why These Settings
| Parameter | Default | Agent Setting | Reason |
|---|---|---|---|
| temperature | 0.8 | 0.7 | Lower = more consistent tool calls and structured output |
| num_ctx | 2048-4096 | 16384 | Agent conversations accumulate fast: task descriptions, file contents, error logs |
| num_predict | 128-2048 | 4096 | Agents sometimes generate long code blocks or detailed plans |
| repeat_penalty | 1.1 | 1.1 | Prevents the model from getting stuck in loops, which matters for agents |
Context Window Tradeoffs
Increasing num_ctx eats VRAM. On 24GB with Qwen 3 32B at Q4:
| Context Length | VRAM Used | Remaining Headroom |
|---|---|---|
| 4096 | ~19GB | ~5GB |
| 8192 | ~20GB | ~4GB |
| 16384 | ~22GB | ~2GB |
| 32768 | ~25GB+ | OOM on 24GB |
16384 is the practical maximum on 24GB. If you need longer context, drop to a smaller model or get more VRAM.
Step 4: Multi-Model Local Setup (Optional)
If you have the VRAM or patience for model swapping, you can run multiple local models for different task types. It’s the same tiered routing concept from our token optimization guide, but entirely free.
Two-Model Config
{
"models": {
"providers": {
"ollama": {
"baseUrl": "http://127.0.0.1:11434/v1",
"apiKey": "ollama-local",
"api": "openai-completions",
"models": [
{ "id": "qwen3:32b" },
{ "id": "qwen3:8b" }
]
}
}
},
"agents": {
"defaults": {
"model": {
"primary": "ollama/qwen3:32b",
"heartbeat": "ollama/qwen3:8b"
}
}
}
}
This uses the 8B model for heartbeats and reserves 32B for real work. On 24GB VRAM, Ollama swaps models automatically. Only one is loaded at a time, so VRAM isn’t doubled.
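Whether Ollama keeps one or several models resident depends on your version and available VRAM. If you have the headroom (48GB+), you can raise the limit so heartbeats don't trigger a model reload every 30 minutes. This is a standard Ollama environment variable, but confirm the behavior against your version's docs:
# Allow two models to stay loaded simultaneously (needs enough VRAM for both)
export OLLAMA_MAX_LOADED_MODELS=2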
Specialist Models
Pull different models for different strengths:
# General agent tasks
ollama pull qwen3:32b
# Code-heavy skills
ollama pull qwen2.5-coder:32b
# Complex reasoning and planning
ollama pull deepseek-r1:32b
Swap between them manually or configure OpenClaw skills to specify which model they prefer. The coding skill routes to Qwen Coder, the planning skill routes to DeepSeek-R1.
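Manual swapping is just a config edit: point primary at whichever specialist you want for the session (and make sure that ID is also listed in the provider's models array), for example:
"model": {
  "primary": "ollama/qwen2.5-coder:32b",
  "heartbeat": "ollama/qwen3:8b"
}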
What Works and What Doesn’t
Tasks That Work Well Locally
| Task Type | Example | Why It Works |
|---|---|---|
| File management | Organize downloads by type, rename batch files | Simple logic, clear instructions |
| Code generation | Write a Python script, create a bash automation | 32B models are decent coders |
| Data extraction | Parse CSVs, extract info from text files | Structured input/output |
| Scheduled routines | Daily backup, log rotation, report generation | Repeatable, well-defined |
| Local research | Summarize documents, compare files, search notes | No external API needed |
Tasks That Struggle Locally
| Task Type | Example | Why It Fails |
|---|---|---|
| Complex multi-step chains | 10+ step workflows with dependencies | Models lose track of state |
| Web interaction | Browse sites, fill forms, extract dynamic content | Needs real browser automation + strong reasoning |
| Novel problem-solving | “Figure out why X isn’t working” | Requires broad knowledge + creative inference |
| Long-running orchestration | Overnight batch with 14 sub-agents | Context accumulates, errors compound |
| Ambiguous instructions | “Make this better” | Needs inference about intent |
The Honest Assessment
Running OpenClaw 100% local on a 32B model is like having a reliable but junior employee. Give it clear instructions and well-defined tasks, and it performs. Give it vague direction or complex problems, and it flounders.
The power users running Claude Opus get the “wow” demos — agents that call restaurants, build entire websites, and orchestrate 14 parallel sub-agents overnight. Local models aren’t there yet. They might never be. The gap between a 32B parameter model and a frontier API model is real.
But “not Claude Opus” isn’t the same as “useless.” A locally-running agent that handles file organization, code generation, and scheduled tasks is useful, especially when it costs nothing to run.
Troubleshooting
“Model not found” Error
OpenClaw can’t find your Ollama model. Check:
# List available models
ollama list
# Verify the model ID in openclaw.json matches exactly
# Wrong: "qwen3-32b"
# Right: "qwen3:32b"
Slow Responses
Agent tasks can take 30-60 seconds per response on 32B models. This is normal for local inference. If responses take minutes:
- Check nvidia-smi. Is the model actually on the GPU?
- Reduce num_ctx. Large context windows slow inference.
- Try a smaller model. 14B responds 2-3x faster than 32B.
Out of Memory (OOM)
The model is too large for your VRAM, or the context window is too high:
# Check VRAM usage
nvidia-smi
# Reduce context in your modelfile
PARAMETER num_ctx 8192
# Or switch to a smaller model
ollama pull qwen3:14b
Agent Gets Stuck in Loops
Local models sometimes repeat themselves or get stuck. Two fixes:
- Increase repeat_penalty to 1.2-1.3 in your modelfile
- Add explicit instructions in your agent's memory: "If a task fails twice with the same approach, stop and report the failure instead of retrying."
Ollama Crashes Under Load
Long agent sessions can exhaust system resources:
# Monitor during agent runs
watch -n 1 nvidia-smi
# Set Ollama to release VRAM when idle
export OLLAMA_KEEP_ALIVE=5m
The KEEP_ALIVE setting unloads the model after 5 minutes of inactivity, freeing VRAM. The next request takes a few seconds longer while the model reloads.
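Note that the export only applies to your current shell. On Linux installs done with the script above, Ollama typically runs as a systemd service, so the setting needs to live in the service environment to persist across reboots (assuming the default service name):
# Open an override file for the Ollama service
sudo systemctl edit ollama

# In the editor, add:
# [Service]
# Environment="OLLAMA_KEEP_ALIVE=5m"

# Then apply it
sudo systemctl restart ollama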
Local vs API: When to Switch
Running local isn’t always the right call. Here’s a framework:
| Factor | Go Local | Use API |
|---|---|---|
| Budget | Can’t afford $15-30/month | $30/month is nothing |
| Privacy | Data can’t leave your machine | Cloud providers are fine |
| Task complexity | Routine, well-defined tasks | Complex, novel problems |
| Reliability | Acceptable if some tasks fail | Must succeed first try |
| Speed | Can wait 30-60s per response | Need 2-5s responses |
| Hardware | Have 24GB+ VRAM | No GPU or under 16GB |
The hybrid approach from our token optimization guide (local for heartbeats and simple tasks, API for complex work) is the practical sweet spot for most users. Going 100% local is for people who either can't or won't use cloud APIs.
Cost Comparison: One Year
| Setup | Year 1 Cost | Year 2 Cost | 3-Year Total |
|---|---|---|---|
| OpenClaw + Opus (default) | $1,080-1,800 | $1,080-1,800 | $3,240-5,400 |
| OpenClaw + Sonnet (optimized) | $180-360 | $180-360 | $540-1,080 |
| OpenClaw + Ollama (100% local) | $700-900 (GPU) + ~$30 electricity | ~$30 electricity | $760-960 |
| OpenClaw + Ollama (already have GPU) | ~$30 electricity | ~$30 electricity | ~$90 |
If you already own a 24GB GPU, going local is obviously cheaper from day one. If you need to buy one, a used RTX 3090 pays for itself in roughly 5-10 months against default Opus usage; against a heavily optimized $15-30/month API setup, the break-even stretches to two years or more, and the GPU holds resale value when you're done.
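The break-even math, using this guide's own numbers:
$700-900 GPU ÷ $90-150/month (default Opus, from the Year 1 row above) ≈ 5-10 months
$700-900 GPU ÷ $15-30/month (fully optimized API usage) ≈ 2-5 years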
Bottom Line
Running OpenClaw 100% local works. Not as impressively as Claude Opus, but well enough for a real daily-use agent that handles file management, code generation, and scheduled tasks. The setup takes 15-20 minutes, costs nothing after the initial hardware purchase, and kills the entire “my agent drained my API balance overnight” problem.
The recipe:
- Install Ollama, pull Qwen 3 32B (24GB VRAM) or the best model your hardware supports
- Point openclaw.json at http://127.0.0.1:11434/v1
- Create a custom modelfile with agent-optimized settings
- Accept the capability tradeoffs: clear instructions, well-defined tasks, patience with slower responses
If you have 24GB VRAM and want an always-on agent that doesn’t bill you, this is the setup. If you need Claude-tier reasoning, keep the API key and use our tiered routing approach instead.
Related Guides
- Best Local Models for OpenClaw — which models work for agent tasks by VRAM tier
- OpenClaw Token Optimization — hybrid approach for 97% cost reduction
- OpenClaw Setup Guide — installation from scratch
- OpenClaw Security Guide — lock down before connecting real accounts
- Run Your First Local LLM — Ollama beginner guide
- How Much VRAM Do You Need? — hardware requirements explained
- Used RTX 3090 Buying Guide — best GPU for budget local AI