Running OpenClaw 100% Local — Zero API Costs
📚 More on this topic: Best Models for OpenClaw · OpenClaw Token Optimization · OpenClaw Setup Guide · OpenClaw Security Guide · Run Your First Local LLM
Most OpenClaw guides assume you’re running Claude or GPT-4 behind the scenes. That means API keys, monthly bills, and the nagging anxiety of watching your Anthropic balance drain while the agent runs overnight.
There’s another path. OpenClaw’s architecture doesn’t care where the intelligence comes from. It speaks the OpenAI API format, and Ollama speaks it too. Point the config at localhost, pull a capable model, and the entire system runs on your hardware. No API keys, no cloud calls, no monthly bills.
The cost is capability. A local 32B model isn’t Claude Opus. But for a surprising amount of agent work (file management, code generation, research compilation, scheduled automations), it’s enough.
What You’re Replacing (And What It Costs)
Before configuring anything, understand what running OpenClaw on cloud APIs actually costs. These numbers come from real user reports and our token optimization guide.
The Default API Bill
| Usage Pattern | Monthly Cost (Sonnet) | Monthly Cost (Opus) |
|---|---|---|
| Idle only (heartbeats) | $60-150 | $150-450 |
| Light daily use (1-2 tasks) | $90-200 | $250-600 |
| Active daily use (5-10 tasks) | $200-400 | $500-1,500 |
| Overnight batch jobs | $6-150 per run | $50-500 per run |
Even optimized setups spend $15-30/month. A fully local setup costs $0/month in recurring fees — you pay once for the hardware and electricity.
The Real Math
One OpenClaw user loaded $25 onto Anthropic and watched it drain to $5 in a single day with the agent sitting idle. Another woke up to $500 gone overnight from a runaway task. These aren’t edge cases. OpenClaw’s default config loads your full context, session history, and memory on every API call — including the heartbeats that fire every 30 minutes.
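A rough back-of-envelope shows why (the per-call context size here is an illustrative assumption; real payloads vary by setup): at roughly 50,000 tokens of loaded context per heartbeat and Opus-class input pricing of about $15 per million tokens:
48 heartbeats/day × 50,000 tokens × $15 / 1,000,000 tokens ≈ $36/day
That's about $36 a day in input costs alone, before the agent does any real work.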
Running locally eliminates this problem. The agent can heartbeat all day and your electricity bill goes up by pennies.
What You Need
Hardware Requirements
Local OpenClaw needs a GPU. The model running your agent determines what tasks succeed and what tasks fail, and bigger models succeed more often.
| VRAM | Best Model | Agent Capability | Success Rate (Routine Tasks) |
|---|---|---|---|
| 8GB | Qwen 3 8B (Q4) | Basic — single-step tasks only | ~40-50% |
| 12GB | Qwen 3 14B (Q4) or DeepSeek-R1-Distill 14B | Moderate — simple chains work | ~50-60% |
| 16GB | DeepSeek-R1-Distill 14B (Q8) | Better quality on moderate tasks | ~60-70% |
| 24GB | Qwen 3 32B (Q4_K_M) | Practical — most routine tasks succeed | ~80-90% |
| 48GB+ | Llama 3.3 70B (Q4) | Strong — approaches API quality | ~90%+ |
24GB is the realistic minimum for useful local agent work. That means an RTX 3090 ($700-900 used on eBay or Amazon) or an RTX 4090. For a deeper breakdown of which models work at each tier, see our model selection guide.
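Not sure what your card has? A quick check on any NVIDIA GPU:
# Report GPU name and total VRAM
nvidia-smi --query-gpu=name,memory.total --format=csv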
Software Requirements
- Ollama — Manages model downloads, inference, and the OpenAI-compatible API
- OpenClaw — The agent platform itself
- A model — Downloaded through Ollama (3-40GB depending on choice)
Step 1: Install and Configure Ollama
If you already have Ollama running, skip to Step 2.
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Verify it's running
ollama --version
Pull Your Agent Model
Choose based on your VRAM. Pull the model before configuring OpenClaw — the download takes time and you want to verify it works first.
24GB VRAM (recommended):
ollama pull qwen3:32b
12-16GB VRAM:
ollama pull deepseek-r1:14b
8GB VRAM (limited):
ollama pull qwen3:8b
Test the Model
Run a quick sanity check:
ollama run qwen3:32b "List three files in a typical Linux home directory and explain what each contains. Be concise."
If you get a coherent response, Ollama is working. If it errors out or runs painfully slow, you probably have a VRAM issue. Check nvidia-smi to confirm GPU memory usage.
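Another quick check is ollama ps, which shows the loaded model's size and whether it's running fully on the GPU or partially offloaded to system RAM (a CPU/GPU split is the usual cause of painfully slow responses):
# Show loaded models, their size, and GPU/CPU placement
ollama ps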
Verify the API Endpoint
OpenClaw connects through Ollama’s OpenAI-compatible API, not the chat interface:
curl http://localhost:11434/v1/models
You should see your pulled model listed. This is the endpoint OpenClaw will use.
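You can also exercise the exact route OpenClaw will call, the chat completions endpoint, with a minimal OpenAI-format request (swap in whichever model ID you pulled):
# Minimal chat completion against Ollama's OpenAI-compatible API
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:32b",
    "messages": [{"role": "user", "content": "Reply with the single word: ready"}]
  }'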
Step 2: Configure OpenClaw for Local-Only
The Config File
Edit ~/.openclaw/openclaw.json. If you’re starting fresh, this is the entire config needed for local-only operation:
{
"models": {
"providers": {
"ollama": {
"baseUrl": "http://127.0.0.1:11434/v1",
"apiKey": "ollama-local",
"api": "openai-completions",
"models": [
{ "id": "qwen3:32b" }
]
}
}
},
"agents": {
"defaults": {
"model": {
"primary": "ollama/qwen3:32b",
"heartbeat": "ollama/qwen3:32b"
}
}
}
}
Replace qwen3:32b with whatever model you pulled in Step 1.
What Each Field Does
| Field | Purpose |
|---|---|
| baseUrl | Points to Ollama's local API. Always http://127.0.0.1:11434/v1 unless you changed Ollama's port. |
| apiKey | Ollama doesn't need authentication. Use any non-empty string — "ollama-local" is convention. |
| api | Must be "openai-completions" for Ollama's compatibility layer. |
| models | List the model IDs exactly as they appear in ollama list. |
| primary | The model that handles all agent tasks. |
| heartbeat | The model that handles keep-alive pings. Same model is fine for local — it's free either way. |
No API Keys Needed
That’s the point. No Anthropic key, no OpenAI key, no environment variables to set. The apiKey field exists because the OpenAI API format requires it, but Ollama ignores it.
Step 3: Optimize for Agent Workloads
Default Ollama settings are tuned for chat, not agent work. Agents need larger context windows and more predictable output.
Create a Custom Modelfile
cat << 'EOF' > ~/openclaw-agent.Modelfile
FROM qwen3:32b
PARAMETER temperature 0.7
PARAMETER num_ctx 16384
PARAMETER num_predict 4096
PARAMETER repeat_penalty 1.1
EOF
ollama create openclaw-agent -f ~/openclaw-agent.Modelfile
Then update your openclaw.json to use openclaw-agent instead of qwen3:32b.
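For reference, only the model IDs change; the relevant fragments of openclaw.json end up looking like this:
"models": [
  { "id": "openclaw-agent" }
]

"model": {
  "primary": "ollama/openclaw-agent",
  "heartbeat": "ollama/openclaw-agent"
}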
Why These Settings
| Parameter | Default | Agent Setting | Reason |
|---|---|---|---|
| temperature | 0.8 | 0.7 | Lower = more consistent tool calls and structured output |
| num_ctx | 2048-4096 | 16384 | Agent conversations accumulate fast: task descriptions, file contents, error logs |
| num_predict | 128-2048 | 4096 | Agents sometimes generate long code blocks or detailed plans |
| repeat_penalty | 1.1 | 1.1 | Prevents the model from getting stuck in loops, which matters for agents |
Context Window Tradeoffs
Increasing num_ctx eats VRAM. On 24GB with Qwen 3 32B at Q4:
| Context Length | VRAM Used | Remaining Headroom |
|---|---|---|
| 4096 | ~19GB | ~5GB |
| 8192 | ~20GB | ~4GB |
| 16384 | ~22GB | ~2GB |
| 32768 | ~25GB+ | OOM on 24GB |
16384 is the practical maximum on 24GB. If you need longer context, drop to a smaller model or get more VRAM.
Step 4: Multi-Model Local Setup (Optional)
If you have the VRAM or patience for model swapping, you can run multiple local models for different task types. It’s the same tiered routing concept from our token optimization guide, but entirely free.
Two-Model Config
{
"models": {
"providers": {
"ollama": {
"baseUrl": "http://127.0.0.1:11434/v1",
"apiKey": "ollama-local",
"api": "openai-completions",
"models": [
{ "id": "qwen3:32b" },
{ "id": "qwen3:8b" }
]
}
}
},
"agents": {
"defaults": {
"model": {
"primary": "ollama/qwen3:32b",
"heartbeat": "ollama/qwen3:8b"
}
}
}
}
This uses the 8B model for heartbeats and reserves 32B for real work. On 24GB VRAM, Ollama swaps models automatically. Only one is loaded at a time, so VRAM isn’t doubled.
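Whether Ollama keeps one or several models resident depends on your version and available VRAM. If you have the headroom (48GB+), you can raise the limit so heartbeats don't trigger a model reload every 30 minutes. This is a standard Ollama environment variable, but confirm the behavior against your version's docs:
# Allow two models to stay loaded simultaneously (needs enough VRAM for both)
export OLLAMA_MAX_LOADED_MODELS=2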
Specialist Models
Pull different models for different strengths:
# General agent tasks
ollama pull qwen3:32b
# Code-heavy skills
ollama pull qwen2.5-coder:32b
# Complex reasoning and planning
ollama pull deepseek-r1:32b
Swap between them manually or configure OpenClaw skills to specify which model they prefer. The coding skill routes to Qwen Coder, the planning skill routes to DeepSeek-R1.
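Manual swapping is just a config edit: point primary at whichever specialist you want for the session (and make sure that ID is also listed in the provider's models array), for example:
"model": {
  "primary": "ollama/qwen2.5-coder:32b",
  "heartbeat": "ollama/qwen3:8b"
}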
What Works and What Doesn’t
Tasks That Work Well Locally
| Task Type | Example | Why It Works |
|---|---|---|
| File management | Organize downloads by type, rename batch files | Simple logic, clear instructions |
| Code generation | Write a Python script, create a bash automation | 32B models are decent coders |
| Data extraction | Parse CSVs, extract info from text files | Structured input/output |
| Scheduled routines | Daily backup, log rotation, report generation | Repeatable, well-defined |
| Local research | Summarize documents, compare files, search notes | No external API needed |
Tasks That Struggle Locally
| Task Type | Example | Why It Fails |
|---|---|---|
| Complex multi-step chains | 10+ step workflows with dependencies | Models lose track of state |
| Web interaction | Browse sites, fill forms, extract dynamic content | Needs real browser automation + strong reasoning |
| Novel problem-solving | “Figure out why X isn’t working” | Requires broad knowledge + creative inference |
| Long-running orchestration | Overnight batch with 14 sub-agents | Context accumulates, errors compound |
| Ambiguous instructions | “Make this better” | Needs inference about intent |
The Honest Assessment
Running OpenClaw 100% local on a 32B model is like having a reliable but junior employee. Give it clear instructions and well-defined tasks, and it performs. Give it vague direction or complex problems, and it flounders.
The power users running Claude Opus get the “wow” demos — agents that call restaurants, build entire websites, and orchestrate 14 parallel sub-agents overnight. Local models aren’t there yet. They might never be. The gap between a 32B parameter model and a frontier API model is real.
But “not Claude Opus” isn’t the same as “useless.” A locally-running agent that handles file organization, code generation, and scheduled tasks is useful, especially when it costs nothing to run.
Troubleshooting
“Model not found” Error
OpenClaw can’t find your Ollama model. Check:
# List available models
ollama list
# Verify the model ID in openclaw.json matches exactly
# Wrong: "qwen3-32b"
# Right: "qwen3:32b"
Slow Responses
Agent tasks can take 30-60 seconds per response on 32B models. This is normal for local inference. If responses take minutes:
- Check nvidia-smi. Is the model actually on the GPU?
- Reduce num_ctx. Large context windows slow inference.
- Try a smaller model. 14B responds 2-3x faster than 32B.
Out of Memory (OOM)
The model is too large for your VRAM, or the context window is too high:
# Check VRAM usage
nvidia-smi
# Reduce context in your modelfile
PARAMETER num_ctx 8192
# Or switch to a smaller model
ollama pull qwen3:14b
Agent Gets Stuck in Loops
Local models sometimes repeat themselves or get stuck. Two fixes:
- Increase repeat_penalty to 1.2-1.3 in your modelfile
- Add explicit instructions in your agent's memory: "If a task fails twice with the same approach, stop and report the failure instead of retrying."
Ollama Crashes Under Load
Long agent sessions can exhaust system resources:
# Monitor during agent runs
watch -n 1 nvidia-smi
# Set Ollama to release VRAM when idle
export OLLAMA_KEEP_ALIVE=5m
The KEEP_ALIVE setting unloads the model after 5 minutes of inactivity, freeing VRAM. The next request takes a few seconds longer while the model reloads.
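Note that the export only applies to your current shell. On Linux installs done with the script above, Ollama typically runs as a systemd service, so the setting needs to live in the service environment to persist across reboots (assuming the default service name):
# Open an override file for the Ollama service
sudo systemctl edit ollama

# In the editor, add:
# [Service]
# Environment="OLLAMA_KEEP_ALIVE=5m"

# Then apply it
sudo systemctl restart ollama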
Local vs API: When to Switch
Running local isn’t always the right call. Here’s a framework:
| Factor | Go Local | Use API |
|---|---|---|
| Budget | Can’t afford $15-30/month | $30/month is nothing |
| Privacy | Data can’t leave your machine | Cloud providers are fine |
| Task complexity | Routine, well-defined tasks | Complex, novel problems |
| Reliability | Acceptable if some tasks fail | Must succeed first try |
| Speed | Can wait 30-60s per response | Need 2-5s responses |
| Hardware | Have 24GB+ VRAM | No GPU or under 16GB |
The hybrid approach from our token optimization guide (local for heartbeats and simple tasks, API for complex work) is the practical sweet spot for most users. Going 100% local is for people who either can't or won't use cloud APIs.
Cost Comparison: One Year
| Setup | Year 1 Cost | Year 2 Cost | 3-Year Total |
|---|---|---|---|
| OpenClaw + Opus (default) | $1,080-1,800 | $1,080-1,800 | $3,240-5,400 |
| OpenClaw + Sonnet (optimized) | $180-360 | $180-360 | $540-1,080 |
| OpenClaw + Ollama (100% local) | $700-900 (GPU) + ~$30 electricity | ~$30 electricity | $760-960 |
| OpenClaw + Ollama (already have GPU) | ~$30 electricity | ~$30 electricity | ~$90 |
If you already own a 24GB GPU, going local is obviously cheaper from day one. If you need to buy one, a used RTX 3090 pays for itself in roughly 5-10 months against default Opus usage; against a heavily optimized $15-30/month API setup, the break-even stretches to two years or more, and the GPU holds resale value when you're done.
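The break-even math, using this guide's own numbers:
$700-900 GPU ÷ $90-150/month (default Opus, from the Year 1 row above) ≈ 5-10 months
$700-900 GPU ÷ $15-30/month (fully optimized API usage) ≈ 2-5 years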
Bottom Line
Running OpenClaw 100% local works. Not as impressively as Claude Opus, but well enough for a real daily-use agent that handles file management, code generation, and scheduled tasks. The setup takes 15-20 minutes, costs nothing after the initial hardware purchase, and kills the entire “my agent drained my API balance overnight” problem.
The recipe:
- Install Ollama, pull Qwen 3 32B (24GB VRAM) or the best model your hardware supports
- Point openclaw.json at http://127.0.0.1:11434/v1
- Create a custom modelfile with agent-optimized settings
- Accept the capability tradeoffs: clear instructions, well-defined tasks, patience with slower responses
If you have 24GB VRAM and want an always-on agent that doesn’t bill you, this is the setup. If you need Claude-tier reasoning, keep the API key and use our tiered routing approach instead.
Related Guides
- Best Local Models for OpenClaw — which models work for agent tasks by VRAM tier
- OpenClaw Token Optimization — hybrid approach for 97% cost reduction
- OpenClaw Setup Guide — installation from scratch
- OpenClaw Security Guide — lock down before connecting real accounts
- Run Your First Local LLM — Ollama beginner guide
- How Much VRAM Do You Need? — hardware requirements explained
- Used RTX 3090 Buying Guide — best GPU for budget local AI