📚 More on this topic: OpenClaw Setup Guide · OpenClaw Security Guide · Qwen Models Guide · DeepSeek Models Guide · What Can You Run on 24GB VRAM

OpenClaw doesn’t care what model powers it — you can plug in Claude, GPT-4, Gemini, or a local model through Ollama. But the model choice matters enormously for agent performance. An agent that needs to write code, debug failures, use tools, and recover from errors requires different capabilities than a chatbot.

This guide covers which local models actually work for agent tasks, what VRAM you need, and what the power users are running.


What Agent Tasks Require

Agent work is harder than chat. Here’s why:

| Capability | Why Agents Need It | What Tests It |
| --- | --- | --- |
| Tool use | Agents call APIs, run shell commands, manipulate files | Function calling, structured output |
| Multi-step reasoning | Tasks span many actions with dependencies | Chain-of-thought, planning |
| Code generation | Building skills, debugging, automation | Coding benchmarks, real-world bugs |
| Error recovery | First approach often fails; agent must adapt | Self-correction, alternative solutions |
| Instruction following | Complex prompts with multiple constraints | Following formats precisely |
| Long context | Conversation history, file contents, task state | Context utilization at 8K-32K |
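
A concrete look at the first row: "tool use" means the model must emit a structured function call, not prose. Here is a minimal sketch of what that request looks like against Ollama's /api/chat endpoint; the get_current_weather tool and the helper name are illustrative examples, not part of OpenClaw:

```python
import json

def build_tool_call_request(model: str, user_message: str) -> dict:
    """Build an Ollama /api/chat payload offering the model one callable tool.

    The `tools` array uses the OpenAI-style function schema that Ollama's
    chat API accepts; `get_current_weather` is an illustrative example.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_current_weather",
                    "description": "Get the current weather for a city",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "city": {"type": "string", "description": "City name"},
                        },
                        "required": ["city"],
                    },
                },
            }
        ],
        "stream": False,  # agents usually want the complete tool call, not a token stream
    }

payload = build_tool_call_request("qwen3.5:27b", "What's the weather in Berlin?")
print(json.dumps(payload, indent=2))
```

A model that can't reliably fill in that schema can't drive an agent, which is why the tiers below keep coming back to function-calling quality.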

The famous restaurant reservation story from OpenClaw’s early days captures this: when OpenTable didn’t have availability, the agent autonomously downloaded voice software, called the restaurant, and made the reservation over the phone. That required code generation, tool use, error recovery, and multi-step planning — all in one task.

7B models struggle here. They can chat, but they can’t reliably orchestrate complex workflows.


What Power Users Actually Run

The “Society of Minds” Approach

Wes Roth, one of the most active OpenClaw experimenters, doesn’t rely on a single model. His setup uses:

  • Claude Opus 4.5 (via API) — Main orchestrator for complex tasks
  • Gemini 2.0 Pro — Specialized queries (Google APIs, YouTube optimization)
  • Local models (Ollama) — Cost-effective sub-tasks, always-on availability

The insight: Claude Opus struggled with YouTube API efficiency. Gemini, being Google-adjacent, suggested using RSS feeds instead of expensive API calls — a solution Claude didn’t surface. Different models have different knowledge and strengths.

The practical lesson: Pure local-only is a constraint, not a virtue. The most capable setups use the right model for each task.

What Wes Roth’s Agent Actually Does

In his first 24 hours:

  • Set up voice communication (Whisper + 11Labs)
  • Created YouTube analytics tools (pulling thousands of videos)
  • Built thumbnail analysis (5,700 images analyzed)
  • Self-replicated to a VPS
  • Generated AI videos on demand
  • Created WordPress pages autonomously

All of this was done with Claude Opus as the backbone; when he tried to run everything locally, he ran into the capability limits local models still have.

The Local-First Community

Others in the OpenClaw community run local-only for privacy or cost reasons. Their reports:

  • 32B models (Qwen 3, DeepSeek-R1-Distill-32B): Work reasonably well for most agent tasks
  • 14B models: Marginal — succeed at simpler tasks, fail at complex chains
  • 7B models: Not recommended for serious agent work

8GB VRAM (RTX 3060, 4060)

Honest assessment: Limited agent capability. 7B models can handle simple, single-step tasks but struggle with complex workflows.

| Model | Size | Context | Agent Suitability |
| --- | --- | --- | --- |
| Qwen 3.5 9B (Q4) | ~6.6GB | 262K | Recommended — beats GPT-OSS-120B on GPQA Diamond. Vision-capable. Tool calling works in Ollama v0.17.6+. |
| Qwen 3 8B (Q4) | ~5GB | 32K | Basic tasks, will fail on complex chains |
| Llama 3.1 8B (Q4) | ~5GB | 128K | Longer context helps, still limited reasoning |
| DeepSeek-R1-Distill-Qwen-7B | ~5GB | 32K | Better reasoning, limited overall capability |

Recommendation: Qwen 3.5 9B is the backbone model at this tier. Stronger reasoning, 262K context, built-in vision, and function calling that actually works (the Ollama pipeline mismatch was fixed in v0.17.6). Update Ollama before pulling: curl -fsSL https://ollama.com/install.sh | sh. Still best to route complex tasks to an API.

ollama run qwen3.5:9b

12GB VRAM (RTX 3060 12GB, 4070)

Assessment: Can run 14B models, which handle simple-to-moderate agent tasks.

| Model | Size | Context | Agent Suitability |
| --- | --- | --- | --- |
| Qwen 3 14B (Q4) | ~9GB | 32K | Decent all-rounder |
| DeepSeek-R1-Distill-Qwen-14B | ~9GB | 32K | Strong reasoning, good for planning |
| Mistral Nemo 12B (Q4) | ~8GB | 128K | Long context, moderate capability |

Recommendation: DeepSeek-R1-Distill-Qwen-14B for reasoning-heavy workflows. Qwen 3 14B for general use.

# Best for 12GB — reasoning focus
ollama run deepseek-r1:14b

# Alternative — general purpose
ollama run qwen3:14b

16GB VRAM (RTX 4060 Ti 16GB, 4080)

Assessment: Sweet spot starts here. Can run larger 14B models at higher quantization or squeeze in smaller 30B+ models.

| Model | Size | Context | Agent Suitability |
| --- | --- | --- | --- |
| Qwen 3 14B (Q8) | ~15GB | 32K | Higher quality 14B |
| DeepSeek-R1-Distill-Qwen-14B (Q8) | ~15GB | 32K | Best reasoning at this tier |
| Qwen 3 32B (Q4, low context) | ~18GB | 8K | Possible with aggressive settings and partial CPU offload |

Recommendation: DeepSeek-R1-Distill-Qwen-14B at Q8 for best reasoning. The Q8 quantization matters for agent work — fewer errors on complex instructions.

# Best for 16GB
ollama run deepseek-r1:14b-q8_0

# Pushing it — needs tuning
ollama run qwen3:32b
# then inside the session: /set parameter num_ctx 8192

24GB VRAM (RTX 3090, 4090)

Assessment: This is where local agents get practical. 32B models handle most agent tasks reliably.

| Model | Size | Context | Agent Suitability |
| --- | --- | --- | --- |
| Qwen 3.5 27B (Q4_K_M) | ~17GB | 262K | Top pick — SWE-bench 72.4, GPQA Diamond 85.5, native vision. Tool calling works in Ollama v0.17.6+. |
| Qwen 3.5 35B-A3B (MoE) | ~17GB | 262K | 112 tok/s on RTX 3090 — faster than most dense 7B models. 3B active params per token. |
| Qwen 3 32B (Q4_K_M) | ~20GB | 32K | Solid all-rounder, 32K context limit |
| DeepSeek-R1-Distill-Qwen-32B | ~20GB | 32K | Excellent reasoning, thinking mode |
| Qwen 2.5 Coder 32B | ~20GB | 32K | Best for code-heavy skills |
| Llama 3.3 70B (Q4, partial offload) | ~40GB | Variable | Possible with CPU offload |

Recommendation: Qwen 3.5 27B. SWE-bench 72.4, fits at ~17GB Q4, function calling works properly since Ollama v0.17.6. If you want speed over raw capability, the 35B-A3B MoE variant runs at 112 tok/s on an RTX 3090 — faster than most dense 7B models because only 3B parameters are active per token. The KV cache stays smaller at long contexts thanks to the linear attention layers.

# Top pick — tool calling fixed in Ollama v0.17.6+
ollama run qwen3.5:27b

# Speed option — 112 tok/s on 3090, 3B active params
ollama run qwen3.5:35b-a3b

# For complex reasoning / planning
ollama run deepseek-r1:32b

# For skill creation / coding
ollama run qwen2.5-coder:32b

48GB+ VRAM (Dual 3090, A6000, etc.)

Assessment: Full capability. Can run 70B models that approach API quality.

| Model | Size | Context | Agent Suitability |
| --- | --- | --- | --- |
| Llama 3.3 70B (Q4_K_M) | ~40GB | 128K | Flagship open model |
| Qwen 3 72B (Q4) | ~42GB | 32K | Strong alternative |
| DeepSeek-R1-Distill-Llama-70B | ~40GB | 32K | Best open reasoning model |

Recommendation: Llama 3.3 70B for general agent work. DeepSeek-R1-Distill-Llama-70B when you need maximum reasoning capability.

# Best overall at 48GB
ollama run llama3.3:70b

# Maximum reasoning
ollama run deepseek-r1:70b

→ Use our Planning Tool to check exact VRAM for your setup.
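
The sizes in the tables above follow from simple arithmetic: weight memory is roughly parameter count times bits per weight divided by 8, plus headroom for the KV cache and runtime overhead. A rough sketch for sanity-checking a model before you pull it (the 20% margin is an assumption, not the Planning Tool's exact math):

```python
def estimate_vram_gb(params_b: float, bits_per_weight: float, overhead_frac: float = 0.2) -> float:
    """Rough VRAM estimate: weights = params * bits / 8, plus a fractional
    margin for KV cache, activations, and runtime overhead.
    The 20% default margin is a ballpark assumption, not a measured figure."""
    weight_gb = params_b * bits_per_weight / 8  # 1B params at 8 bits is about 1 GB
    return weight_gb * (1 + overhead_frac)

# A 32B model at Q4_K_M (~4.5 effective bits per weight):
print(round(estimate_vram_gb(32, 4.5), 1))  # → 21.6
```

That lands in the same ballpark as the ~20GB figures quoted for 32B Q4 models above; real usage varies with context length and quantization details, which is why the Planning Tool exists.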


Model Comparison for Agent Tasks

Coding & Skill Creation

When your agent needs to write its own tools:

| Model | Skill Building | Debugging | Self-Improvement |
| --- | --- | --- | --- |
| Qwen 3.5 27B | Excellent | Excellent | Very Good |
| Qwen 3.5 35B-A3B | Very Good | Very Good | Good — 112 tok/s on 3090, trades some quality for speed |
| Qwen 2.5 Coder 32B | Excellent | Excellent | Good |
| Qwen 3 32B | Very Good | Very Good | Very Good |
| DeepSeek-R1-Distill-32B | Good | Very Good | Excellent |
| Llama 3.3 70B | Very Good | Very Good | Good |

Winner: Qwen 3.5 27B scores 72.4 on SWE-bench Verified and 80.7 on LiveCodeBench v6 — first consumer-GPU model to match GPT-5 mini on SWE-bench. The 35B-A3B MoE variant trades some quality for speed (112 tok/s on 3090). Qwen 2.5 Coder 32B remains strong for pure code tasks.

Reasoning & Planning

For multi-step tasks with complex dependencies:

| Model | Planning | Error Recovery | Chain-of-Thought |
| --- | --- | --- | --- |
| Qwen 3.5 27B (/think mode) | Excellent | Excellent | Excellent |
| DeepSeek-R1-Distill-32B | Excellent | Excellent | Excellent |
| Qwen 3 32B (/think mode) | Excellent | Very Good | Excellent |
| Llama 3.3 70B | Very Good | Good | Good |
| Qwen 2.5 Coder 32B | Good | Good | Fair |

Winner: Qwen 3.5 27B scores 85.5 on GPQA Diamond and 92.0 on HMMT — both top marks at this parameter class. DeepSeek-R1-Distill-32B remains excellent for reasoning-heavy chains.

Tool Use & Function Calling

For structured output and API calls:

| Model | Function Calling | JSON Output | API Integration |
| --- | --- | --- | --- |
| Qwen 3.5 27B | Excellent (native) | Excellent | Excellent (fixed in Ollama v0.17.6+) |
| Qwen 3 32B | Excellent | Excellent | Excellent |
| Llama 3.3 70B | Very Good | Very Good | Very Good |
| DeepSeek-R1-Distill-32B | Good | Good | Good |
| Mistral Nemo 12B | Good | Good | Fair |

Winner: Qwen 3.5 27B. Native tool-calling support, trained on it, and the Ollama pipeline mismatch was fixed in v0.17.6. Update Ollama if you’re on an older version.
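
On the consuming side, a tool-capable model returns its call in the message.tool_calls field of the /api/chat response. A small sketch of extracting those calls; the sample response below is hand-written to show the shape, not captured from a real run:

```python
def extract_tool_calls(response: dict) -> list[tuple[str, dict]]:
    """Pull (name, arguments) pairs out of an Ollama /api/chat response.
    Returns an empty list when the model answered in plain text instead."""
    calls = response.get("message", {}).get("tool_calls", [])
    return [(c["function"]["name"], c["function"]["arguments"]) for c in calls]

# Illustrative response shape when the model decides to call a tool:
sample = {
    "message": {
        "role": "assistant",
        "content": "",
        "tool_calls": [
            {"function": {"name": "get_current_weather", "arguments": {"city": "Berlin"}}}
        ],
    }
}
print(extract_tool_calls(sample))  # → [('get_current_weather', {'city': 'Berlin'})]
```

The "tool calling works" notes in the table come down to exactly this: whether the model populates tool_calls with a valid name and well-formed arguments, or dumps a half-formed JSON string into content instead.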


Configuring Ollama for OpenClaw

Basic Setup

# Install or update Ollama (v0.17.6+ required for Qwen 3.5 tool calling)
curl -fsSL https://ollama.com/install.sh | sh

# Quickest setup — auto-configures Ollama + OpenClaw connection
ollama launch openclaw

# Or pull manually — Qwen 3.5 27B is the top pick
ollama pull qwen3.5:27b

# Speed option — 112 tok/s on RTX 3090
ollama pull qwen3.5:35b-a3b

# Verify it's working
ollama run qwen3.5:27b "Hello, can you confirm you're working?"

Exposing Ollama to OpenClaw

OpenClaw connects to Ollama’s API. By default, Ollama only listens on localhost:

# Check Ollama is running
curl http://localhost:11434/api/tags

If OpenClaw is on a different machine or in Docker, configure Ollama to listen on all interfaces:

# Set environment variable (add to ~/.bashrc or systemd service)
OLLAMA_HOST=0.0.0.0:11434
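
A client on another machine then needs the matching URL. Here is a small helper (an illustration, not OpenClaw's actual config code) that resolves the base URL the way Ollama clients conventionally do, honoring OLLAMA_HOST with a localhost fallback:

```python
import os

def ollama_base_url() -> str:
    """Resolve the Ollama API base URL: honor OLLAMA_HOST if set,
    otherwise fall back to the localhost default (127.0.0.1:11434)."""
    host = os.environ.get("OLLAMA_HOST", "127.0.0.1:11434")
    if not host.startswith(("http://", "https://")):
        host = "http://" + host  # OLLAMA_HOST is often given as bare host:port
    return host.rstrip("/")

# e.g. the model-list endpoint the curl check above hits:
print(ollama_base_url() + "/api/tags")
```

If OpenClaw reports it can't reach Ollama, checking which URL this resolves to on the OpenClaw side is usually the fastest diagnosis.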

Optimizing for Agent Workloads

Agents benefit from:

  1. Higher context length — Conversation history accumulates fast
  2. Consistent output — Lower temperature for reliable tool calls
  3. Longer timeouts — Complex tasks take time

In your Ollama modelfile or OpenClaw config:

# Example modelfile customization (build with: ollama create agent-model -f Modelfile)
FROM qwen3.5:27b

# low temperature for consistent tool calls
PARAMETER temperature 0.3
PARAMETER num_ctx 16384
PARAMETER num_predict 4096

Hybrid Approaches

Local + API (Best of Both)

The pattern power users follow:

  1. Use local for: Always-on availability, simple tasks, privacy-sensitive operations
  2. Use API for: Complex reasoning, long chains, tasks requiring maximum capability

Configure OpenClaw to route based on task complexity (requires custom skill development).
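
The routing itself can be simple. A sketch of the idea; the thresholds, keyword heuristics, and model names are illustrative choices, not OpenClaw defaults:

```python
def pick_model(task: str, steps: int) -> str:
    """Route a task to a backend with a crude complexity heuristic.
    Markers, thresholds, and model names are illustrative assumptions."""
    reasoning_markers = ("plan", "debug", "multi-step", "architecture")
    if steps >= 5 or any(m in task.lower() for m in reasoning_markers):
        return "claude-opus-4-5"        # API: long chains, hard reasoning
    if "code" in task.lower():
        return "qwen2.5-coder:32b"      # local specialist for skill development
    return "qwen3.5:27b"                # local default for routine work

print(pick_model("triage inbox", steps=1))          # → qwen3.5:27b
print(pick_model("debug failing deploy", steps=2))  # → claude-opus-4-5
```

A real router would also consider privacy constraints and cost budgets, but even a heuristic this crude captures the "local for routine, API for hard" split described above.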

Multi-Model Local Setup

Run different models for different purposes:

# Have multiple models available
ollama pull qwen3.5:27b        # Primary — strongest benchmarks, 262K context, tool calling works
ollama pull qwen3.5:35b-a3b   # Speed option — 112 tok/s on 3090
ollama pull deepseek-r1:32b    # Complex reasoning
ollama pull qwen2.5-coder:32b  # Skill development

OpenClaw skills can specify which model to use. A coding skill might route to Qwen Coder while a planning skill routes to DeepSeek-R1.

The “Society of Minds” Pattern

Wes Roth’s approach — multiple models collaborating:

  1. Orchestrator (Claude/GPT-4/Qwen 3.5 27B): Manages overall task flow
  2. Specialists (Gemini, Coder, etc.): Handle domain-specific queries
  3. Workers (smaller models): Execute simple sub-tasks cheaply

This requires custom skill development but produces better results than any single model.


Realistic Expectations

What Local Models Handle Well

  • Simple automation (file management, scheduling)
  • Straightforward coding tasks
  • Single-step API calls
  • Structured data extraction
  • Routine inbox triage

What Local Models Struggle With

  • Novel problem-solving (the restaurant phone call story)
  • Very long task chains (10+ step workflows)
  • Ambiguous instructions requiring inference
  • Tasks requiring broad world knowledge
  • Self-improvement and capability expansion

The Hardware Reality

The power users getting impressive results mostly run:

  • Claude Opus 4.5 ($15/M input, $75/M output) for complex tasks
  • Local models for cost optimization and always-on availability
  • Multiple API backends for specialized capabilities

Pure local-only is possible but requires:

  • 24GB+ VRAM minimum for reliable agent work
  • Acceptance of capability limitations vs frontier APIs
  • Willingness to retry failed tasks

Bottom Line

If you have 24GB+ VRAM: Qwen 3.5 27B (~17GB Q4) is the top pick — SWE-bench 72.4, 262K native context, tool calling fixed in Ollama v0.17.6+. For speed, the 35B-A3B MoE variant hits 112 tok/s on an RTX 3090. DeepSeek-R1-Distill-32B for complex planning tasks.

If you have 12-16GB VRAM: Run DeepSeek-R1-Distill-Qwen-14B or Qwen 3 14B. Expect limitations on complex multi-step tasks. Consider hybrid local + API.

If you have 8GB VRAM: Qwen 3.5 9B (~6.6GB Q4) is the recommended backbone — beats GPT-OSS-120B on GPQA Diamond, tool calling works in Ollama v0.17.6+. Still limited for complex agent chains. Route hard tasks to an API.

The honest take: The most impressive OpenClaw demos run on Claude Opus via API, not local models. If you want that level of capability locally, budget for serious hardware (48GB+ VRAM) and accept you’re still behind the frontier. If you want cost optimization and privacy, local models work for routine agent tasks on 24GB+ VRAM.

# Quickest setup — auto-configures everything
ollama launch openclaw

# Or manual setup
ollama pull qwen3.5:27b
ollama run qwen3.5:27b

# Configure OpenClaw to use it
# In OpenClaw setup: select Ollama, model: qwen3.5:27b

Local agents are real and useful. They’re not magic — the model determines what they can do, and bigger models do more. Qwen 3.5 is a real step up (262K context, vision, stronger benchmarks at smaller sizes), and the Ollama tool-calling pipeline works correctly as of v0.17.6. Update Ollama before you start: curl -fsSL https://ollama.com/install.sh | sh.

Updated March 2026 for Ollama v0.17.7, Qwen 3.5 tool calling fix, and OpenClaw v2026.3.2.