
Your API bill is too high. You already know this. What you probably don’t know is where the waste is hiding, and it’s almost never the obvious stuff.

This guide walks you through finding the leaks and plugging them. Fifteen minutes, a Python logger, and some configuration changes.


Step 1: Check Your Dashboard (5 Minutes)

Before touching any code, look at what you’re actually spending.

Pull up your usage page: Anthropic | OpenAI | DeepSeek

Look at the last 7 days and answer three questions:

  1. Which model is eating the most money? If 80% of your spend is on Opus but only 20% of your tasks need it, that’s your single biggest fix.
  2. What’s your average input size? If it’s over 2,000 tokens and you’re not using prompt caching, stop reading and enable caching right now. Come back after.
  3. Are there any spending spikes? A sudden jump usually means a runaway loop, an agent retrying, or conversation history that ballooned overnight.

This alone tells most people where 60% of the problem lives.


Step 2: Find the Four Hidden Drains

These are the costs that don’t show up in your prompt but absolutely show up on your bill.

System Prompts: A Tax on Every Request

Your system prompt gets re-sent as input tokens on every API call. It’s not “remembered” between calls.

A 2,000-token system prompt across 1,000 calls/day:

  • 2M extra input tokens/day
  • At Sonnet rates ($3/MTok): $6/day = $180/month
  • At Opus 4 rates ($15/MTok): $30/day = $900/month

That’s before your actual questions even start.

Cut your system prompt to the minimum that works. A 500-token prompt instead of 2,000 saves $135/month at Sonnet rates. Or use prompt caching to drop that cost by 90%.
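The math above is worth wrapping in a helper so you can price your own prompt. This is pure arithmetic, no API calls; the rates are the ones quoted above:

```python
def prompt_tax_per_month(prompt_tokens: int, calls_per_day: int,
                         input_price_per_mtok: float, days: int = 30) -> float:
    """Monthly cost of re-sending a static system prompt as input tokens."""
    daily = prompt_tokens * calls_per_day * input_price_per_mtok / 1_000_000
    return daily * days

# 2,000-token prompt, 1,000 calls/day, Sonnet input rate ($3/MTok)
print(prompt_tax_per_month(2000, 1000, 3.0))  # 180.0
# Trimmed to 500 tokens: 45.0 -> saves $135/month
print(prompt_tax_per_month(500, 1000, 3.0))
```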

Conversation History: The Snowball

Multi-turn conversations resend every previous message as input on every call. Turn 1 sends your message. Turn 10 sends all 10 messages. Turn 20 sends all 20.

A 20-turn conversation with 500 tokens per turn doesn’t cost 10,000 tokens. It costs ~105,000 tokens (the triangular sum). That’s 10x what most people expect.

Summarize older turns instead of sending full text. Or just start fresh conversations more often.
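That triangular sum is easy to sanity-check:

```python
def conversation_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens across a conversation where every call resends
    the full history: turn k sends k * tokens_per_turn, so the total is
    the triangular number turns * (turns + 1) / 2 times tokens_per_turn."""
    return tokens_per_turn * turns * (turns + 1) // 2

print(conversation_input_tokens(20, 500))  # 105000, not 10000
```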

Tool Definitions: Hidden Padding

If you’re using function calling, every API call includes hidden tokens for tool infrastructure — even when no tools fire.

Anthropic documents this:

Tool Configuration             Hidden Tokens Per Request
auto or none tool choice       346 tokens
any or specific tool choice    313 tokens
Each tool definition (avg)     ~150 tokens
Bash tool                      245 tokens
Text editor tool               700 tokens
Computer use                   466-499 tokens + 735 per tool

An agent with 5 tools adds ~1,100 tokens to every request (346 base + 750 for definitions). Invisible in your prompt, very real in your bill.

OpenAI’s function calling has similar overhead — tool schemas are serialized into the system prompt.

Strip out tools you don’t need on each call. A request that just needs a text response doesn’t need 5 tool definitions attached.
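One way to do that is a per-request lookup instead of a fixed tool list. A minimal sketch – the tool definitions and task labels here are hypothetical, not from any SDK:

```python
# Hypothetical registry; entries follow the Anthropic tool-definition shape.
TOOL_REGISTRY = {
    "search": {"name": "search", "description": "Web search",
               "input_schema": {"type": "object", "properties": {}}},
    "calculator": {"name": "calculator", "description": "Arithmetic",
                   "input_schema": {"type": "object", "properties": {}}},
}

def tools_for(task_type: str) -> list:
    """Return only the definitions this task type needs; plain text
    responses get an empty list, so no tool overhead is billed."""
    needed = {"research": ["search"], "math": ["calculator"]}.get(task_type, [])
    return [TOOL_REGISTRY[name] for name in needed]

# Pass tools=tools_for(task_type) instead of the full registry.
print(tools_for("chat"))  # []
```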

OpenAI Reasoning Tokens: The Invisible Output Tax

This one is specific to OpenAI’s o-series models (o1, o3, o3-mini, o4-mini) and it’s easy to miss entirely.

These models generate internal “reasoning tokens,” a chain-of-thought used to work through the problem. You pay for them as output tokens (the most expensive type), but they never appear in the response, and you can’t cap them separately – max_completion_tokens limits visible and reasoning output together.

A request that returns 500 visible tokens might consume 2,000+ total output tokens. At o1 rates ($60/MTok output), that’s the difference between $0.03 and $0.12 per request.

Check completion_tokens_details.reasoning_tokens in every o-series response. The reasoning.effort parameter (low/medium/high) gives you partial control. Set it to low for simple tasks.
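A small helper makes the hidden output visible; shown here against a stub shaped like an o-series usage object rather than a live API response:

```python
from types import SimpleNamespace

def reasoning_overhead(usage) -> dict:
    """Split an OpenAI usage object into visible vs hidden output tokens."""
    details = getattr(usage, "completion_tokens_details", None)
    hidden = getattr(details, "reasoning_tokens", 0) if details else 0
    return {"visible": usage.completion_tokens - hidden, "reasoning": hidden}

# Stub standing in for response.usage on an o-series call
u = SimpleNamespace(
    completion_tokens=2000,
    completion_tokens_details=SimpleNamespace(reasoning_tokens=1500),
)
print(reasoning_overhead(u))  # {'visible': 500, 'reasoning': 1500}
```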

Image Tokens: Add Up Faster Than You Think

Provider    1024x1024 Image    Cost at Mid-Tier
Claude      ~1,398 tokens      $0.004 (Sonnet)
GPT-4o      ~765 tokens        $0.002
Gemini      ~1,290 tokens      $0.0004 (2.0 Flash)

Tiny per image. But 1,000 screenshots/day adds $4/day on Claude before any text.


Step 3: Add the Logger (10 Minutes)

Stop guessing. Drop this into your codebase and log every call with its actual cost.

import json
from datetime import datetime, timezone

# Pricing per million tokens (update when prices change)
PRICING = {
    # Anthropic
    "claude-opus-4": {"input": 15.0, "output": 75.0,
                      "cache_read": 1.50, "cache_write": 18.75},
    "claude-sonnet-4": {"input": 3.0, "output": 15.0,
                        "cache_read": 0.30, "cache_write": 3.75},
    "claude-haiku-3.5": {"input": 0.80, "output": 4.0,
                         "cache_read": 0.08, "cache_write": 1.00},
    # OpenAI
    "gpt-4o": {"input": 2.5, "output": 10.0, "cached_input": 1.25},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60, "cached_input": 0.075},
    "o3-mini": {"input": 1.10, "output": 4.40, "cached_input": 0.55},
    # DeepSeek
    "deepseek-chat": {"input": 0.27, "output": 1.10, "cached_input": 0.07},
    "deepseek-reasoner": {"input": 0.55, "output": 2.19, "cached_input": 0.14},
}

def log_anthropic(response, model, label=""):
    """Log an Anthropic API response. Returns the log entry dict
    (cost in USD under 'cost_usd')."""
    u = response.usage
    p = PRICING.get(model, {})
    cache_write = getattr(u, "cache_creation_input_tokens", 0)
    cache_read = getattr(u, "cache_read_input_tokens", 0)

    cost = (
        u.input_tokens * p.get("input", 0)
        + cache_write * p.get("cache_write", 0)
        + cache_read * p.get("cache_read", 0)
        + u.output_tokens * p.get("output", 0)
    ) / 1_000_000

    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "provider": "anthropic", "model": model, "label": label,
        "input": u.input_tokens, "cache_write": cache_write,
        "cache_read": cache_read, "output": u.output_tokens,
        "cost_usd": round(cost, 6),
    }
    _append(entry)
    return entry

def log_openai(response, model, label=""):
    """Log an OpenAI API response. Returns the log entry dict
    (cost in USD under 'cost_usd')."""
    u = response.usage
    p = PRICING.get(model, {})
    cached = 0
    reasoning = 0
    if u.prompt_tokens_details:
        cached = getattr(u.prompt_tokens_details, "cached_tokens", 0)
    if u.completion_tokens_details:
        reasoning = getattr(u.completion_tokens_details,
                            "reasoning_tokens", 0)

    uncached_input = u.prompt_tokens - cached
    cost = (
        uncached_input * p.get("input", 0)
        + cached * p.get("cached_input", p.get("input", 0))
        + u.completion_tokens * p.get("output", 0)
    ) / 1_000_000

    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "provider": "openai", "model": model, "label": label,
        "input": u.prompt_tokens, "cached": cached,
        "output": u.completion_tokens, "reasoning": reasoning,
        "cost_usd": round(cost, 6),
    }
    _append(entry)
    return entry

LOG_FILE = "token_log.jsonl"

def _append(entry):
    with open(LOG_FILE, "a") as f:
        f.write(json.dumps(entry) + "\n")

def daily_summary():
    """Print cost summary grouped by model."""
    totals = {}
    with open(LOG_FILE) as f:
        for line in f:
            e = json.loads(line)
            model = e["model"]
            totals.setdefault(model, {"calls": 0, "cost": 0.0})
            totals[model]["calls"] += 1
            totals[model]["cost"] += e["cost_usd"]

    print(f"{'Model':<25} {'Calls':>6} {'Cost':>10}")
    print("-" * 43)
    grand = 0
    for model, data in sorted(totals.items(), key=lambda x: -x[1]["cost"]):
        print(f"{model:<25} {data['calls']:>6} ${data['cost']:>9.4f}")
        grand += data["cost"]
    print("-" * 43)
    print(f"{'TOTAL':<25} {'':>6} ${grand:>9.4f}")

How to Use It

import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain quicksort"}],
)
log_anthropic(response, "claude-sonnet-4", label="quicksort-explainer")

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain quicksort"}],
)
log_openai(response, "gpt-4o", label="quicksort-explainer")

Run daily_summary() after a week. The output shows exactly where your money went.


Step 4: Enable Prompt Caching

If you’re sending the same system prompt, tool definitions, or context on repeated calls, caching drops those tokens to 10% of base cost. Nothing else in this guide saves you as much for as little work.

Anthropic

Mark static content with cache_control:

response = client.messages.create(
    model="claude-sonnet-4",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "Your long system prompt here...",
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Your question"}],
)

First call writes the cache at 1.25x input rate. Every call within 5 minutes reads at 0.1x – a 90% discount.

Real numbers with a 10,000-token system prompt at Sonnet rates:

                              Without Caching    With Caching (after first call)
System prompt cost per call   $0.030             $0.003
1,000 calls/day               $30.00             $3.00 + $0.0375 cache write
Monthly                       $900               $91

OpenAI

Automatic for prompts over 1,024 tokens. No code changes. Cached tokens bill at 50% (less aggressive than Anthropic’s 90%, but free to enable).

DeepSeek

Also automatic. Repeated prefixes hit cache pricing server-side. V3 drops from $0.27 to $0.07/MTok on cache hits – 74% off without any configuration.


Step 5: Route Tasks to Cheaper Models

Running everything through one model is the most common waste pattern. Different tasks need different horsepower.

Task Type                Model                  Cost per 1K Calls (avg 500-token response)
Classification, routing  DeepSeek V3 (cached)   $0.04
Simple Q&A, formatting   Haiku 3.5              $2.40
Standard generation      Sonnet 4               $9.00
Complex reasoning        Opus 4.5/4.6           $14.50
Maximum capability       Opus 4                 $45.00

The cheapest frontier-class token (DeepSeek V3 cached) costs $0.07/MTok. The most expensive (Opus 4 output) costs $75/MTok. That’s a 1,071x spread. If you’re using one model for everything, you’re overpaying for simple tasks or underperforming on hard ones.
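A routing layer can be as small as a dict lookup. A sketch – the task labels are illustrative, and the model IDs follow the pricing table used throughout this guide:

```python
# Cheapest model per task tier; fall back to the mid-tier model,
# never the most expensive one.
ROUTES = {
    "classification": "deepseek-chat",
    "formatting": "claude-haiku-3.5",
    "generation": "claude-sonnet-4",
    "reasoning": "claude-opus-4",
}

def pick_model(task_type: str) -> str:
    """Route each task to the cheapest model that can handle it."""
    return ROUTES.get(task_type, "claude-sonnet-4")

print(pick_model("classification"))  # deepseek-chat
```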


Step 6: Read the Usage Object

Every API response tells you exactly what you were charged. Here’s how to read it.

Claude (Anthropic)

{
  "usage": {
    "input_tokens": 50,
    "cache_creation_input_tokens": 1500,
    "cache_read_input_tokens": 18000,
    "output_tokens": 393
  }
}

The gotcha: input_tokens is NOT your total input. It’s only the tokens after the last cache breakpoint. Total input = input_tokens + cache_creation_input_tokens + cache_read_input_tokens. If you’re caching and only tracking input_tokens, you’re undercounting.
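A small helper does the sum in one place; demonstrated against a stub shaped like the usage object above:

```python
from types import SimpleNamespace

def total_input_tokens(usage) -> int:
    """Sum all three input buckets; input_tokens alone undercounts
    whenever caching is active."""
    return (usage.input_tokens
            + (getattr(usage, "cache_creation_input_tokens", 0) or 0)
            + (getattr(usage, "cache_read_input_tokens", 0) or 0))

# Stub matching the example usage object above
u = SimpleNamespace(input_tokens=50, cache_creation_input_tokens=1500,
                    cache_read_input_tokens=18000)
print(total_input_tokens(u))  # 19550
```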

OpenAI

{
  "usage": {
    "prompt_tokens": 1250,
    "completion_tokens": 500,
    "total_tokens": 1750,
    "prompt_tokens_details": {
      "cached_tokens": 1000
    },
    "completion_tokens_details": {
      "reasoning_tokens": 200
    }
  }
}

Watch reasoning_tokens on o-series models. That’s invisible output you’re paying full price for. For streaming, add stream_options: {"include_usage": true} or you won’t get the usage object at all.

DeepSeek

Follows OpenAI’s format. Caching is server-side and automatic – check the dashboard for cache hit rates.


Current Pricing Reference

Prices as of early 2026, per million tokens.

Claude (Anthropic)

Model          Input     Output    Cache Read    Cache Write (5min)
Opus 4         $15.00    $75.00    $1.50         $18.75
Opus 4.5/4.6   $5.00     $25.00    $0.50         $6.25
Sonnet 4/4.5   $3.00     $15.00    $0.30         $3.75
Haiku 3.5      $0.80     $4.00     $0.08         $1.00
Haiku 4.5      $1.00     $5.00     $0.10         $1.25

Batch API: 50% off everything. Long context (>200K tokens): input doubles, output goes to 1.5x.

OpenAI

Model        Input     Cached Input    Output
GPT-4o       $2.50     $1.25           $10.00
GPT-4o-mini  $0.15     $0.075          $0.60
GPT-4.1      $2.00     $0.50           $8.00
o3           $2.00     $0.50           $8.00
o3-mini      $1.10     $0.55           $4.40
o1           $15.00    $7.50           $60.00

Batch API: 50% off all models, results within 24 hours.

DeepSeek

Model         Input (Cache Hit)    Input (Cache Miss)    Output
DeepSeek-V3   $0.07                $0.27                 $1.10
DeepSeek-R1   $0.14                $0.55                 $2.19

Off-peak discounts (16:30-00:30 GMT): up to 75% off R1, 50% off V3.

Google Gemini

Model             Input    Output    Context Window
Gemini 2.5 Pro    $1.25    $10.00    1M
Gemini 2.5 Flash  $0.30    $2.50     1M
Gemini 2.0 Flash  $0.10    $0.40     1M

When Local Models Win on Cost

At high enough volume, running your own hardware wipes out per-token costs.

Daily Volume   API Cost (Sonnet)   API Cost (DeepSeek V3)   Local Cost (RTX 3090)
100K tokens    $0.90               $0.07                    ~$0.15 (electricity)
1M tokens      $9.00               $0.68                    ~$0.15
10M tokens     $90.00              $6.80                    ~$0.30
100M tokens    $900.00             $68.00                   ~$0.50

At 1M tokens/day, DeepSeek is cheaper than your own hardware once you factor in the cost of the card. At 10M/day, local starts winning. At 100M/day, local is 136x cheaper than even DeepSeek – and roughly 1,800x cheaper than Sonnet.

Break-even for a $750 used RTX 3090 setup: about 3 months at 1M tokens/day vs Sonnet, or roughly 4 months at 10M/day vs DeepSeek.
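The break-even arithmetic, using the figures from the table above (a simplification – it ignores resale value and assumes constant daily volume):

```python
def breakeven_days(hardware_cost: float, api_cost_per_day: float,
                   local_cost_per_day: float) -> float:
    """Days until the hardware pays for itself out of daily savings."""
    return hardware_cost / (api_cost_per_day - local_cost_per_day)

# $750 used RTX 3090 vs Sonnet at 1M tokens/day ($9.00 API, ~$0.15 power)
print(round(breakeven_days(750, 9.00, 0.15)))  # 85 days, about 3 months
```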

The tradeoff is quality. Local models (Qwen 3 32B, Llama 3.3 70B) are good but not Sonnet-good for complex tasks. For classification, formatting, extraction, and embeddings, they match API quality at zero marginal cost. Full hardware and electricity math in our cost to run LLMs locally guide.


Quick-Reference: Cuts Ranked by Effort

What to Do                       Effort                        Typical Savings
Enable prompt caching            Low (one field)               50-90% on repeated content
Use batch API                    Low (change endpoint)         50% on async work
Trim system prompt               Medium (rewrite it)           20-40% on input costs
Route tasks to cheaper models    Medium (add routing)          40-70% overall
Truncate conversation history    Medium (add summarization)    30-60% on multi-turn
Remove unused tool definitions   Low (delete lines)            5-15% on tool-heavy calls
Set reasoning effort to low      Low (one parameter)           30-50% on o-series output
Use DeepSeek for simple tasks    Low (swap model)              80-95% vs Claude/OpenAI

What to Do Right Now

The difference between a $500/month AI bill and a $50/month bill usually isn’t using less AI. It’s not paying for tokens you didn’t need to send.

Check your dashboard. Add the logger. Enable caching. Route cheap tasks to cheap models. Most people who do all four cut their bill in half within a month.