📚 More on this topic: Tiered Model Strategy · Cost to Run LLMs Locally · OpenClaw Token Optimization · Local LLMs vs ChatGPT

You’re spending more on AI APIs than you think. Every provider gives you the tools to see exactly where your tokens go, but almost nobody actually checks. The result: system prompts bleeding tokens on every call, conversation history growing unchecked, tool definitions adding hundreds of hidden tokens, and OpenAI’s reasoning models burning output tokens on invisible “thinking” you never see.

This guide shows you how to audit your actual spending, where the hidden costs are, and how to build a simple tracker that catches waste before it compounds.


The Real Pricing: What Every Token Costs

Pricing changes frequently. These are the rates as of early 2026, per million tokens.

Claude (Anthropic)

Model        | Input  | Output | Cache Read | Cache Write (5 min)
-------------|--------|--------|------------|--------------------
Opus 4       | $15.00 | $75.00 | $1.50      | $18.75
Opus 4.5/4.6 | $5.00  | $25.00 | $0.50      | $6.25
Sonnet 4/4.5 | $3.00  | $15.00 | $0.30      | $3.75
Haiku 3.5    | $0.80  | $4.00  | $0.08      | $1.00
Haiku 4.5    | $1.00  | $5.00  | $0.10      | $1.25

Batch API: 50% off everything. Long context (>200K tokens): input doubles, output goes to 1.5x.

OpenAI

Model       | Input  | Cached Input | Output
------------|--------|--------------|-------
GPT-4o      | $2.50  | $1.25        | $10.00
GPT-4o-mini | $0.15  | $0.075       | $0.60
GPT-4.1     | $2.00  | $0.50        | $8.00
o3          | $2.00  | $0.50        | $8.00
o3-mini     | $1.10  | $0.55        | $4.40
o1          | $15.00 | $7.50        | $60.00

Batch API: 50% off all models, results within 24 hours.

DeepSeek

Model       | Input (Cache Hit) | Input (Cache Miss) | Output
------------|-------------------|--------------------|-------
DeepSeek-V3 | $0.07             | $0.27              | $1.10
DeepSeek-R1 | $0.14             | $0.55              | $2.19

Off-peak discounts (16:30-00:30 GMT): up to 75% off R1, 50% off V3. Caching is automatic: repeated prefixes get cache-hit pricing without any configuration.

Google Gemini

Model            | Input | Output | Context Window
-----------------|-------|--------|---------------
Gemini 2.5 Pro   | $1.25 | $10.00 | 1M
Gemini 2.5 Flash | $0.30 | $2.50  | 1M
Gemini 2.0 Flash | $0.10 | $0.40  | 1M

The Price Spread Is Absurd

The cheapest way to call a frontier-class model (DeepSeek V3, cached input) costs $0.07 per million tokens. The most expensive (Claude Opus 4 output) costs $75.00 per million tokens. That’s a 1,071x difference.

If you’re using one model for everything, you’re either overpaying for simple tasks or underperforming on complex ones. Neither is good. See our tiered model strategy guide for how to route tasks to the right model.


Where Your Tokens Actually Go

Most people think of token cost as “my prompt + the response.” The reality is more complicated. Here’s what actually gets billed on every API call.

System Prompts: The Tax on Every Request

Your system prompt gets sent as input tokens on every single API call. It doesn’t get “remembered” between calls; it’s re-transmitted every time.

A 2,000-token system prompt doesn’t sound like much. But across 1,000 API calls per day:

  • 2M extra input tokens/day
  • At Sonnet rates ($3/MTok): $6/day = $180/month
  • At Opus 4 rates ($15/MTok): $30/day = $900/month

Just from the system prompt. Before your actual questions even start.

Fix: Cut your system prompt to the minimum needed. A 500-token system prompt instead of 2,000 saves $135/month at Sonnet rates. Better yet, use prompt caching to drop that cost by 90%.
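
Here’s the arithmetic as a quick sketch you can adapt. The prompt size, call volume, and rate below are the example numbers above; plug in your own:

# Rough monthly cost of re-sending a static system prompt (illustrative numbers)
def system_prompt_monthly_cost(prompt_tokens, calls_per_day, input_price_per_mtok):
    daily_cost = prompt_tokens * calls_per_day * input_price_per_mtok / 1_000_000
    return daily_cost * 30

print(system_prompt_monthly_cost(2000, 1000, 3.00))  # ~$180/month at Sonnet rates
print(system_prompt_monthly_cost(500, 1000, 3.00))   # ~$45/month after trimming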

Conversation History: The Growing Snowball

In a multi-turn conversation, every previous message gets resent as input on every call. Turn 1 sends your message. Turn 2 sends turns 1 + 2. Turn 20 sends all 20 turns.

The math is ugly. A 20-turn conversation with 500 tokens per turn:

  • Turn 1: 500 input tokens
  • Turn 10: 5,000 input tokens
  • Turn 20: 10,000 input tokens
  • Total across all 20 turns: ~105,000 input tokens (the triangular sum)

That’s 10x more than the “20 turns × 500 tokens = 10,000 tokens” you might expect.

Fix: Summarize or truncate conversation history. Send a summary of older turns instead of the full text. Or start fresh conversations more often.
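
Here’s a minimal truncation sketch. The keep_last count is a knob to tune, and summarize() is a placeholder for whatever you use to compress old turns (a cheap-model call, a heuristic, anything):

def trim_history(messages, keep_last=6, summarize=None):
    """Keep the most recent turns verbatim; optionally replace older ones with a summary."""
    if len(messages) <= keep_last:
        return messages
    old, recent = messages[:-keep_last], messages[-keep_last:]
    if summarize is None:
        return recent  # plain truncation
    summary = summarize(old)  # returns a short string describing the earlier turns
    return [{"role": "user", "content": f"Summary of the earlier conversation: {summary}"}] + recent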

Tool Definitions: Hidden Token Overhead

If you’re using function calling or tool use, every API call includes hidden tokens for the tool infrastructure, even if no tools are invoked on that call.

Anthropic documents this precisely:

Tool Configuration          | Hidden Tokens Per Request
----------------------------|------------------------------
auto or none tool choice    | 346 tokens
any or specific tool choice | 313 tokens
Each tool definition (avg)  | ~150 tokens
Bash tool                   | 245 tokens
Text editor tool            | 700 tokens
Computer use                | 466-499 tokens + 735 per tool

An agent with 5 tools averaging 150 tokens each adds ~1,100 tokens per request (346 base + 750 for definitions). That’s invisible in your prompt but very real in your bill.

OpenAI’s function calling has similar overhead: tool schemas are serialized into the system prompt.

Fix: Only define tools you’ll actually use. Remove tools from API calls where they’re not needed. A call that just needs a text response doesn’t need 5 tool definitions attached.
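
One way to do that is to make the tools parameter conditional per call. A sketch with the Anthropic SDK (get_weather is a stand-in for your real tool definitions, and needs_tools is whatever routing check you already have):

ALL_TOOLS = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "input_schema": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
    # ...the rest of your tool definitions
]

def ask(client, prompt, needs_tools=False):
    kwargs = {
        "model": "claude-sonnet-4",
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": prompt}],
    }
    if needs_tools:
        kwargs["tools"] = ALL_TOOLS  # only pay the tool-token overhead when tools can be used
    return client.messages.create(**kwargs)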

OpenAI Reasoning Tokens: The Invisible Output Tax

This one is specific to OpenAI’s o-series models (o1, o3, o3-mini, o4-mini) and it’s the most deceptive cost on any AI platform.

When you call an o-series model, it generates internal “reasoning tokens”, a chain-of-thought it uses to work through the problem. These tokens are:

  • Billed as output tokens (the most expensive token type)
  • Not visible in the API response content
  • Not capped: you can’t enforce a strict limit

A request that returns a 500-token visible response might actually consume 2,000+ total output tokens. At o1 rates ($60/MTok output), that’s the difference between $0.03 and $0.12 per request.

Check the completion_tokens_details.reasoning_tokens field in OpenAI responses. If you’re not tracking this, you’re flying blind on o-series costs.

The reasoning.effort parameter (low/medium/high) offers partial control. Set it to low for simple tasks to reduce reasoning token burn.
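
A sketch of both moves with the Chat Completions endpoint (in the Python SDK the parameter is reasoning_effort; the prompt here is just an example):

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="low",  # low / medium / high; lower effort burns fewer hidden output tokens
    messages=[{"role": "user", "content": "Classify this ticket as bug or feature request: app crashes on login"}],
)

usage = response.usage
reasoning = usage.completion_tokens_details.reasoning_tokens  # hidden, billed as output
visible = usage.completion_tokens - reasoning
print(f"visible output: {visible} tokens, hidden reasoning: {reasoning} tokens")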

Image Tokens: More Than You’d Think

If you’re using vision capabilities, images are converted to tokens:

Provider | 1024×1024 Image | Cost at Mid-Tier Pricing
---------|-----------------|-------------------------
Claude   | ~1,398 tokens   | $0.004 (Sonnet)
GPT-4o   | ~765 tokens     | $0.002
Gemini   | ~1,290 tokens   | $0.0004 (2.5 Flash)

Seems tiny per image. But process 1,000 screenshots a day and you’re adding $4/day in input tokens on Claude alone, before any text prompt.
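
If you want to see this in your own usage data, send an image and check the usage block; the image arrives as ordinary input tokens. A sketch with the Anthropic SDK (screenshot.png is a placeholder path):

import base64
import anthropic

client = anthropic.Anthropic()
with open("screenshot.png", "rb") as f:
    image_b64 = base64.standard_b64encode(f.read()).decode()

response = client.messages.create(
    model="claude-sonnet-4",
    max_tokens=512,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": image_b64}},
            {"type": "text", "text": "Describe this screenshot."},
        ],
    }],
)
print(response.usage.input_tokens)  # the image is billed here, alongside your text prompt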


How to Read Your Token Usage

Every major API returns a usage object in every response. Here’s how to read them.

Claude (Anthropic)

{
  "usage": {
    "input_tokens": 50,
    "cache_creation_input_tokens": 1500,
    "cache_read_input_tokens": 18000,
    "output_tokens": 393
  }
}
  • input_tokens: Tokens after the last cache breakpoint (not total input)
  • cache_creation_input_tokens: Tokens written to cache on this request (billed at 1.25x input)
  • cache_read_input_tokens: Tokens read from cache (billed at 0.1x input)
  • output_tokens: Tokens generated in the response

Total input = input_tokens + cache_creation_input_tokens + cache_read_input_tokens

The key insight: input_tokens is NOT your total input. If you’re using caching and only tracking input_tokens, you’re undercounting by the size of your cached content.
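
In code, the correct total is the sum of all three input fields (a sketch; the getattr fallbacks cover responses where caching wasn’t used):

u = response.usage  # an Anthropic Messages API response
total_input = (
    u.input_tokens
    + (getattr(u, "cache_creation_input_tokens", 0) or 0)
    + (getattr(u, "cache_read_input_tokens", 0) or 0)
)
print(f"billable input tokens across all rates: {total_input}")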

OpenAI

{
  "usage": {
    "prompt_tokens": 1250,
    "completion_tokens": 500,
    "total_tokens": 1750,
    "prompt_tokens_details": {
      "cached_tokens": 1000
    },
    "completion_tokens_details": {
      "reasoning_tokens": 200
    }
  }
}
  • prompt_tokens: Total input tokens (including cached)
  • completion_tokens: Total output tokens (including reasoning)
  • cached_tokens: How many input tokens came from cache (billed at 50% rate)
  • reasoning_tokens: Hidden thinking tokens (o-series only; billed as output but invisible in the response)

For streaming, add stream_options: {"include_usage": true} or you won’t get the usage object at all.
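
A streaming sketch with that flag set; the usage object arrives on the final chunk, after the content:

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain quicksort"}],
    stream=True,
    stream_options={"include_usage": True},  # without this, streamed responses never report usage
)

usage = None
for chunk in stream:
    for choice in chunk.choices:
        if choice.delta.content:
            print(choice.delta.content, end="")
    if chunk.usage is not None:  # only the final chunk carries the usage object
        usage = chunk.usage
print("\n", usage)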

DeepSeek

DeepSeek follows OpenAI’s format with prompt_tokens and completion_tokens. Caching is automatic and server-side; responses also report prompt_cache_hit_tokens and prompt_cache_miss_tokens in the usage object, and the dashboard shows your overall cache hit rate.


Build a Token Logger in Under 100 Lines of Python

Stop guessing. Log every API call with its token count and cost. Here’s a practical logger that works with both Anthropic and OpenAI SDKs.

import json
from datetime import datetime, timezone

# Pricing per million tokens (update when prices change)
PRICING = {
    # Anthropic
    "claude-opus-4": {"input": 15.0, "output": 75.0,
                      "cache_read": 1.50, "cache_write": 18.75},
    "claude-sonnet-4": {"input": 3.0, "output": 15.0,
                        "cache_read": 0.30, "cache_write": 3.75},
    "claude-haiku-3.5": {"input": 0.80, "output": 4.0,
                         "cache_read": 0.08, "cache_write": 1.00},
    # OpenAI
    "gpt-4o": {"input": 2.5, "output": 10.0, "cached_input": 1.25},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60, "cached_input": 0.075},
    "o3-mini": {"input": 1.10, "output": 4.40, "cached_input": 0.55},
    # DeepSeek
    "deepseek-chat": {"input": 0.27, "output": 1.10, "cached_input": 0.07},
    "deepseek-reasoner": {"input": 0.55, "output": 2.19, "cached_input": 0.14},
}

def log_anthropic(response, model, label=""):
    """Log an Anthropic API response. Returns the log entry dict (cost in USD included)."""
    u = response.usage
    p = PRICING.get(model, {})
    cache_write = getattr(u, "cache_creation_input_tokens", 0) or 0  # may be None when caching is unused
    cache_read = getattr(u, "cache_read_input_tokens", 0) or 0

    cost = (
        u.input_tokens * p.get("input", 0)
        + cache_write * p.get("cache_write", 0)
        + cache_read * p.get("cache_read", 0)
        + u.output_tokens * p.get("output", 0)
    ) / 1_000_000

    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "provider": "anthropic", "model": model, "label": label,
        "input": u.input_tokens, "cache_write": cache_write,
        "cache_read": cache_read, "output": u.output_tokens,
        "cost_usd": round(cost, 6),
    }
    _append(entry)
    return entry

def log_openai(response, model, label=""):
    """Log an OpenAI API response. Returns the log entry dict (cost in USD included)."""
    u = response.usage
    p = PRICING.get(model, {})
    ptd = getattr(u, "prompt_tokens_details", None)
    ctd = getattr(u, "completion_tokens_details", None)
    cached = (getattr(ptd, "cached_tokens", 0) or 0) if ptd else 0
    reasoning = (getattr(ctd, "reasoning_tokens", 0) or 0) if ctd else 0

    uncached_input = u.prompt_tokens - cached
    cost = (
        uncached_input * p.get("input", 0)
        + cached * p.get("cached_input", p.get("input", 0))
        + u.completion_tokens * p.get("output", 0)
    ) / 1_000_000

    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "provider": "openai", "model": model, "label": label,
        "input": u.prompt_tokens, "cached": cached,
        "output": u.completion_tokens, "reasoning": reasoning,
        "cost_usd": round(cost, 6),
    }
    _append(entry)
    return entry

LOG_FILE = "token_log.jsonl"

def _append(entry):
    with open(LOG_FILE, "a") as f:
        f.write(json.dumps(entry) + "\n")

def daily_summary():
    """Print cost summary grouped by model."""
    totals = {}
    with open(LOG_FILE) as f:
        for line in f:
            e = json.loads(line)
            model = e["model"]
            totals.setdefault(model, {"calls": 0, "cost": 0.0})
            totals[model]["calls"] += 1
            totals[model]["cost"] += e["cost_usd"]

    print(f"{'Model':<25} {'Calls':>6} {'Cost':>10}")
    print("-" * 43)
    grand = 0
    for model, data in sorted(totals.items(), key=lambda x: -x[1]["cost"]):
        print(f"{model:<25} {data['calls']:>6} ${data['cost']:>9.4f}")
        grand += data["cost"]
    print("-" * 43)
    print(f"{'TOTAL':<25} {'':>6} ${grand:>9.4f}")

Usage

import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain quicksort"}],
)
log_anthropic(response, "claude-sonnet-4", label="quicksort-explainer")

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain quicksort"}],
)
log_openai(response, "gpt-4o", label="quicksort-explainer")

Every call logs to token_log.jsonl. Run daily_summary() to see where your money went. Pipe it into a dashboard, cron it, or just check it when the bill surprises you.


Prompt Caching: The Biggest Single Savings

Prompt caching is the highest-leverage optimization available on any API. If you’re sending the same system prompt, tool definitions, or context documents repeatedly, caching cuts those tokens to as little as 10% of the base cost (the exact discount depends on the provider).

How It Works (Anthropic)

Mark static content with cache_control:

response = client.messages.create(
    model="claude-sonnet-4",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "Your long system prompt here...",
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Your question"}],
)

First call: cached content is written at 1.25x the input rate. Subsequent calls within 5 minutes: cached content is read at 0.1x the input rate, a 90% discount.

Real numbers with a 10,000-token system prompt at Sonnet rates:

                            | Without Caching | With Caching (after first call)
----------------------------|-----------------|--------------------------------
System prompt cost per call | $0.030          | $0.003
1,000 calls/day             | $30.00          | $3.00 + $0.0375 cache write
Monthly                     | $900            | $91

That’s a 90% reduction on your largest recurring cost.
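
To confirm the cache is actually being hit, check the usage fields on the second and later calls (a sketch, reusing the response from the example above):

u = response.usage
print("written to cache:", getattr(u, "cache_creation_input_tokens", 0) or 0)
print("read from cache: ", getattr(u, "cache_read_input_tokens", 0) or 0)
# A healthy cache hit shows cache_read_input_tokens roughly equal to your static prefix
# and a small input_tokens value covering only the new user message.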

How It Works (OpenAI)

OpenAI caching is automatic for prompts of 1,024 tokens or more. No code changes needed. Cached tokens are billed at 50% of the input rate, a smaller discount than Anthropic’s 90%, but you get it without any setup.

How It Works (DeepSeek)

Also automatic. Repeated prefixes get cache-hit pricing server-side. DeepSeek V3 drops from $0.27 to $0.07 per MTok on cache hits, a 74% reduction that happens without any configuration.


Cost Reduction Cheat Sheet

Strategy                            | Effort                     | Savings
------------------------------------|----------------------------|---------------------------
Enable prompt caching               | Low (add one field)        | 50-90% on repeated content
Use batch API                       | Low (change endpoint)      | 50% on non-real-time work
Trim system prompt                  | Medium (rewrite prompt)    | 20-40% on input costs
Model tiering                       | Medium (add routing logic) | 40-70% overall
Truncate conversation history       | Medium (add summarization) | 30-60% on multi-turn costs
Remove unused tool definitions      | Low (delete lines)         | 5-15% on tool-heavy calls
Set reasoning effort to low         | Low (add parameter)        | 30-50% on o-series output
Use DeepSeek for non-critical tasks | Low (change model)         | 80-95% vs Claude/OpenAI

The Tiered Model Approach

Don’t use one model for everything. Route tasks to the cheapest model that handles them well:

Task Type               | Model                | Cost per 1K Calls (avg 500-token response)
------------------------|----------------------|-------------------------------------------
Classification, routing | DeepSeek V3 (cached) | $0.04
Simple Q&A, formatting  | Haiku 3.5            | $2.40
Standard generation     | Sonnet 4             | $9.00
Complex reasoning       | Opus 4.5/4.6         | $14.50
Maximum capability      | Opus 4               | $45.00

One team cut monthly spend from $3,200 to $1,100 (a 66% reduction) by routing tasks across three tiers instead of running everything through Sonnet. See our tiered model strategy guide for the full breakdown.
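
The routing itself can start as a simple lookup. A minimal sketch (the task labels and the model chosen for each tier are illustrative; the names reuse the PRICING keys from the logger above so costs stay consistent):

# Map task categories to the cheapest model that handles them well (illustrative tiers).
MODEL_TIERS = {
    "classify": "deepseek-chat",    # classification, routing
    "simple": "claude-haiku-3.5",   # simple Q&A, formatting
    "standard": "claude-sonnet-4",  # standard generation
    "complex": "claude-opus-4",     # complex reasoning / maximum capability
}

def pick_model(task_type: str) -> str:
    # Unknown task types fall back to the middle tier rather than the most expensive one.
    return MODEL_TIERS.get(task_type, MODEL_TIERS["standard"])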


When Local Models Win on Cost

API pricing makes sense at low to moderate volume. But at scale, local models eliminate per-token costs entirely.

Daily Volume | API Cost (Sonnet) | API Cost (DeepSeek V3) | Local Cost (RTX 3090)
-------------|-------------------|------------------------|----------------------
100K tokens  | $0.90             | $0.07                  | ~$0.15 (electricity)
1M tokens    | $9.00             | $0.68                  | ~$0.15
10M tokens   | $90.00            | $6.80                  | ~$0.30
100M tokens  | $900.00           | $68.00                 | ~$0.50

At 1M tokens/day, DeepSeek V3 is still cheaper than buying and running your own hardware. At 10M tokens/day, local starts winning. At 100M tokens/day, local is roughly 1,800x cheaper than Sonnet and 136x cheaper than DeepSeek V3.

The break-even point for a $750 used RTX 3090 setup:

  • vs Sonnet: ~3 months at 1M tokens/day
  • vs DeepSeek V3: ~12-18 months at 10M tokens/day
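
The Sonnet figure is straightforward to reproduce (a sketch using the per-day numbers from the table above):

# One-time hardware cost divided by daily savings = break-even in days (illustrative numbers).
def breakeven_days(hardware_cost, api_cost_per_day, local_cost_per_day):
    return hardware_cost / (api_cost_per_day - local_cost_per_day)

print(breakeven_days(750, 9.00, 0.15))  # vs Sonnet at 1M tokens/day: ~85 days, about 3 months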

The tradeoff: local models (Qwen 3 32B, Llama 3.3 70B) are good, but not Sonnet-good for complex tasks. For simple tasks (classification, formatting, extraction, embeddings), local models match API quality at zero marginal cost.

See our cost to run LLMs locally guide for the full hardware and electricity breakdown.


Run Your First Audit

Here’s a 15-minute audit you can run right now:

1. Check your dashboard. Anthropic, OpenAI, DeepSeek: look at the last 7 days.

2. Find your top spending model. Is it the model you expected? If you’re spending 80% on Opus but only 20% of your tasks need it, that’s your biggest fix.

3. Check your average input size. If your average input is over 2,000 tokens and you’re not using caching, enable caching today. Literally today. It’s the single highest-ROI change.

4. Look for pattern waste. Are you making the same call repeatedly with the same system prompt? That’s a caching opportunity. Are you sending full conversation history in contexts that don’t need it? Truncate.

5. Add the logger. Drop the Python logger from this guide into your codebase. Run it for one week. The daily_summary() output will show you exactly where your money goes, and it’s usually not where you think.


The Bottom Line

AI API costs are predictable and controllable, provided you actually measure them. The usage object is in every API response. The tools exist. Most people just don’t look.

The biggest wins, in order:

  1. Enable prompt caching: 50-90% savings on repeated content, one line of code
  2. Use the right model for each task: tiered routing cuts 40-70% overall
  3. Trim your system prompt: every token saved multiplies across every call
  4. Log everything: you can’t optimize what you don’t measure
  5. Consider local for high-volume, lower-complexity tasks: zero marginal cost after hardware

The difference between a $500/month AI bill and a $50/month AI bill usually isn’t using less AI. It’s using the right AI for each task and not paying for tokens you didn’t need to send.