Token Audit Guide: Track What AI Actually Costs You
More on this topic: Tiered Model Strategy · Cost to Run LLMs Locally · OpenClaw Token Optimization · Local LLMs vs ChatGPT
You’re spending more on AI APIs than you think. Every provider gives you the tools to see exactly where your tokens go, but almost nobody actually checks. The result: system prompts bleeding tokens on every call, conversation history growing unchecked, tool definitions adding hundreds of hidden tokens, and OpenAI’s reasoning models burning output tokens on invisible “thinking” you never see.
This guide shows you how to audit your actual spending, where the hidden costs are, and how to build a simple tracker that catches waste before it compounds.
The Real Pricing: What Every Token Costs
Pricing changes frequently. These are the rates as of early 2026, per million tokens.
Claude (Anthropic)
| Model | Input | Output | Cache Read | Cache Write (5min) |
|---|---|---|---|---|
| Opus 4 | $15.00 | $75.00 | $1.50 | $18.75 |
| Opus 4.5/4.6 | $5.00 | $25.00 | $0.50 | $6.25 |
| Sonnet 4/4.5 | $3.00 | $15.00 | $0.30 | $3.75 |
| Haiku 3.5 | $0.80 | $4.00 | $0.08 | $1.00 |
| Haiku 4.5 | $1.00 | $5.00 | $0.10 | $1.25 |
Batch API: 50% off everything. Long context (>200K tokens): input doubles, output goes to 1.5x.
OpenAI
| Model | Input | Cached Input | Output |
|---|---|---|---|
| GPT-4o | $2.50 | $1.25 | $10.00 |
| GPT-4o-mini | $0.15 | $0.075 | $0.60 |
| GPT-4.1 | $2.00 | $0.50 | $8.00 |
| o3 | $2.00 | $0.50 | $8.00 |
| o3-mini | $1.10 | $0.55 | $4.40 |
| o1 | $15.00 | $7.50 | $60.00 |
Batch API: 50% off all models, results within 24 hours.
DeepSeek
| Model | Input (Cache Hit) | Input (Cache Miss) | Output |
|---|---|---|---|
| DeepSeek-V3 | $0.07 | $0.27 | $1.10 |
| DeepSeek-R1 | $0.14 | $0.55 | $2.19 |
Off-peak discounts (16:30-00:30 GMT): up to 75% off R1, 50% off V3. Caching is automatic: repeated prefixes get cache-hit pricing without any configuration.
Google Gemini
| Model | Input | Output | Context Window |
|---|---|---|---|
| Gemini 2.5 Pro | $1.25 | $10.00 | 1M |
| Gemini 2.5 Flash | $0.30 | $2.50 | 1M |
| Gemini 2.0 Flash | $0.10 | $0.40 | 1M |
The Price Spread Is Absurd
The cheapest way to call a frontier-class model (DeepSeek V3, cached input) costs $0.07 per million tokens. The most expensive (Claude Opus 4 output) costs $75.00 per million tokens. That’s a 1,071x difference.
If you’re using one model for everything, you’re either overpaying for simple tasks or underperforming on complex ones. Neither is good. See our tiered model strategy guide for how to route tasks to the right model.
Where Your Tokens Actually Go
Most people think of token cost as “my prompt + the response.” The reality is more complicated. Here’s what actually gets billed on every API call.
System Prompts: The Tax on Every Request
Your system prompt gets sent as input tokens on every single API call. It doesn’t get “remembered” between calls; it’s re-transmitted every time.
A 2,000-token system prompt doesn’t sound like much. But across 1,000 API calls per day:
- 2M extra input tokens/day
- At Sonnet rates ($3/MTok): $6/day = $180/month
- At Opus 4 rates ($15/MTok): $30/day = $900/month
Just from the system prompt. Before your actual questions even start.
Fix: Cut your system prompt to the minimum needed. A 500-token system prompt instead of 2,000 saves $135/month at Sonnet rates. Better yet, use prompt caching to drop that cost by 90%.
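To sanity-check your own setup, the same arithmetic takes a few lines of Python (a back-of-envelope sketch; the prompt size, call volume, and rate below are the example figures above, so swap in yours):

# Rough cost of re-sending a system prompt on every call.
# Example figures from above -- replace with your own prompt size, volume, and rate.
system_prompt_tokens = 2_000
calls_per_day = 1_000
input_price_per_mtok = 3.00  # Sonnet input, USD per million tokens
daily_cost = system_prompt_tokens * calls_per_day * input_price_per_mtok / 1_000_000
print(f"${daily_cost:.2f}/day, ${daily_cost * 30:.2f}/month")  # $6.00/day, $180.00/month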
Conversation History: The Growing Snowball
In a multi-turn conversation, every previous message gets resent as input on every call. Turn 1 sends your message. Turn 2 sends turns 1 + 2. Turn 20 sends all 20 turns.
The math is ugly. A 20-turn conversation with 500 tokens per turn:
- Turn 1: 500 input tokens
- Turn 10: 5,000 input tokens
- Turn 20: 10,000 input tokens
- Total across all 20 turns: ~105,000 input tokens (the triangular sum)
That’s roughly 10x more than the “20 turns × 500 tokens = 10,000 tokens” you might expect.
Fix: Summarize or truncate conversation history. Send a summary of older turns instead of the full text. Or start fresh conversations more often.
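A minimal sketch of the truncation approach, assuming the standard list of role/content message dicts both APIs use (the window size and summary text are yours to choose):

def truncate_history(messages, keep_last=6, summary=None):
    """Keep only the most recent turns; optionally stand a recap in for the rest."""
    if len(messages) <= keep_last:
        return messages
    recent = messages[-keep_last:]
    if summary:
        # Replace the dropped turns with a single recap message.
        recap = {"role": "user", "content": f"Summary of earlier conversation: {summary}"}
        return [recap] + recent
    return recent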
Tool Definitions: Hidden Token Overhead
If you’re using function calling or tool use, every API call includes hidden tokens for the tool infrastructure, even if no tools are invoked on that call.
Anthropic documents this precisely:
| Tool Configuration | Hidden Tokens Per Request |
|---|---|
auto or none tool choice | 346 tokens |
any or specific tool choice | 313 tokens |
| Each tool definition (avg) | ~150 tokens |
| Bash tool | 245 tokens |
| Text editor tool | 700 tokens |
| Computer use | 466-499 tokens + 735 per tool |
An agent with 5 tools averaging 150 tokens each adds ~1,100 tokens per request (346 base + 750 for definitions). That’s invisible in your prompt but very real in your bill.
OpenAI’s function calling has similar overhead: tool schemas are serialized into the system prompt.
Fix: Only define tools you’ll actually use. Remove tools from API calls where they’re not needed. A call that just needs a text response doesn’t need 5 tool definitions attached.
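In practice that can be as simple as attaching definitions only when a call might need them (a sketch against the Anthropic Messages API; the helper and model name are illustrative):

def call_model(client, messages, tools=None):
    """Send a Messages API request, attaching tool definitions only when needed."""
    kwargs = {"model": "claude-sonnet-4", "max_tokens": 1024, "messages": messages}
    if tools:
        # Omitting tools entirely avoids the per-request tool overhead.
        kwargs["tools"] = tools
    return client.messages.create(**kwargs)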
OpenAI Reasoning Tokens: The Invisible Output Tax
This one is specific to OpenAI’s o-series models (o1, o3, o3-mini, o4-mini) and it’s the most deceptive cost on any AI platform.
When you call an o-series model, it generates internal “reasoning tokens”, a chain-of-thought it uses to work through the problem. These tokens are:
- Billed as output tokens (the most expensive token type)
- Not visible in the API response content
- Not capped; you can’t enforce a strict limit
A request that returns a 500-token visible response might actually consume 2,000+ total output tokens. At o1 rates ($60/MTok output), that’s the difference between $0.03 and $0.12 per request.
Check the completion_tokens_details.reasoning_tokens field in OpenAI responses. If you’re not tracking this, you’re flying blind on o-series costs.
The reasoning.effort parameter (low/medium/high) offers partial control. Set it to low for simple tasks to reduce reasoning token burn.
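A sketch of both levers together, assuming the current OpenAI Python SDK (Chat Completions exposes the knob as reasoning_effort; the Responses API spells it reasoning.effort):

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="low",  # low / medium / high -- low cuts hidden thinking on simple tasks
    messages=[{"role": "user", "content": "Classify this ticket: 'refund not received'"}],
)

details = response.usage.completion_tokens_details
print("reasoning tokens (billed as output):", details.reasoning_tokens)
print("visible output tokens:", response.usage.completion_tokens - details.reasoning_tokens)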
Image Tokens: More Than You’d Think
If you’re using vision capabilities, images are converted to tokens:
| Provider | 1024×1024 Image | Cost at Mid-Tier Pricing |
|---|---|---|
| Claude | ~1,398 tokens | $0.004 (Sonnet) |
| GPT-4o | ~765 tokens | $0.002 |
| Gemini | ~1,290 tokens | $0.0004 (2.0 Flash) |
Seems tiny per image. But process 1,000 screenshots a day and you’re adding $4/day in input tokens on Claude alone, before any text prompt.
How to Read Your Token Usage
Every major API returns a usage object in every response. Here’s how to read them.
Claude (Anthropic)
{
"usage": {
"input_tokens": 50,
"cache_creation_input_tokens": 1500,
"cache_read_input_tokens": 18000,
"output_tokens": 393
}
}
- input_tokens: Tokens after the last cache breakpoint (not total input)
- cache_creation_input_tokens: Tokens written to cache on this request (billed at 1.25x input)
- cache_read_input_tokens: Tokens read from cache (billed at 0.1x input)
- output_tokens: Tokens generated in the response
Total input = input_tokens + cache_creation_input_tokens + cache_read_input_tokens
The key insight: input_tokens is NOT your total input. If you’re using caching and only tracking input_tokens, you’re undercounting by the size of your cached content.
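A small helper that does the accounting correctly (a sketch against the Anthropic Python SDK’s usage object; the or 0 guards cover responses where the cache fields are missing or null):

def true_input_tokens(usage):
    """Total input tokens actually sent, including cache writes and reads."""
    return (
        usage.input_tokens
        + (getattr(usage, "cache_creation_input_tokens", 0) or 0)
        + (getattr(usage, "cache_read_input_tokens", 0) or 0)
    )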
OpenAI
{
"usage": {
"prompt_tokens": 1250,
"completion_tokens": 500,
"total_tokens": 1750,
"prompt_tokens_details": {
"cached_tokens": 1000
},
"completion_tokens_details": {
"reasoning_tokens": 200
}
}
}
- prompt_tokens: Total input tokens (including cached)
- completion_tokens: Total output tokens (including reasoning)
- cached_tokens: How many input tokens came from cache (billed at 50% rate)
- reasoning_tokens: Hidden thinking tokens (o-series only; billed as output but invisible in the response)
For streaming, add stream_options: {"include_usage": true} or you won’t get the usage object at all.
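With the OpenAI Python SDK that looks roughly like this (a sketch; the usage object arrives only on the final chunk, which carries no choices):

from openai import OpenAI

client = OpenAI()
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain quicksort"}],
    stream=True,
    stream_options={"include_usage": True},
)

usage = None
for chunk in stream:
    if chunk.usage:  # only the final chunk has this set
        usage = chunk.usage
    for choice in chunk.choices:
        print(choice.delta.content or "", end="")

print("\nprompt:", usage.prompt_tokens, "completion:", usage.completion_tokens)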
DeepSeek
DeepSeek follows OpenAI’s format with prompt_tokens and completion_tokens. Caching is automatic and server-side; you can check the dashboard for cache hit rates, but individual responses don’t break out cached vs. uncached tokens.
Build a Token Logger: 50 Lines of Python
Stop guessing. Log every API call with its token count and cost. Here’s a practical logger that works with both Anthropic and OpenAI SDKs.
import json
from datetime import datetime, timezone
# Pricing per million tokens (update when prices change)
PRICING = {
# Anthropic
"claude-opus-4": {"input": 15.0, "output": 75.0,
"cache_read": 1.50, "cache_write": 18.75},
"claude-sonnet-4": {"input": 3.0, "output": 15.0,
"cache_read": 0.30, "cache_write": 3.75},
"claude-haiku-3.5": {"input": 0.80, "output": 4.0,
"cache_read": 0.08, "cache_write": 1.00},
# OpenAI
"gpt-4o": {"input": 2.5, "output": 10.0, "cached_input": 1.25},
"gpt-4o-mini": {"input": 0.15, "output": 0.60, "cached_input": 0.075},
"o3-mini": {"input": 1.10, "output": 4.40, "cached_input": 0.55},
# DeepSeek
"deepseek-chat": {"input": 0.27, "output": 1.10, "cached_input": 0.07},
"deepseek-reasoner": {"input": 0.55, "output": 2.19, "cached_input": 0.14},
}
def log_anthropic(response, model, label=""):
"""Log an Anthropic API response. Returns cost in USD."""
u = response.usage
p = PRICING.get(model, {})
    # `or 0` guards against cache fields that come back as None on some responses
    cache_write = getattr(u, "cache_creation_input_tokens", 0) or 0
    cache_read = getattr(u, "cache_read_input_tokens", 0) or 0
cost = (
u.input_tokens * p.get("input", 0)
+ cache_write * p.get("cache_write", 0)
+ cache_read * p.get("cache_read", 0)
+ u.output_tokens * p.get("output", 0)
) / 1_000_000
entry = {
"ts": datetime.now(timezone.utc).isoformat(),
"provider": "anthropic", "model": model, "label": label,
"input": u.input_tokens, "cache_write": cache_write,
"cache_read": cache_read, "output": u.output_tokens,
"cost_usd": round(cost, 6),
}
_append(entry)
return entry
def log_openai(response, model, label=""):
"""Log an OpenAI API response. Returns cost in USD."""
u = response.usage
p = PRICING.get(model, {})
cached = 0
reasoning = 0
if u.prompt_tokens_details:
cached = getattr(u.prompt_tokens_details, "cached_tokens", 0)
if u.completion_tokens_details:
reasoning = getattr(u.completion_tokens_details,
"reasoning_tokens", 0)
uncached_input = u.prompt_tokens - cached
cost = (
uncached_input * p.get("input", 0)
+ cached * p.get("cached_input", p.get("input", 0))
+ u.completion_tokens * p.get("output", 0)
) / 1_000_000
entry = {
"ts": datetime.now(timezone.utc).isoformat(),
"provider": "openai", "model": model, "label": label,
"input": u.prompt_tokens, "cached": cached,
"output": u.completion_tokens, "reasoning": reasoning,
"cost_usd": round(cost, 6),
}
_append(entry)
return entry
LOG_FILE = "token_log.jsonl"
def _append(entry):
with open(LOG_FILE, "a") as f:
f.write(json.dumps(entry) + "\n")
def daily_summary():
"""Print cost summary grouped by model."""
totals = {}
with open(LOG_FILE) as f:
for line in f:
e = json.loads(line)
model = e["model"]
totals.setdefault(model, {"calls": 0, "cost": 0.0})
totals[model]["calls"] += 1
totals[model]["cost"] += e["cost_usd"]
print(f"{'Model':<25} {'Calls':>6} {'Cost':>10}")
print("-" * 43)
grand = 0
for model, data in sorted(totals.items(), key=lambda x: -x[1]["cost"]):
print(f"{model:<25} {data['calls']:>6} ${data['cost']:>9.4f}")
grand += data["cost"]
print("-" * 43)
print(f"{'TOTAL':<25} {'':>6} ${grand:>9.4f}")
Usage
import anthropic
client = anthropic.Anthropic()
response = client.messages.create(
model="claude-sonnet-4",
max_tokens=1024,
messages=[{"role": "user", "content": "Explain quicksort"}],
)
log_anthropic(response, "claude-sonnet-4", label="quicksort-explainer")
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Explain quicksort"}],
)
log_openai(response, "gpt-4o", label="quicksort-explainer")
Every call logs to token_log.jsonl. Run daily_summary() to see where your money went. Pipe it into a dashboard, cron it, or just check it when the bill surprises you.
Prompt Caching: The Biggest Single Savings
Prompt caching is the highest-leverage optimization available on any API. If you’re sending the same system prompt, tool definitions, or context documents repeatedly, caching reduces those tokens to 10% of the base cost.
How It Works (Anthropic)
Mark static content with cache_control:
response = client.messages.create(
model="claude-sonnet-4",
max_tokens=1024,
system=[
{
"type": "text",
"text": "Your long system prompt here...",
"cache_control": {"type": "ephemeral"},
}
],
messages=[{"role": "user", "content": "Your question"}],
)
First call: cached content is written at 1.25x the input rate. Subsequent calls within 5 minutes: cached content is read at 0.1x the input rate, a 90% discount.
Real numbers with a 10,000-token system prompt at Sonnet rates:
| | Without Caching | With Caching (after first call) |
|---|---|---|
| System prompt cost per call | $0.030 | $0.003 |
| 1,000 calls/day | $30.00 | $3.00 + $0.0375 cache write |
| Monthly | $900 | $91 |
That’s a 90% reduction on your largest recurring cost.
How It Works (OpenAI)
OpenAI caching is automatic for prompts of 1,024 tokens or more. No code changes needed. Cached tokens are billed at 50% of the input rate (less aggressive than Anthropic’s 90% discount, but it takes zero configuration).
How It Works (DeepSeek)
Also automatic. Repeated prefixes get cache-hit pricing server-side. DeepSeek V3 drops from $0.27 to $0.07 per MTok on cache hits, a 74% reduction that happens without any configuration.
Cost Reduction Cheat Sheet
| Strategy | Effort | Savings |
|---|---|---|
| Enable prompt caching | Low (add one field) | 50-90% on repeated content |
| Use batch API | Low (change endpoint) | 50% on non-real-time work |
| Trim system prompt | Medium (rewrite prompt) | 20-40% on input costs |
| Model tiering | Medium (add routing logic) | 40-70% overall |
| Truncate conversation history | Medium (add summarization) | 30-60% on multi-turn costs |
| Remove unused tool definitions | Low (delete lines) | 5-15% on tool-heavy calls |
| Set reasoning effort to low | Low (add parameter) | 30-50% on o-series output |
| Use DeepSeek for non-critical tasks | Low (change model) | 80-95% vs Claude/OpenAI |
The Tiered Model Approach
Don’t use one model for everything. Route tasks to the cheapest model that handles them well:
| Task Type | Model | Cost per 1K Calls (avg 500-token response) |
|---|---|---|
| Classification, routing | DeepSeek V3 (cached) | $0.04 |
| Simple Q&A, formatting | Haiku 3.5 | $2.40 |
| Standard generation | Sonnet 4 | $9.00 |
| Complex reasoning | Opus 4.5/4.6 | $14.50 |
| Maximum capability | Opus 4 | $45.00 |
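The routing layer doesn’t have to be clever; a lookup table covers most of it (a sketch with illustrative task labels and model names, not a prescription for your workload):

MODEL_TIERS = {
    "classify": "deepseek-chat",     # classification, routing
    "extract":  "claude-haiku-3.5",  # simple Q&A, formatting, extraction
    "generate": "claude-sonnet-4",   # standard generation
    "reason":   "claude-opus-4-5",   # complex reasoning
}

def pick_model(task_type: str) -> str:
    # Unknown task types fall back to the mid tier.
    return MODEL_TIERS.get(task_type, "claude-sonnet-4")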
One team cut monthly spend from $3,200 to $1,100 (a 66% reduction) by routing tasks across three tiers instead of running everything through Sonnet. See our tiered model strategy guide for the full breakdown.
When Local Models Win on Cost
API pricing makes sense at low to moderate volume. But at scale, local models eliminate per-token costs entirely.
| Daily Volume | API Cost (Sonnet) | API Cost (DeepSeek V3) | Local Cost (RTX 3090) |
|---|---|---|---|
| 100K tokens | $0.90 | $0.07 | ~$0.15 (electricity) |
| 1M tokens | $9.00 | $0.68 | ~$0.15 |
| 10M tokens | $90.00 | $6.80 | ~$0.30 |
| 100M tokens | $900.00 | $68.00 | ~$0.50 |
At 1M tokens/day, DeepSeek V3 is cheaper than running your own hardware once you factor in the upfront hardware cost. At 10M tokens/day, local starts winning. At 100M tokens/day, local is roughly 136x cheaper than even DeepSeek V3, and orders of magnitude cheaper than Sonnet.
The break-even point for a $750 used RTX 3090 setup:
- vs Sonnet: ~3 months at 1M tokens/day
- vs DeepSeek V3: ~12-18 months at 10M tokens/day
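The Sonnet case is easy to verify (a sketch using the figures from the table above; your hardware price and electricity cost will differ):

hardware_cost = 750.00       # used RTX 3090 setup, USD
api_cost_per_day = 9.00      # Sonnet at 1M tokens/day, from the table
electricity_per_day = 0.15   # rough local running cost at that volume
break_even_days = hardware_cost / (api_cost_per_day - electricity_per_day)
print(f"~{break_even_days:.0f} days")  # ~85 days, roughly 3 months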
The tradeoff: local models (Qwen 3 32B, Llama 3.3 70B) are good, but not Sonnet-good for complex tasks. For simple tasks (classification, formatting, extraction, embeddings), local models match API quality at zero marginal cost.
See our cost to run LLMs locally guide for the full hardware and electricity breakdown.
Run Your First Audit
Here’s a 15-minute audit you can run right now:
1. Check your dashboard. Anthropic, OpenAI, DeepSeek: look at the last 7 days.
2. Find your top spending model. Is it the model you expected? If you’re spending 80% on Opus but only 20% of your tasks need it, that’s your biggest fix.
3. Check your average input size. If your average input is over 2,000 tokens and you’re not using caching, enable caching today. Literally today. It’s the single highest-ROI change.
4. Look for pattern waste. Are you making the same call repeatedly with the same system prompt? That’s a caching opportunity. Are you sending full conversation history in contexts that don’t need it? Truncate.
5. Add the logger. Drop the Python logger from this guide into your codebase. Run it for one week. The daily_summary() output will show you exactly where your money goes, and it’s usually not where you think.
The Bottom Line
AI API costs are predictable and controllable, provided you actually measure them. The usage object is in every API response. The tools exist. Most people just don’t look.
The biggest wins, in order:
- Enable prompt caching: 50-90% savings on repeated content, one line of code
- Use the right model for each task: tiered routing cuts 40-70% overall
- Trim your system prompt: every token saved multiplies across every call
- Log everything: you can’t optimize what you don’t measure
- Consider local for high-volume, lower-complexity tasks: zero marginal cost after hardware
The difference between a $500/month AI bill and a $50/month AI bill usually isn’t using less AI. It’s using the right AI for each task and not paying for tokens you didn’t need to send.