Before touching any code I exported the OpenAI usage CSV from the platform usage dashboard and pulled it into DuckDB. The dashboard's per-model breakdown is fine for a glance, but you can't slice it by your own request metadata, and that's where the real signal lives.
If you're already tagging requests with a user field (and you should be: it's free, and OpenAI's Chat Completions reference documents it explicitly for abuse detection), you can append a feature tag to the hashed user ID in that same field and get per-feature cost attribution without standing up a custom metrics pipeline:
import hashlib

from openai import OpenAI

client = OpenAI()


def tagged_user(user_id: str, feature: str) -> str:
    # OpenAI stores this opaquely; we use it for cost slicing.
    return f"{hashlib.sha256(user_id.encode()).hexdigest()[:12]}:{feature}"


# `prompt` and `user_id` come from the surrounding request handler.
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    user=tagged_user(user_id, "ticket_classifier_v3"),
)
Once I had a week of tagged data I ran the obvious query and the answer was uncomfortable. Three endpoints that produced almost no user-visible value accounted for 62% of our total token spend: a sidebar "related articles" widget, a background tag-suggester, and an overnight summarization job. The features users actually notice were the minority of the bill.
This is the part that nobody writes blog posts about, because it's not glamorous, but it's the only step that matters: you cannot optimize what you cannot attribute. Every LLM bill I've helped a team untangle has the same shape: a small number of background jobs eating 60–80% of the spend, and a feature owner who has no idea.
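The attribution query itself is simple once the tags are in place. Here is a minimal sketch in plain Python, assuming each usage row has already been reduced to a `user` tag in the `hash:feature` format shown earlier plus a `cost_usd` figure. Both key names are hypothetical and are not the actual column names of the CSV export.

```python
from collections import defaultdict


def spend_by_feature(rows):
    """Return [(feature, cost, share_of_total)] sorted by cost, descending.

    `rows` is an iterable of dicts with hypothetical keys "user" and "cost_usd".
    """
    totals = defaultdict(float)
    for row in rows:
        # The tag looks like "<12-char hash>:<feature>"; rows without a tag are
        # grouped under "untagged" so they still show up in the report.
        _, sep, feature = row["user"].partition(":")
        totals[feature if sep else "untagged"] += float(row["cost_usd"])
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [
        (feature, cost, cost / grand_total if grand_total else 0.0)
        for feature, cost in ranked
    ]


# Toy rows for illustration only.
rows = [
    {"user": "a1b2c3d4e5f6:related_articles", "cost_usd": "6.0"},
    {"user": "0f9e8d7c6b5a:ticket_classifier_v3", "cost_usd": "3.0"},
    {"user": "a1b2c3d4e5f6:related_articles", "cost_usd": "1.0"},
]
report = spend_by_feature(rows)
```

Sorting by cost puts the 80/20 offenders at the top, which is usually all the first pass needs.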
Step 2: Semantic caching killed 38% of our traffic
The biggest single win was a semantic cache in front of the sidebar widget. The widget asks "given this article, suggest 5 related articles from our corpus" — and we were calling the API on every page view, even though the same article gets viewed thousands of times a day with identical context.
A naive string-key cache catches exact matches but misses the long tail. A semantic cache embeds the prompt, looks up the nearest neighbour in a vector store, and returns the cached completion if the cosine similarity is above some threshold. I used text-embedding-3-small (cheap, $0.02 per million tokens) and Redis with the RediSearch vector module as the index:
import hashlib

import numpy as np
from openai import OpenAI
from redis import Redis
from redis.commands.search.query import Query

client = OpenAI()
r = Redis(decode_responses=False)  # raw bytes: the vector field is binary
SIM_THRESHOLD = 0.93  # tune this; see caveat below


def embed(text: str) -> np.ndarray:
    e = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    ).data[0].embedding
    return np.array(e, dtype=np.float32)


def cached_complete(prompt: str, model: str) -> str:
    vec = embed(prompt)
    # Assumes an existing "llm_cache" index over the "cache:" key prefix
    # whose vector field "v" uses the COSINE distance metric.
    q = (
        Query("(*)=>[KNN 1 @v $vec AS score]")
        .sort_by("score")
        .return_fields("completion", "score")
        .dialect(2)
    )
    hits = r.ft("llm_cache").search(q, {"vec": vec.tobytes()}).docs
    # The index returns cosine distance, so similarity = 1 - distance.
    if hits and (1 - float(hits[0].score)) >= SIM_THRESHOLD:
        cached = hits[0].completion
        return cached.decode() if isinstance(cached, bytes) else cached

    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    # Built-in hash() is salted per process, so use a stable digest as the key.
    key = hashlib.sha256(prompt.encode()).hexdigest()
    r.hset(
        f"cache:{key}",
        mapping={"v": vec.tobytes(), "completion": completion},
    )
    return completion
The threshold of 0.93 is not arbitrary. I validated it against a held-out set of 2,000 prompt pairs where a human had labeled whether the same completion was acceptable for both. Below 0.91 the cache started serving subtly wrong responses (e.g., suggesting iOS articles for an Android question). Above 0.96 the hit rate collapsed to under 10%. Your number will be different; measure it, don't copy mine.
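If you want to run the same kind of measurement, a threshold sweep over labeled pairs is enough to see the trade-off. This is a minimal sketch, not the harness used for the numbers above. It assumes you have already computed a cosine similarity for each pair and have a human label for whether one completion was acceptable for both prompts.

```python
def sweep_thresholds(pairs, thresholds):
    """For each threshold, report the hit rate and the share of hits that are wrong.

    `pairs` is a list of (cosine_similarity, same_completion_ok) tuples, where
    the boolean is the human label.
    """
    results = []
    for t in thresholds:
        # A pair counts as a cache hit when its similarity clears the threshold.
        hits = [ok for sim, ok in pairs if sim >= t]
        hit_rate = len(hits) / len(pairs) if pairs else 0.0
        bad_hit_rate = hits.count(False) / len(hits) if hits else 0.0
        results.append(
            {"threshold": t, "hit_rate": hit_rate, "bad_hit_rate": bad_hit_rate}
        )
    return results


# Toy data for illustration only; not the 2,000-pair set described above.
labeled = [(0.99, True), (0.95, True), (0.92, False), (0.90, True), (0.85, False)]
table = sweep_thresholds(labeled, [0.90, 0.93, 0.96])
```

Pick the lowest threshold whose `bad_hit_rate` you can live with. Everything above that point is hit rate you are giving away.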
Hit rate on the sidebar widget settled at 71% after a week of warm-up, which dropped that endpoint's spend from roughly $3,400/month to $980/month. The embedding calls themselves cost about $40/month — completely negligible.
For more on caching strategies that translate across providers, see our writeup on LLM response caching patterns.
Step 3: Model tiering — stop paying gpt-4o prices for gpt-4o-mini work
The ticket classifier was the obvious offender. Someone had upgraded it to gpt-4o because "accuracy went up 4 points on the eval set." That's true, but the eval set was 200 hand-curated edge cases. On the actual production distribution, gpt-4o-mini was within 1.2 points and costs roughly 17× less per input token as of May 2026 ($0.15 vs $2.50 per million input tokens).
I built a two-tier router: gpt-4o-mini handles everything, and we only escalate to gpt-4o when the mini model returns low confidence. The trick is getting a usable confidence signal out of an LLM, which you can do with logprobs:
import math


def classify_with_escalation(ticket: str) -> tuple[str, str]:
    # `client` and CLASSIFIER_PROMPT are defined elsewhere in the service.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": CLASSIFIER_PROMPT},
            {"role": "user", "content": ticket},
        ],
        logprobs=True,
        top_logprobs=5,
        max_tokens=3,
    )
    choice = resp.choices[0]
    label = choice.message.content.strip()
    # Read the logprob of the token the model actually emitted; top_logprobs[0]
    # is the most likely token, which can differ from the sampled one.
    first_token = choice.logprobs.content[0]
    confidence = math.exp(first_token.logprob)  # probability of first label token
    if confidence >= 0.85:
        return label, "mini"

    # Escalate ~8% of traffic to the big model.
    resp2 = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": CLASSIFIER_PROMPT},
            {"role": "user", "content": ticket},
        ],
        max_tokens=3,
    )
    return resp2.choices[0].message.content.strip(), "escalated"
In production this routes 92% of tickets through mini and only escalates 8% to gpt-4o. End-to-end accuracy stayed within 0.3 points of the all-gpt-4o baseline, and the classifier's bill went from $4,800/month to $520/month.
One subtle thing: I don't use gpt-4o as the second tier blindly. If the mini answer was confident but seemed weird (e.g., a label that's never appeared before), I log the case but don't escalate — escalation is for genuine uncertainty, not for sanity-checking already-confident outputs.
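The "log but don't escalate" rule can be kept separate from the API call, which makes it easy to test. This is a small sketch of that decision, assuming a known label set. The `KNOWN` set and the function name are hypothetical, and 0.85 is the threshold from the router above.

```python
def route(label, confidence, known_labels, threshold=0.85):
    """Return (action, needs_review).

    action is "accept" or "escalate". needs_review flags confident answers
    whose label has not been seen before; those are logged, not escalated.
    """
    if confidence < threshold:
        return "escalate", False
    return "accept", label not in known_labels


KNOWN = {"billing", "bug", "how_to"}  # hypothetical label set
```

Keeping the review flag separate from the escalation decision means odd-but-confident outputs show up in logs without adding gpt-4o spend.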
Step 4: Move backfills and overnight jobs to the Batch API
The overnight summarization job was running 38,000 sequential requests against the synchronous endpoint. This was wrong on two levels: it was paying full price, and it was competing for the same rate-limit bucket as our live traffic, occasionally causing 429s on the user-facing API.
The Batch API gives a 50% discount on input and output tokens in exchange for a 24-hour SLA on completion. For anything that doesn't need to be real-time — backfills, nightly digests, ETL summarization, eval runs — this is a no-brainer. The shape of a batch job is a JSONL file uploaded to the Files API, then a single batch create call:
import json

from openai import OpenAI

client = OpenAI()


def build_batch(articles):
    lines = []
    for a in articles:
        lines.append(json.dumps({
            "custom_id": f"summary-{a['id']}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [
                    {"role": "system", "content": "Summarize in 80 words."},
                    {"role": "user", "content": a["body"]},
                ],
                "max_tokens": 200,
            },
        }))
    return "\n".join(lines).encode()


f = client.files.create(
    file=("nightly.jsonl", build_batch(articles)),
    purpose="batch",
)
batch = client.batches.create(
    input_file_id=f.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
Migrating the nightly job took about a day, mostly to handle the asynchronous result-fetching plumbing (you poll batches.retrieve(batch.id) until status is completed, then download the output file). It immediately halved that workload's cost and — more importantly — eliminated the 429 spikes we were seeing in our user-facing endpoints between 2 and 4 AM UTC.
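The result-fetching plumbing is roughly this shape. It is a sketch under assumptions, not the production code. `wait_for_batch` and `parse_batch_output` are hypothetical helper names, the client is passed in so the loop can be tested with a stub, and the parser assumes the documented output format of one JSON object per line with `custom_id`, `response`, and `error` fields.

```python
import json
import time

TERMINAL = {"completed", "failed", "expired", "cancelled"}


def wait_for_batch(client, batch_id, poll_seconds=60, sleep=time.sleep):
    """Poll until the batch reaches a terminal status, then return it."""
    while True:
        batch = client.batches.retrieve(batch_id)
        if batch.status in TERMINAL:
            return batch
        sleep(poll_seconds)


def parse_batch_output(jsonl_text):
    """Map custom_id -> message content for every successful line."""
    results = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        item = json.loads(line)
        response = item.get("response") or {}
        if item.get("error") or response.get("status_code") != 200:
            continue  # failed requests are skipped here; retry them separately
        body = response["body"]
        results[item["custom_id"]] = body["choices"][0]["message"]["content"]
    return results
```

With the real SDK you would call `wait_for_batch(client, batch.id)`, check that the status is `completed`, then pass `client.files.content(batch.output_file_id).text` to `parse_batch_output`. Output lines are not guaranteed to be in input order, which is why the results are keyed by `custom_id`.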
Savings from this one change: ~$1,600/month, plus the indirect win of no longer needing to over-provision rate-limit headroom for the live API.
Step 5: Prompt pruning and rate shaping (the long tail)
After the big three changes I went hunting for the long tail. Two things mattered.
Prompt pruning. Our system prompts had accumulated cruft — examples that were no longer needed, polite instructions ("please respond accurately"), historical scaffolding from earlier prompt versions. I rewrote the four biggest prompts from scratch and dropped average input tokens per call from 1,840 to 720. Same outputs, measured on the same eval set. That alone saved about $400/month and improved p50 latency by 180ms because the model has less to read.
If you're using the same long system prompt across many calls, also lean on OpenAI's automatic prompt caching. There's nothing to switch on, but you do have to structure your prompts to take advantage of it: static content goes first, dynamic content last. Cached input tokens are billed at 50% of the normal rate when the prefix matches.
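In practice "static first, dynamic last" is just a message-ordering rule. This is a minimal sketch, with hypothetical function names, of building requests so that the unchanging part forms a shared prefix, plus a helper to check how much of the prefix two requests have in common.

```python
def build_messages(static_system, static_examples, dynamic_context, user_input):
    """Order messages so the unchanging part forms a shared prefix."""
    messages = [{"role": "system", "content": static_system}]
    messages.extend(static_examples)  # fixed few-shot turns, if any
    messages.append({  # per-request material goes last
        "role": "user",
        "content": f"{dynamic_context}\n\n{user_input}",
    })
    return messages


def shared_prefix_len(a, b):
    """Number of leading messages two requests have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n
```

To confirm the cache is actually being hit, check `usage.prompt_tokens_details.cached_tokens` on the response. A value that stays at zero usually means something dynamic (a timestamp, a user name) has crept into the front of the prompt.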
Rate shaping. We were hitting bursty 429s during traffic spikes, and our retry logic was making it worse — naïve exponential backoff with no jitter caused thundering herds. I switched to a token-bucket client-side limiter sized to 80% of our actual rate limit, with jittered backoff on 429:
import asyncio
import random
import time


class TokenBucket:
    def __init__(self, rate_per_min: int):
        self.capacity = rate_per_min
        self.tokens = rate_per_min
        self.refill_rate = rate_per_min / 60.0  # tokens per second
        self.last = time.monotonic()
        self.lock = asyncio.Lock()

    async def acquire(self):
        # Sleeping while holding the lock makes other waiters queue up
        # behind this one instead of all waking at once.
        async with self.lock:
            now = time.monotonic()
            self.tokens = min(
                self.capacity,
                self.tokens + (now - self.last) * self.refill_rate,
            )
            self.last = now
            if self.tokens < 1:
                wait = (1 - self.tokens) / self.refill_rate
                await asyncio.sleep(wait + random.uniform(0, 0.05))
                self.tokens = 0
                # Restart the refill clock so the time slept is not credited twice.
                self.last = time.monotonic()
            else:
                self.tokens -= 1
This stopped us paying for failed-then-retried requests (yes, you get billed for the input tokens of a request that returns a 5xx in some cases — check your invoice) and smoothed out the cost curve so capacity planning got easier.
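The jittered backoff half of that change is not shown above, so here is one common way to do it: the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing cap. This is a sketch under assumptions. The function names, base, and cap are illustrative, and the caller supplies the predicate that decides what counts as a rate-limit error.

```python
import random
import time


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retry(fn, is_rate_limited, max_attempts=5, sleep=time.sleep):
    """Call fn(); on a rate-limit error, sleep a jittered delay and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            # Re-raise anything that is not a rate limit, and give up after
            # the final attempt instead of sleeping again.
            if not is_rate_limited(exc) or attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt))
```

With the OpenAI Python SDK the predicate could be `lambda e: isinstance(e, openai.RateLimitError)`. The SDK also retries on its own by default, so if you wrap calls like this, consider constructing the client with `max_retries=0` to avoid stacking two retry layers.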
Caveats: where I'd push back on this advice
Semantic caching is dangerous for anything that depends on freshness — pricing, inventory, news. Don't cache prompts that include time-sensitive data unless you're invalidating aggressively. I keep a deny-list of prompt prefixes that bypass the cache entirely.
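The deny-list can be as simple as a prefix check that runs before the cache lookup. This is a minimal sketch. The prefixes below are hypothetical examples, not the actual list.

```python
DENY_PREFIXES = (
    "Current price for",  # hypothetical freshness-sensitive prompt prefixes
    "Inventory status:",
    "Today's headlines",
)


def is_cacheable(prompt, deny_prefixes=DENY_PREFIXES):
    """False for prompts that must always go to the model."""
    return not prompt.lstrip().startswith(tuple(deny_prefixes))
```

Call `is_cacheable(prompt)` at the top of the cache wrapper and go straight to the API when it returns False, skipping both the lookup and the write.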
Model tiering with confidence-based escalation assumes your prompts produce a single token (or a short token sequence) you can read logprobs on. For free-form generation, confidence signals are noisier, and the usual suggestion is a separate smaller "judge" model that decides whether to regenerate with a bigger model. We tried that for marketing copy and the results were worse than just using gpt-4o-mini directly, so evaluate it on your own workload before adopting it.
Batch API's 24-hour window is a real SLA — it usually finishes in 1–4 hours, but I've seen jobs sit for 22 hours during peak demand around big model launches. If your job has a hard deadline, build in a fallback to the sync API for the remainder.
Finally: all the dollar numbers in this post are our dollar numbers. Your traffic mix, prompt shapes, and accuracy tolerance are different. The methodology — attribute, cache, tier, batch, prune — generalizes. The specific savings won't.
FAQ
Does prompt caching work with all OpenAI models?
As of May 2026 it's enabled by default on gpt-4o, gpt-4o-mini, o1, o1-mini, and o3-mini, with a minimum cached prefix of 1,024 tokens. It doesn't apply to embeddings or the older gpt-3.5-turbo family. The official docs have the current eligibility list — check before assuming you're getting the discount.
Is semantic caching worth it if I'm already using OpenAI's built-in prompt caching?
Yes, they solve different problems. Prompt caching discounts the input tokens when the prefix matches; semantic caching avoids the call entirely when the whole prompt is semantically similar to something you've seen before. For a high-repetition workload (recommendation widgets, FAQ bots, support classifiers) semantic caching usually saves far more, because a hit removes the output tokens and the uncached input tokens as well.
Should I move to a self-hosted open model like Llama 3.3?
Run the math first. A single H100 node on AWS is roughly $4/hour on-demand, which comes to about $2,900/month if it runs around the clock. That's only cheaper than the OpenAI API if you're sustaining heavy throughput; for spiky workloads the GPU sits idle and you lose. We benchmarked a Llama 3.3 70B deployment against our gpt-4o-mini spend and it didn't break even until ~$3,500/month of API usage on a single workload type, and that's before counting the engineering time to run it.
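The break-even arithmetic is short enough to write down. This is a rough sketch using the $4/hour figure above and assuming the node runs continuously (730 hours in an average month). The `fixed_overhead` parameter is a hypothetical placeholder for storage, networking, and on-call time.

```python
HOURS_PER_MONTH = 730  # average hours in a month


def monthly_gpu_cost(hourly_rate, hours=HOURS_PER_MONTH, fixed_overhead=0.0):
    """Cost of keeping a GPU node up for the given hours, plus fixed overhead."""
    return hourly_rate * hours + fixed_overhead


def self_hosting_wins(api_spend_per_month, hourly_rate, fixed_overhead=0.0):
    """True if the API bill for the workload exceeds an always-on node."""
    return api_spend_per_month > monthly_gpu_cost(
        hourly_rate, fixed_overhead=fixed_overhead
    )
```

This only compares dollars. It says nothing about whether one node can actually serve the workload's throughput, which needs its own benchmark.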
How do I stop a single bad deploy from blowing up the bill again?
Two things. One: set a hard monthly usage limit in the OpenAI dashboard — the API will start returning 429s when you hit it, which is much better than a five-figure surprise. Two: alert on rate of spend, not absolute spend. We use a simple Datadog monitor on tokens-per-minute with a per-feature breakdown; anything that doubles its baseline triggers a Slack ping before it does real damage.
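The alert rule itself is simple, whatever monitoring tool evaluates it. This is a plain-Python sketch of the "doubles its baseline" check, assuming you have per-feature tokens-per-minute for the current window and for a trailing baseline. The function name and dict shapes are illustrative and are not a Datadog API.

```python
def features_over_baseline(current_tpm, baseline_tpm, factor=2.0):
    """Return features whose tokens-per-minute is at least `factor` x baseline.

    Features with traffic but no baseline are flagged too, since brand-new
    spend is exactly what a bad deploy looks like.
    """
    flagged = []
    for feature, tpm in current_tpm.items():
        baseline = baseline_tpm.get(feature, 0)
        if tpm > 0 and tpm >= factor * baseline:
            flagged.append(feature)
    return sorted(flagged)


# Toy numbers for illustration only.
current = {"related_articles": 9000, "ticket_classifier_v3": 1100, "new_thing": 50}
baseline = {"related_articles": 4000, "ticket_classifier_v3": 1000}
alerts = features_over_baseline(current, baseline)
```

Alerting on the ratio catches a runaway feature within minutes, where an absolute monthly threshold might not trip for days.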
Closing
The thing I keep telling teams is that LLM cost optimization isn't fundamentally different from any other cloud cost work. You attribute spend to features, you find the 80/20, you apply the cheapest tactic that works (caching), then the next cheapest (tiering), then the operationally heavier ones (batching, self-hosting). The reason people end up with $14k surprise bills is that the tooling for attribution is worse than it is for, say, EC2 — but the playbook is the same one you'd run on a runaway Lambda invocation.
If you want to take this further, our piece on running an LLM FinOps monthly review walks through the meeting cadence and the specific metrics we now track every month to keep this from happening again.