Stop Treating Token Usage Like A Prompt Problem

The era of pretending an agent session is “one request” is ending.

That was always a suspiciously tidy fiction. A quick chat question and a coding agent that reads half a repository, calls tools, runs tests, rewrites files, explains itself, gets steered, and opens a pull request do not cost the same thing to run. They just looked similar when the bill arrived wrapped in a friendly subscription or a chunky request counter.

Now the meter is getting less romantic.

GitHub’s Copilot billing change is the loudest recent signal for everyday developers. Starting June 1, 2026, GitHub says Copilot usage moves to AI Credits, with usage calculated from token consumption: input tokens, output tokens, and cached tokens, priced by model. The company’s explanation is blunt in the way finance departments are accidentally poetic: Copilot has evolved from an in-editor assistant into an agentic platform capable of long, multi-step sessions across entire repositories.

In other words: the agent grew legs, learned to wander around the codebase, and now the receipt has footnotes.

This is not only a GitHub story. OpenAI and Anthropic pricing docs already expose the same shape underneath: regular input, cached input, output, batch processing, model tier, reasoning effort, tool calls, and agent handoffs all change cost. The details differ by provider. The lesson does not.

Token usage is workflow design.

If your agent spends too much, the first question is not “How do I make the prompt shorter?” The better question is:

Why is this workflow buying these tokens?

The Bill Is A Trace

Token spend is easy to misunderstand because the visible interaction is small.

You type:

Fix the checkout bug.

The agent sees something much larger:

system instructions
developer instructions
repository guidance
tool definitions
selected chat history
files it reads
search results
test output
tool responses
intermediate reasoning budget
subagent summaries
generated patches
final explanation

Then it may do that again. And again. And again, because the first test failed for a different reason and now the agent wants to “inspect related files,” a phrase that should make your budget sit up straight.

OpenAI’s Agents SDK docs make this visible in a useful way: usage can be aggregated across a full run, including tool calls and handoffs, and broken down into input tokens, output tokens, cached input, and reasoning tokens. That is the right level to watch. Not “what did this one prompt cost?” but “what did this whole loop cost to finish the job?”
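To make "whole loop" concrete, here is a minimal sketch of run-level aggregation. It does not use any provider SDK: the `StepUsage` shape, the step labels, and the token numbers are all made up for illustration, and treating cached and reasoning tokens as subsets of input and output is an assumption you should check against your provider's usage fields.

```python
from dataclasses import dataclass


@dataclass
class StepUsage:
    """Token counts for one model call inside a run (hypothetical shape)."""
    label: str
    input_tokens: int = 0
    cached_input_tokens: int = 0  # assumed to be a subset of input_tokens
    output_tokens: int = 0
    reasoning_tokens: int = 0     # assumed to be a subset of output_tokens


def total_usage(steps: list[StepUsage]) -> StepUsage:
    """Sum every step so the unit of analysis is the run, not the prompt."""
    total = StepUsage(label="run")
    for step in steps:
        total.input_tokens += step.input_tokens
        total.cached_input_tokens += step.cached_input_tokens
        total.output_tokens += step.output_tokens
        total.reasoning_tokens += step.reasoning_tokens
    return total


# Illustrative numbers only: one "Fix the checkout bug" run with three model calls.
run = [
    StepUsage("plan", input_tokens=4_000, output_tokens=300, reasoning_tokens=150),
    StepUsage("read files", input_tokens=18_000, cached_input_tokens=4_000, output_tokens=200),
    StepUsage("patch + test", input_tokens=22_000, cached_input_tokens=4_000, output_tokens=1_200),
]
summary = total_usage(run)
print(summary)
```

The one-line request turns into tens of thousands of input tokens once the steps are added up, which is the number worth watching.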

The same idea shows up in coding tools. Anthropic’s Claude Code cost docs point at model selection, codebase size, multiple instances, automation, stale context, and compaction as cost drivers. Their agent-loop docs go even lower-level: tool definitions consume context, MCP server schemas can add significant context to every request in some configurations, subagents have their own conversations, and lower effort levels can reduce token usage for routine work.

The bill is not a prompt receipt. It is a workflow trace with dollar signs.

That is good news, oddly enough. Traces can be improved.

The Five-Line Token Budget

When an AI workflow gets expensive, split the budget into five lines.

1. Admission
   Should this task use an agent at all?

2. Context
   What must the model see for this step?

3. Capability
   Which model, tools, and effort level are justified?

4. Loop
   How many turns, calls, retries, and handoffs happen before done?

5. Output
   How much generated text or code is actually useful?

Most teams try to optimize line two first. They trim prompts. They shorten instructions. They beg the model to be concise, which is sometimes like asking a logging framework to “just be normal.”

Prompt size matters. But it is only one line.

If a frontier model is doing file discovery that rg could do in 80 milliseconds, your admission line is leaking. If every turn exposes 90 tools, your capability line is leaking. If the agent retries the same failing approach because compaction erased the decision trail, your loop line is leaking. If the model writes a full design essay when the next system expects three JSON fields, your output line is leaking.

Token reduction starts to work when you stop treating tokens as text and start treating them as purchased attention.

Spend them where attention changes the outcome.

Admission: Do We Need An Agent Here?

The cheapest token is the one you do not ask a model to process because ordinary software already knows the answer.

That sounds obvious until you inspect real workflows. Agents often get handed tasks that are half uncertainty and half plumbing:

find files matching a pattern
list changed routes
parse package metadata
format JSON
sort errors by frequency
extract a field from logs
run the same check across many rows
turn a config object into a table

Some of those tasks may belong inside an agent workflow, but they do not all need model judgment. Use code, shell commands, database queries, static analysis, or typed validators for the mechanical parts. Save the model for the places where ambiguity actually matters.

A useful rule:

If the task has a deterministic verifier and no semantic ambiguity, try automation first.

For a coding assistant, that might mean using search to locate candidates before asking the model to reason about them. For a product agent, it might mean validating permissions and fetching the exact policy record before the model writes the response. For an eval pipeline, it might mean using simple rules to separate obvious cases from ambiguous ones, then sending only the hard slice to a stronger model.
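The "obvious cases first" idea can be sketched in a few lines. This is an illustration under assumptions, not a recommended rule set: the `KNOWN_ERRORS` patterns, the route names, and the sample log lines are all hypothetical.

```python
import re

# Hypothetical rule set: patterns a plain regex can classify with certainty.
KNOWN_ERRORS = {
    "timeout": re.compile(r"\b(timed out|timeout)\b", re.IGNORECASE),
    "auth": re.compile(r"\b(401|403|unauthorized|forbidden)\b", re.IGNORECASE),
}


def triage(log_line: str) -> tuple[str, str]:
    """Return (route, label). Only lines the rules cannot settle go to the model."""
    matches = [name for name, pattern in KNOWN_ERRORS.items() if pattern.search(log_line)]
    if len(matches) == 1:
        return ("rules", matches[0])
    # Zero matches or conflicting matches: real ambiguity, so pay for judgment.
    return ("model", "unclassified")


for line in [
    "Request timed out after 30s",
    "403 Forbidden on /api/cards",
    "checkout total looks wrong for saved cards",
]:
    print(triage(line), "<-", line)
```

Only the third line would reach a model. The first two never buy a token.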

This is not anti-agent. It is pro-agent-being-worth-it.

Agents are best when they coordinate uncertain work: interpreting intent, choosing among tradeoffs, synthesizing evidence, writing code that fits a local style, or explaining why a fix is risky. They are expensive forklifts for moving ambiguity. Do not use them as teaspoons.

Context: Reuse The Stable Stuff, Stop Carrying The Furniture

Context is where cost and reliability shake hands.

The published Coffee With Humans context-engineering piece already makes the reliability point: bigger context is not automatically better context. Here, the cost version is simpler:

Repeated context is rent.

If every request includes the same system prompt, coding standards, tool descriptions, policy manual, examples, and conversation history, you are paying repeatedly for the model to ingest familiar furniture. Sometimes that is necessary. Often it is lazy architecture wearing a nice prompt.

Prompt caching is one of the first levers to understand. OpenAI’s docs say prompt caching works automatically for prompts of at least 1024 tokens, exposes cached token counts in usage metadata, and is most useful when requests share an identical prefix. Anthropic’s prompt caching docs make the economics explicit in their own system: cache writes, cache reads, and uncached input are priced differently, with 5-minute and 1-hour cache options.

The engineering lesson is not “turn on caching” and go make coffee. It is:

Put stable context where it can stay stable.
Put volatile context where it cannot break the stable prefix.

Stable context:

system instructions
role definition
output contract
long-lived examples
durable policy text
tool descriptions that actually belong in the session
repository conventions

Volatile context:

the user’s new request
retrieved snippets
tool results
current error output
temporary scratch work
partial model output

If volatile content appears ahead of the stable content, or if you keep rewriting the stable instructions every turn, the shared prefix stops being identical and caching has less to reuse. Even without provider-specific cache controls, this mental model helps: separate the reusable setup from the per-turn evidence.
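One cheap way to keep yourself honest about the stable/volatile split is to fingerprint the stable part and watch whether it changes between turns. The sketch below assumes nothing about any provider's cache. `build_prompt`, the sample instructions, and the fingerprint length are illustrative choices.

```python
import hashlib


def build_prompt(stable_parts: list[str], volatile_parts: list[str]) -> tuple[str, str]:
    """Return (prompt, prefix_fingerprint) with stable content always first."""
    prefix = "\n\n".join(stable_parts)
    fingerprint = hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:12]
    prompt = prefix + "\n\n" + "\n\n".join(volatile_parts)
    return prompt, fingerprint


# Hypothetical stable setup shared by every turn.
STABLE = [
    "You are a code review assistant.",
    "Output contract: files changed, reason, tests run, remaining risk.",
]

_, turn1 = build_prompt(STABLE, ["User: fix the checkout bug."])
_, turn2 = build_prompt(STABLE, ["User: fix the checkout bug.", "Tool result: 2 tests failed."])
# A timestamp injected at the top makes the prefix different on every request.
_, drifted = build_prompt(["Current time: 09:41."] + STABLE, ["User: fix the checkout bug."])

print(turn1 == turn2)    # True: new evidence was appended, prefix untouched
print(turn1 == drifted)  # False: volatile content moved ahead of the stable setup
```

Logging that fingerprint next to cached-token counts makes it easier to spot the deploy that quietly broke the prefix.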

There is a second move: stop carrying old work as chat history when compact state would do.

Bad state transfer:

We investigated the bug. It seems related to checkout validation.

Better state transfer:

Checkout bug state:
- Repro: `npm test checkout.spec.ts`.
- Constraint: do not touch legacy subscription flow.
- Ruled out: tax calculation; mocked totals are correct.
- Current hypothesis: saved-card payload lacks `billingAddress.country`.
- Next files: `PaymentForm.tsx`, `checkoutSchema.ts`, `savedCards.fixture.ts`.

The second version is not “shorter” in a character-count contest. It is cheaper in the way that matters because it prevents rediscovery. A summary that erases evidence often buys you a smaller next prompt and a larger next loop.

Capability: Match The Tool To The Uncertainty

Model routing is where token budgets become product decisions.

Not every step deserves the strongest model. Not every step deserves long reasoning. Not every agent needs every tool. And yes, this is where things get spicy, because “use the best model for everything” is emotionally comfortable. It feels like buying the good umbrella before a storm.

But workflows have different uncertainty levels.

Low uncertainty:

list files
classify obvious rows
produce a short structured summary
rewrite text to match a fixed format
inspect logs for known error strings
make a small mechanical edit with clear tests

High uncertainty:

design a migration plan
debug a cross-module failure
choose between competing architecture options
review a security-sensitive patch
synthesize conflicting source material
decide what not to change

The first group can often use smaller models, lower effort settings, deterministic tooling, or batch processing. The second group may justify a stronger model because mistakes are expensive.
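A routing decision like this can live in a small table instead of in someone's head. The sketch below is hypothetical throughout: the step names, the `small` and `frontier` tier labels, and the effort labels are placeholders, not real model IDs or provider settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model_tier: str  # placeholder tier name, not a real model ID
    effort: str      # placeholder effort label
    batch_ok: bool   # True if the step can wait for an asynchronous job


# Hypothetical routing table keyed by step type.
ROUTES = {
    "classify_rows": Route("small", "low", batch_ok=True),
    "mechanical_edit": Route("small", "low", batch_ok=False),
    "cross_module_debug": Route("frontier", "high", batch_ok=False),
    "security_review": Route("frontier", "high", batch_ok=False),
}

# Unknown step types fall back to the strong tier: when in doubt, pay for judgment.
DEFAULT_ROUTE = Route("frontier", "high", batch_ok=False)


def route_for(step_type: str) -> Route:
    """Look up the cheapest route that is still justified for this step."""
    return ROUTES.get(step_type, DEFAULT_ROUTE)


print(route_for("classify_rows"))
print(route_for("something_new"))
```

Defaulting unknown work to the strong tier is a deliberate design choice here: a routing table should make cheap paths explicit, not make risky work cheap by accident.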

Anthropic’s agent-loop docs explicitly recommend lower effort for routine tasks. GitHub’s AI Credit docs now make a similar idea financially concrete: a quick interaction with a lightweight model and a long cloud-agent session with a frontier model do not consume the same budget because they do different amounts of work.

Tool scope belongs in this same bucket. Tool definitions are context. Tool results are context. MCP server schemas can be context. A huge tool belt is not free just because the agent does not call every tool.

Give agents the tools they need for the current job:

A documentation agent probably does not need deploy tools.
A read-only code explorer probably does not need write tools.
A formatting worker probably does not need database access.
A subagent investigating tests probably does not need the entire MCP directory.

Good tool scope saves tokens and reduces accidental behavior. That is a pleasant two-for-one, like finding out the cheaper coffee is also the one that does not taste like printer toner.
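Scoping can be as simple as an allowlist per role. In this sketch the tool registry, the role names, and the schemas are invented for illustration, and serialized character count is only a crude stand-in for tokens. Real counts depend on the provider's tokenizer and on how it renders tool definitions.

```python
import json

# Hypothetical tool registry: tool name -> schema the model would see.
TOOLS = {
    "read_file": {"description": "Read a file from the repository.", "parameters": {"path": "string"}},
    "write_file": {"description": "Write a file to the repository.", "parameters": {"path": "string", "content": "string"}},
    "run_tests": {"description": "Run a test command.", "parameters": {"command": "string"}},
    "deploy": {"description": "Deploy a service.", "parameters": {"service": "string", "env": "string"}},
    "query_db": {"description": "Run a read-only SQL query.", "parameters": {"sql": "string"}},
}

# Hypothetical allowlists: each role sees only what its job needs.
ROLE_TOOLS = {
    "code_explorer": {"read_file"},
    "test_investigator": {"read_file", "run_tests"},
}


def tools_for(role: str) -> dict:
    """Return only the tool schemas this role is allowed to see."""
    allowed = ROLE_TOOLS.get(role, set())
    return {name: spec for name, spec in TOOLS.items() if name in allowed}


def schema_chars(tools: dict) -> int:
    """Crude size proxy: characters of serialized schema, not real tokens."""
    return len(json.dumps(tools))


print(schema_chars(TOOLS), schema_chars(tools_for("test_investigator")))
```

A role with no allowlist entry gets no tools at all, which is the safer failure mode for both the budget and the blast radius.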

Loop: Batch Cold Work, Keep Hot Work Small

Some AI work needs to happen while a user waits.

Most does not.

This distinction is one of the cleanest cost levers. OpenAI’s Batch API docs describe asynchronous jobs with 50% lower costs and results returned within a 24-hour window, aimed at non-immediate work such as evals, classification, embeddings, and offline processing. Anthropic’s pricing docs also describe a 50% Batch API discount on input and output tokens.

That is not a minor optimization. It is an architecture hint.

Hot path:

user is waiting
interactive coding session
incident response
live support answer
approval-gated agent step

Cold path:

nightly evals
dataset classification
embedding refreshes
document extraction
issue deduplication
offline codebase analysis
report generation

If you run cold work through hot-path agents, you pay for immediacy you do not need. If you run hot work through batch queues, users develop the calm patience of someone watching a progress bar named “maybe today.”

Separate them.

For example, a coding team might keep the interactive agent focused on the current bug while a batch job precomputes repository maps, flaky-test clusters, or issue summaries overnight. A support product might use live model calls for the user’s actual question but batch policy-index enrichment and historical ticket classification. An eval system should almost never be a human sitting there firing synchronous calls row by row like a Victorian factory scene with JSON.
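The arithmetic is worth doing once for a real job. The sketch below uses placeholder prices and an invented eval size. Only the 50% batch discount comes from the provider docs cited above, so substitute your own model's rates before drawing conclusions.

```python
def run_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
    batch_discount: float = 0.0,
) -> float:
    """Cost in dollars for one job, given per-million-token prices."""
    base = (input_tokens / 1_000_000) * input_price_per_m
    base += (output_tokens / 1_000_000) * output_price_per_m
    return base * (1.0 - batch_discount)


# Placeholder prices per million tokens, not any provider's real rates.
INPUT_PRICE, OUTPUT_PRICE = 2.00, 8.00

# Hypothetical nightly eval: 10,000 rows, 1,500 input and 100 output tokens each.
rows = 10_000
input_tokens = rows * 1_500
output_tokens = rows * 100

live = run_cost(input_tokens, output_tokens, INPUT_PRICE, OUTPUT_PRICE)
batched = run_cost(input_tokens, output_tokens, INPUT_PRICE, OUTPUT_PRICE, batch_discount=0.5)
print(f"live: ${live:.2f}  batched: ${batched:.2f}")
```

Nothing about the prompt changed between the two numbers. Only the path did.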

The other loop leak is retries.

Retries are not bad. Blind retries are.

If the model fails validation, the next turn should receive compact, useful feedback:

Validation failed:
- `status` must be one of: accepted, rejected, needs_review.
- Received: "probably fine".
- Keep the rationale under 40 words.

That is a good token purchase. It changes the next attempt. A 900-token apology plus a vague “try again” does not.
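Feedback like the example above can be generated by ordinary code rather than by another model turn. This is a minimal sketch: the field names `status` and `rationale` and the 40-word limit mirror the example, while the function name and the dict-shaped result are assumptions.

```python
from typing import Optional

ALLOWED_STATUS = ("accepted", "rejected", "needs_review")
RATIONALE_WORD_LIMIT = 40  # "under 40 words" means 39 or fewer


def validation_feedback(result: dict) -> Optional[str]:
    """Return compact retry feedback, or None if the result is valid."""
    problems = []
    status = result.get("status")
    if status not in ALLOWED_STATUS:
        problems.append(f"- `status` must be one of: {', '.join(ALLOWED_STATUS)}.")
        problems.append(f"- Received: {status!r}.")
    words = len(str(result.get("rationale", "")).split())
    if words >= RATIONALE_WORD_LIMIT:
        problems.append(f"- Keep the rationale under {RATIONALE_WORD_LIMIT} words (got {words}).")
    if not problems:
        return None
    return "Validation failed:\n" + "\n".join(problems)


print(validation_feedback({"status": "probably fine", "rationale": "Looks OK to me."}))
```

A `None` return means there is nothing to retry, so the loop can stop instead of spending another turn on reassurance.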

Output: Shorter Answers Are Not Just Faster

Output tokens are easy to ignore because they feel like the model’s problem.

They are your bill too.

They also dominate latency in many workflows. OpenAI’s latency guidance has long made the practical point that generating fewer tokens usually means faster responses. The cost version is even more direct: output tokens are often priced higher than input tokens, and verbose outputs can become future input if your application keeps conversation history.
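The "future input" effect compounds faster than it looks. Here is a rough sketch, assuming the application resends the full conversation history on every turn and ignoring any caching discount. The turn count and output sizes are illustrative.

```python
def reread_output_tokens(output_tokens_per_turn: int, turns: int) -> int:
    """Input tokens spent rereading earlier outputs when full history is resent."""
    # The output from turn k is resent as input on each of turns k+1..turns.
    return sum(output_tokens_per_turn * (turns - k) for k in range(1, turns + 1))


verbose = reread_output_tokens(output_tokens_per_turn=800, turns=10)
compact = reread_output_tokens(output_tokens_per_turn=150, turns=10)
print(verbose, compact)  # 36000 6750
```

In this toy case the verbose agent generates 8,000 output tokens over ten turns and then pays for another 36,000 input tokens just to reread them.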

So ask for the output you need.

Not:

Explain everything you did in detail.

Better:

Return:
- files changed
- why the change fixes the failure
- tests run
- remaining risk

Keep each bullet under 25 words.

For product agents, use structured output when another system consumes the result. For coding agents, ask for a reviewer-oriented summary instead of a travel diary. For eval classifiers, store compact labels and reserve explanations for sampled disagreements or low-confidence cases.

This is not about making agents terse for aesthetic reasons. It is about preventing output from becoming expensive exhaust.

A good output has a job. When it finishes the job, it stops.

Barista’s Tip: Run A Token-Budget Review

Before you panic-upgrade, rage-downgrade, or paste “be concise” into every prompt like a warding charm, review the workflow.

1. Measure The Run
   Are you logging full-run usage, including tool calls, handoffs,
   cached tokens, reasoning tokens, retries, and output?

2. Name The Expensive Path
   Which task, user action, agent, model, tool, or retry pattern
   creates the most spend?

3. Check Admission
   Which parts can be deterministic code, search, validation,
   static analysis, SQL, or a smaller model?

4. Split Hot And Cold
   Which work can move to batch, queues, nightly jobs, or
   precomputation without hurting user experience?

5. Stabilize The Prefix
   Are system instructions, examples, durable policies, and
   repository conventions arranged so caching can help?

6. Trim Repeated Context
   What gets resent every turn even though compact state would
   preserve the useful decision trail?

7. Scope Tools
   Which tool schemas, MCP servers, or capabilities are visible
   even when this task cannot use them?

8. Route Capability
   Where can lower effort, smaller models, or specialized workers
   handle routine steps?

9. Tighten Output
   Does the model produce exactly what the next human or system
   needs, or does it generate future input clutter?

10. Protect Quality
   What eval, test, validator, review, or rollback signal tells you
   the cheaper workflow still works?

That last step matters. Reducing token usage is not a spreadsheet sport. If you save 40% on tokens and double the defect rate, congratulations, you have invented a discount incident.

Use tests, evals, traces, and human review samples. Measure quality next to cost. The goal is not minimal tokens. The goal is useful work per token.
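"Useful work per token" can be made into a number you actually track. The sketch below is one possible framing, not a standard metric: the `WorkflowRun` fields, the two-point tolerance on pass rate, and the sample figures are all assumptions, and "passed" means whatever verifier your team already trusts.

```python
from dataclasses import dataclass


@dataclass
class WorkflowRun:
    name: str
    total_tokens: int
    tasks_attempted: int
    tasks_passed: int  # passed your own verifier: tests, evals, or review


def pass_rate(run: WorkflowRun) -> float:
    return run.tasks_passed / run.tasks_attempted


def passed_per_million_tokens(run: WorkflowRun) -> float:
    return run.tasks_passed / (run.total_tokens / 1_000_000)


def is_safe_saving(baseline: WorkflowRun, candidate: WorkflowRun,
                   max_pass_rate_drop: float = 0.02) -> bool:
    """Accept a cheaper workflow only if quality stays within tolerance."""
    cheaper = candidate.total_tokens < baseline.total_tokens
    quality_held = pass_rate(candidate) >= pass_rate(baseline) - max_pass_rate_drop
    return cheaper and quality_held


baseline = WorkflowRun("current", total_tokens=10_000_000, tasks_attempted=200, tasks_passed=180)
scoped = WorkflowRun("scoped tools", total_tokens=6_000_000, tasks_attempted=200, tasks_passed=178)
starved = WorkflowRun("aggressive trim", total_tokens=6_000_000, tasks_attempted=200, tasks_passed=160)

print(is_safe_saving(baseline, scoped))   # True: 40% fewer tokens, pass rate 0.90 -> 0.89
print(is_safe_saving(baseline, starved))  # False: same saving, pass rate 0.90 -> 0.80
```

Both candidates save the same 40% on tokens. Only one of them is a saving.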

What To Change This Week

Start small.

Pick one expensive workflow, not the entire AI strategy. A coding agent loop, a support answer flow, an eval run, a document extraction pipeline, or a nightly issue triage job.

Then make three changes:

1. Log full-run usage.
   Include input, output, cached input, reasoning tokens where available, model, tool calls, retries, and final outcome. You cannot optimize a fog machine.

2. Move one chunk of work out of the agent.
   Use deterministic code, search, a smaller model, batch processing, or precomputed state. Choose the most boring candidate. Boring is where savings like to hide.

3. Rewrite one output contract.
   Replace open-ended explanation with the smallest useful structure. For example: decision, evidence, risk, next action.
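An output contract is easier to enforce when it exists as a type. The sketch below takes the four fields from the example (decision, evidence, risk, next action) and assumes the agent returns them as a JSON object of strings. The class name, the function name, and the strict exact-keys rule are illustrative choices.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AgentResult:
    decision: str
    evidence: str
    risk: str
    next_action: str


def parse_result(raw: str) -> AgentResult:
    """Parse the agent's JSON output, rejecting missing or extra fields."""
    data = json.loads(raw)
    expected = {f.name for f in fields(AgentResult)}
    if not isinstance(data, dict) or set(data) != expected:
        raise ValueError(f"expected exactly these keys: {sorted(expected)}")
    if not all(isinstance(value, str) for value in data.values()):
        raise ValueError("every field must be a string")
    return AgentResult(**data)


raw = json.dumps({
    "decision": "merge",
    "evidence": "checkout.spec.ts passes after adding billingAddress.country",
    "risk": "saved cards created before the fix are untested",
    "next_action": "add a fixture for legacy saved cards",
})
print(parse_result(raw))
```

Rejecting extra keys is the part that keeps the essay out: anything the contract does not name never reaches the next system or the next prompt.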

Those three changes teach you more than an afternoon of abstract cost anxiety.

Once you can see where tokens go, you can make better tradeoffs: when to spend on a stronger model, when to cache, when to batch, when to isolate work, when to clear stale context, and when to stop asking the agent to do a shell script’s job while dressed as a senior engineer.

The Last Sip

Token usage is not a moral failing. Agents are useful because they can do real work, and real work costs compute.

But pricing changes are making the old comfortable blur less comfortable. The systems that survive this shift will not be the ones that chant “shorter prompts” the loudest. They will be the ones that understand their workflows.

Spend tokens where they reduce uncertainty.

Spend them on the hard judgment, the ambiguous bug, the synthesis, the tradeoff, the careful patch, the explanation a human actually needs. Do not spend them rereading stale context, exposing unused tools, producing ceremonial prose, retrying without new evidence, or running live agents for work that could have slept peacefully in a batch queue.

The token window is an interface. The token bill is feedback.

Design the workflow so both are telling you something useful.

Sources On The Counter

GitHub’s Copilot usage-based billing announcement is the timely signal for why token consumption is becoming visible in everyday agentic coding workflows.
GitHub’s usage-based billing docs and model pricing page explain AI Credits, token categories, and model-based pricing for Copilot.
OpenAI’s prompt caching docs are useful for designing stable repeated prefixes and monitoring cached token counts.
OpenAI’s Batch API docs explain the 50% lower-cost asynchronous path for non-immediate work.
OpenAI’s latency optimization guide supports the practical advice to generate fewer tokens, use fewer input tokens where it matters, make fewer requests, and avoid defaulting to an LLM for deterministic work.
OpenAI Agents SDK usage docs show why agent cost should be measured across the full run, including tool calls and handoffs.
Anthropic’s pricing and prompt caching docs are useful references for cache writes, cache reads, output costs, batch discounts, and model-tier tradeoffs.
Claude Code cost guidance and Anthropic’s agent-loop docs are practical references for stale context, compaction, scoped tools, MCP schema costs, subagents, and lower-effort routine work.