Coffee With Humans

Context Engineering Is Working Memory Design

Context engineering is not prompt polish. It is the discipline of giving AI systems the right working set at the right moment, then keeping that context fresh, small, and verifiable.

Most AI failures do not arrive wearing a little sign that says “insufficient context.”

They arrive as something more familiar. The coding agent changes the wrong file. The support bot quotes an outdated policy. The research agent buries the answer under twenty pages of “helpful” background. The internal assistant has the right document somewhere in its retrieval system, but somehow answers like it learned billing from a fortune cookie.

The tempting response is to add more.

More instructions. More examples. More documents. More tools. More chat history. Bigger context window. Bigger model. Bigger sigh.

Sometimes that helps. Often it just turns the model’s workspace into the kind of desk where you can find three old coffee mugs, no keyboard, and a tax receipt from 2019.

That is why “context engineering” has become a useful term. Not because AI needed another phrase to put on conference slides. We had enough of those. It is useful because it points at a real engineering problem:

The model can only act on the working set you give it.

Context engineering is the discipline of designing that working set.

Prompting Is The Note. Context Is The Room.

Prompt engineering is still useful. Clear instructions matter. Good examples matter. Output formats matter. Naming the task instead of waving vaguely at the model and hoping for jazz hands absolutely matters.

But agents and serious AI applications are no longer just one prompt and one answer. They read files, call tools, retrieve documents, accumulate logs, remember preferences, write scratch notes, hand work to subagents, and run checks. Every one of those things can become part of the model’s context.

Anthropic describes context engineering as the broader work of curating and maintaining the useful tokens available during inference: not only the prompt, but also tools, external data, message history, MCP-connected systems, and the state that builds up while an agent works. LangChain’s docs make the same idea feel more mechanical: engineers control model context, tool context, and lifecycle context as the agent loops between model calls and tool execution.

In plain engineer:

Prompt engineering asks: "What should I say to the model?"
Context engineering asks: "What should the model be able to see, use, remember, and verify right now?"

That second question is larger, and much more interesting.

It is also where many production failures hide.

The Workbench, Not The Warehouse

A useful mental model is to split the world into four places.

Warehouse: everything the system could know.
Workbench: what the model sees for this step.
Ledger: what must persist after this step.
Gauge: what tells the system whether the step worked.

The warehouse is huge. It includes your docs, source code, tickets, database rows, runbooks, logs, user history, API schemas, previous agent traces, Slack lore, and that one markdown file named final-final-really-final-v3.md.

The workbench is small. It is the context window for the current model call: the instructions, selected messages, retrieved snippets, tool definitions, relevant state, and recent observations the model actually receives.

The ledger is durable state. It holds things the system should not have to rediscover: user preferences, decisions already made, failed approaches, summaries, constraints, checkpoints, and task progress.

The gauge is feedback. Tests, build output, screenshots, traces, eval scores, tool errors, structured validation, and human approvals all tell the agent whether its last move helped.

Most context mistakes are category mistakes. We put the warehouse on the workbench. We leave the ledger blank and hope the model remembers. We forget the gauge, so the model has no signal except “this answer feels plausible.” Then we blame the model for acting like someone trying to repair a laptop under a pile of unopened mail.

Context engineering is the habit of asking which place each piece of information belongs.
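
As a rough illustration, the four places can be modeled as fields on a single step object. This is a minimal sketch, not a framework: the `Step` class, its field names, and the `[ledger]` / `[gauge]` prefixes are all made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One agent step, split into the four places described above (hypothetical)."""

    warehouse: dict[str, str]                             # everything the system could know (id -> text)
    workbench: list[str] = field(default_factory=list)    # ids selected for this model call
    ledger: dict[str, str] = field(default_factory=dict)  # durable facts that outlive the step
    gauge: list[str] = field(default_factory=list)        # feedback from checks and tools

    def render_workbench(self) -> str:
        """Build the model-visible text: ledger notes, selected items, then feedback."""
        notes = [f"[ledger] {key}: {value}" for key, value in self.ledger.items()]
        docs = [f"[{doc_id}] {self.warehouse[doc_id]}" for doc_id in self.workbench]
        checks = [f"[gauge] {signal}" for signal in self.gauge]
        return "\n".join(notes + docs + checks)
```

The useful property is that the warehouse never reaches the model by default. Only ids placed on the workbench get rendered.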

Bigger Windows Do Not Make Context Free

Long context windows are genuinely useful. They let models inspect more code, read more documents, carry longer conversations, and solve tasks that used to require awkward chunking.

They do not make context free.

OpenAI’s prompting guidance points out the basic constraint: models have finite context windows, and relevant outside information can be added through techniques like RAG or file search. The word to notice is “relevant.” A bigger window gives you more room. It does not guarantee that every token inside the room receives the attention you wish it did.

Chroma’s 2025 context-rot report is a helpful caution here. In controlled tests across 18 models, they found that performance can become less reliable as input length grows, especially when ambiguity, distractors, or full conversation histories force the model to retrieve and reason at the same time. The practical lesson is not “never use long context.” That would be silly. The lesson is that long context is not a garbage disposal.

If a model needs one failing test, the recent diff, the component under test, and the relevant schema, giving it the whole repository plus six months of chat history may make the task harder, not easier. You have turned one debugging problem into two problems:

  1. Find the relevant evidence.
  2. Use the evidence correctly.

Sometimes the model can do both. Sometimes it can, until Friday afternoon, when the logs contain three convincing distractors and everyone is tired.

Better context engineering reduces the amount of retrieval the model has to do inside the model call. It puts the right thing on the bench before asking for careful work.

The Four Moves: Write, Select, Compress, Isolate

LangChain’s context-engineering overview groups the work into four useful moves: write, select, compress, and isolate. That is a good enough toolbox to keep by the espresso machine.

Write Down What Should Survive

Agents forget because we let important information live only in the conversation stream.

If the user says, “Use the new checkout flow, but do not touch the legacy subscription path,” that constraint should probably become state. If an agent investigates three theories and rules two out, those dead ends should be recorded. If a research agent finds that one source is outdated, that fact should not disappear after compaction.

Writing context means persisting useful information outside the immediate context window. This could be a scratchpad, task plan, memory store, issue comment, trace annotation, local file, database record, or structured state field.

The trick is not to write everything. That just moves the mess to a different closet.

Write the things that change future decisions:

  • goals
  • constraints
  • user preferences
  • accepted assumptions
  • rejected approaches
  • evidence found
  • open questions
  • checkpoints reached
  • verification results

For coding agents, this is why a living plan can be more valuable than a long chat. For support agents, it is why the user’s region, tier, and policy version matter more than the entire transcript. For operations agents, it is why “we already checked DNS and it passed” deserves to survive.
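
One way to make "write" concrete is a small ledger that only accepts entry kinds that change future decisions. The sketch below is an assumption-heavy illustration: the `Ledger` class, the `KINDS` names, and the JSON layout are invented here, and the serialized string could live in a file, a database row, or an issue comment.

```python
import json
from dataclasses import dataclass, field

# Hypothetical entry kinds, mirroring the list above.
KINDS = {
    "goal", "constraint", "preference", "assumption", "rejected",
    "evidence", "open_question", "checkpoint", "verification",
}


@dataclass
class Ledger:
    entries: list[dict] = field(default_factory=list)

    def write(self, kind: str, note: str) -> None:
        """Record a note, rejecting kinds that are not in the agreed set."""
        if kind not in KINDS:
            raise ValueError(f"unknown ledger kind: {kind}")
        self.entries.append({"kind": kind, "note": note})

    def of_kind(self, kind: str) -> list[str]:
        """Return the notes recorded under one kind, in insertion order."""
        return [entry["note"] for entry in self.entries if entry["kind"] == kind]

    def dumps(self) -> str:
        """Serialize to JSON so the ledger can be stored outside the context window."""
        return json.dumps(self.entries, indent=2)

    @classmethod
    def loads(cls, raw: str) -> "Ledger":
        """Rebuild a ledger from its JSON form, for example after compaction."""
        return cls(entries=json.loads(raw))
```

Restricting the kinds is a design choice, not a requirement. It is one cheap way to keep the ledger from becoming a second transcript.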

Select What Helps The Next Decision

Selection is retrieval with taste.

RAG is one version of selection: find relevant documents and include them in the model request. But context selection is broader than vector search. It includes choosing which tools to expose, which prior messages to keep, which files to read, which examples to include, which state fields to surface, and which output format to require.

A good selection question is:

What does the model need to decide the next useful action?

Not:

What might possibly be related if we squint heroically?

Imagine a coding assistant fixing a failing checkout test. Weak context selection gives it the whole repo, every failing log line, package metadata, old chat, and a vague “please fix.” Better context selection gives it:

  • the failing test and assertion
  • the command that reproduces it
  • the recent diff
  • the checkout component or route
  • the relevant schema or API contract
  • project conventions for similar tests
  • a clear permission boundary

That is not smaller because small is morally pure. It is smaller because it is closer to the task.
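
Here is a minimal sketch of selection under a budget, assuming each candidate already carries a priority for the next decision. The `Candidate` type, the priority numbers, and the word-count token estimate are illustrative stand-ins. A real system would use its own retrieval scores and a real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    text: str
    priority: int  # hypothetical: lower means more essential to the next decision


def estimate_tokens(text: str) -> int:
    """Rough stand-in for a tokenizer: count whitespace-separated words."""
    return len(text.split())


def select(candidates: list[Candidate], budget: int) -> list[Candidate]:
    """Greedily keep the most essential candidates that still fit the budget."""
    chosen: list[Candidate] = []
    used = 0
    for candidate in sorted(candidates, key=lambda c: c.priority):
        cost = estimate_tokens(candidate.text)
        if used + cost <= budget:
            chosen.append(candidate)
            used += cost
    return chosen
```

With a small budget, the failing test and the recent diff make it onto the bench and the whole-repo dump does not, which is the behavior the list above describes.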

Selection also applies to tools. Anthropic’s tool-writing guidance is useful here: tools are not just backend functions, they are part of the interface between deterministic systems and non-deterministic agents. A tool name, schema, description, and response all become model-visible context. A giant ambiguous tool list is context clutter. A tiny tool list that hides necessary capability is context starvation.

The goal is not “more tools” or “fewer tools.” The goal is the right affordances, visible at the right moment, with responses that tell the model what changed and what to do next.
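
To show what "responses that tell the model what changed and what to do next" might look like, here is a hedged sketch of a tool that returns structured failures. The `lookup_order` tool, the `ORDERS` data, and every field name in the response are hypothetical. They are not taken from any real tool API.

```python
# Hypothetical in-memory data standing in for a real order system.
ORDERS = {"ord_123": {"status": "shipped"}}


def lookup_order(args: dict) -> dict:
    """Hypothetical tool: look up an order and explain failures in model-readable terms."""
    order_id = args.get("order_id")
    if not isinstance(order_id, str) or not order_id.startswith("ord_"):
        return {
            "ok": False,
            "error": "invalid_order_id",
            "received": args,
            "expected": {"order_id": "string starting with 'ord_'"},
            "next_step": "Ask the user for the order id, then call lookup_order again.",
        }
    order = ORDERS.get(order_id)
    if order is None:
        return {
            "ok": False,
            "error": "order_not_found",
            "received": args,
            "next_step": "Confirm the order id with the user before retrying.",
        }
    return {"ok": True, "order": order}
```

The failure branches cost a few more tokens than a bare failure flag, but they hand the model the expected input shape and a next action instead of leaving it to guess.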

Compress Without Losing The Plot

Compression is where context engineering gets dangerous in a quiet way.

Summaries feel responsible. They reduce tokens. They make long tasks continue. They also have a talent for preserving the conclusion while losing the evidence.

Bad compression says:

Investigated checkout issue. Problem likely in payment validation.

Better compression says:

Checkout investigation:
- Reproduced with `npm test checkout.spec.ts`.
- Legacy subscription path is out of scope per user.
- Ruled out tax calculation: mocked totals match expected values.
- Current hypothesis: payment validation rejects saved cards missing `billingAddress.country`.
- Next useful files: `PaymentForm.tsx`, `checkoutSchema.ts`, `savedCards.fixture.ts`.

The second version is longer, but it is much more compact in the way that matters. It preserves decisions, constraints, evidence, and next actions.

Cognition’s long-running agent guidance makes this point from another angle: actions carry implicit decisions. If you compress away the decisions behind the actions, later work can become inconsistent. The agent remembers that something happened, but not why it happened. That is how you get a beautiful refactor that violates the one constraint the user cared about.

Compress history into a useful state transfer, not a vibes-based recap.
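
A state transfer can be given a shape so the evidence has somewhere to live. The sketch below assumes a hypothetical `StateTransfer` record whose fields mirror the better summary above. The field names are illustrative, and the example values are copied from that summary.

```python
from dataclasses import dataclass, field


@dataclass
class StateTransfer:
    """Hypothetical summary record that keeps decisions and evidence, not just conclusions."""

    topic: str
    reproduce: str
    constraints: list[str] = field(default_factory=list)
    ruled_out: list[str] = field(default_factory=list)
    hypothesis: str = ""
    next_files: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the record as the compact text that goes back into context."""
        lines = [f"{self.topic}:", f"- Reproduced with `{self.reproduce}`."]
        lines += [f"- Constraint: {item}" for item in self.constraints]
        lines += [f"- Ruled out: {item}" for item in self.ruled_out]
        if self.hypothesis:
            lines.append(f"- Current hypothesis: {self.hypothesis}")
        if self.next_files:
            lines.append("- Next useful files: " + ", ".join(self.next_files))
        return "\n".join(lines)


summary = StateTransfer(
    topic="Checkout investigation",
    reproduce="npm test checkout.spec.ts",
    constraints=["Legacy subscription path is out of scope per user."],
    ruled_out=["tax calculation: mocked totals match expected values."],
    hypothesis="payment validation rejects saved cards missing `billingAddress.country`.",
    next_files=["PaymentForm.tsx", "checkoutSchema.ts", "savedCards.fixture.ts"],
)
```

Whatever produces the summary, whether a model call or a human, has to fill in the named slots. An empty `ruled_out` list is at least a visible gap rather than a silent one.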

Isolate Noise, Preserve Handoffs

Isolation means keeping some work out of the main context so it does not flood the model.

Subagents are one form. Sandboxed code execution is another. Filesystems, state objects, scratchpads, and specialized tools can all isolate token-heavy or messy work. A research system might let separate workers investigate subtopics, then pass only cleaned findings to a lead synthesizer. A coding agent might inspect large files through targeted search and head/tail style commands instead of pasting everything into one call.

Isolation is powerful because not every detail deserves to sit in the main thread.

But isolation has a cost: coordination.

Cognition warns that naive multi-agent splitting can lose nuance because subagents may make incompatible assumptions. Microsoft’s Azure SRE Agent writeup tells a similar production-flavored story: many specialized subagents and handoffs looked elegant, then created discovery problems, prompt fragility, loops, and tunnel vision as the system scaled.

So isolation needs a handoff contract.

If one agent, tool, or environment does work outside the main context, the handoff should preserve:

  • what it was asked to do
  • what it inspected
  • what it found
  • what it ruled out
  • what assumptions it made
  • what evidence supports the result
  • what should happen next

Clean outputs are good. Clean outputs without provenance are just confident rumors in a nice JSON jacket.
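
The handoff list above can be turned into a small contract check. This is a sketch under stated assumptions: the `Handoff` fields and the `REQUIRED` set are invented for illustration, and treating `ruled_out` and `assumptions` as optional is a judgment call a stricter team might reverse.

```python
from dataclasses import dataclass, field


@dataclass
class Handoff:
    """Hypothetical handoff record with one field per item in the list above."""

    task: str = ""
    inspected: list[str] = field(default_factory=list)
    findings: list[str] = field(default_factory=list)
    ruled_out: list[str] = field(default_factory=list)
    assumptions: list[str] = field(default_factory=list)
    evidence: list[str] = field(default_factory=list)
    next_step: str = ""


# Assumption: these fields must be non-empty; the others may legitimately be empty.
REQUIRED = ("task", "inspected", "findings", "evidence", "next_step")


def missing_fields(handoff: Handoff) -> list[str]:
    """Return the required fields that were left empty, in declaration order."""
    return [name for name in REQUIRED if not getattr(handoff, name)]
```

A lead agent could refuse to merge a result while `missing_fields` is non-empty, which catches findings that arrive without provenance.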

Verification Is Context Too

One of the most underappreciated forms of context is feedback.

Anthropic’s Claude Code best practices make this concrete for coding agents: give the agent a check it can run. A test suite, build command, linter, screenshot comparison, or fixture diff gives the model a signal it can read and react to.

Without verification, the model has to infer whether it is done from the shape of its own answer. That is not a workflow. That is a mirror with a progress bar.

With verification, the next model call receives better context:

Test failed:
Expected saved card billing country to default to account region.
Received undefined in PaymentPayload.

Now the model has a grounded observation. It can inspect the relevant schema, patch the transform, rerun the test, and continue. The verification result becomes part of the context loop.
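
The loop above can be sketched with the standard library alone. The helper names `run_check` and `as_observation`, the result keys, and the 2000-character truncation are assumptions made for this example. The demo command is a stand-in that always fails, not a real test suite.

```python
import subprocess
import sys


def run_check(command: list[str], timeout_s: float = 60.0) -> dict:
    """Run a verification command and capture what the next model call should see."""
    try:
        proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return {"passed": False, "exit_code": None, "output": f"timed out after {timeout_s}s"}
    output = (proc.stdout + proc.stderr).strip()
    # Keep only the tail so one noisy check cannot flood the workbench.
    return {"passed": proc.returncode == 0, "exit_code": proc.returncode, "output": output[-2000:]}


def as_observation(result: dict) -> str:
    """Format a check result as a short, grounded observation."""
    status = "passed" if result["passed"] else "failed"
    return f"Check {status} (exit code {result['exit_code']}):\n{result['output']}"


# Stand-in check: prints a failure message and exits non-zero.
demo = run_check(
    [sys.executable, "-c", "print('Received undefined in PaymentPayload'); raise SystemExit(1)"]
)
```

In a real workflow the command would be the project's own test, build, or lint command, and the observation would be appended to the next model call.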

This matters beyond coding. A support agent can validate that the cited policy version matches the user’s region. A data agent can run a query and inspect row counts. A document agent can render output and compare it to layout expectations. An operations agent can check whether an alert cleared before declaring victory.

Good gauges turn context engineering from “what should we tell the model?” into “what can the model learn from the world?”

Barista’s Tip: The Context Audit

When an AI workflow fails, run this audit before reaching for a larger model or a larger context window.

1. Need
   What exact decision or action does the model need to make next?

2. Missing
   Is the required information absent from the system entirely?

3. Retrieval
   Does the system retrieve the needed context, or a noisy cousin of it?

4. Size
   Is the model receiving far more context than the decision requires?

5. Conflict
   Are there stale, contradictory, or distracting instructions in the window?

6. State
   What should be written to durable memory instead of carried in chat?

7. Compression
   Did summaries preserve decisions, constraints, evidence, and open questions?

8. Isolation
   Is subwork isolated enough to stay focused, but documented enough to hand off?

9. Tools
   Are tool names, schemas, descriptions, and outputs clear to the model?

10. Verification
   What concrete signal tells the model whether its last move worked?

The point is not to worship minimal context. Minimal context can be just as bad as maximal context if it omits the one fact that matters. The point is fit.

Useful context is a small-enough superset of what the model needs.

The Last Sip

Context engineering sounds fancy, but the practical version is humble. It is the work of keeping the model’s desk clean enough to think and stocked enough to act.

Do not paste the warehouse into the prompt and call it rigor. Do not compress a decision trail into soup. Do not split work across agents and then act surprised when nobody knows why the dragon is in the database migration. Do not give an agent a tool response that says `success: false` when it could say exactly what failed and what input shape it expected.

The context window is not just a limit. It is an interface.

Design it the way you would design any important interface: with clear inputs, relevant state, useful affordances, careful boundaries, and feedback that helps the next step get better.

Better context will not make weak models magical. But it will stop making capable models solve your task while blindfolded in a storage unit.

Sources On The Counter