
Why Your AI Agent Needs an Engineer's Playbook, Not a Bigger Context Window
The most effective AI agents don't succeed because they're smarter. They succeed because someone gave them the same working habits that make human engineers reliable.
There's a seductive idea in the AI industry right now: if we just make the context window bigger, agents will finally work.
More memory. More tokens. More room to think. Surely that's the bottleneck — if the agent could just remember everything, it would stop making the same mistakes, stop losing track of what it was doing, stop declaring the job done when it's only half-finished.
It's a reasonable intuition. It's also wrong.
Anthropic's engineering team recently published a fascinating piece on what actually makes long-running AI agents effective. Their finding wasn't about model capability or context length. It was about engineering discipline — giving agents the same working habits that make human software engineers reliable. Progress logs. Clean handoffs. Incremental commits. End-to-end testing before moving on.
The punchline: the fix for unreliable agents wasn't a better model. It was a better process.
We think this is the most important insight in AI engineering right now. And it maps directly to something we've been arguing for years: the gap between an impressive demo and a reliable production system is almost entirely filled by infrastructure.
What Actually Goes Wrong
When you ask a capable model — even a frontier model — to build something complex across multiple sessions, two failure modes dominate.
The one-shot trap. The agent tries to do everything at once. It charges ahead, implements six features simultaneously, runs out of context mid-way through, and leaves the next session with a half-built mess. The incoming agent has to guess at what happened and spends its entire window cleaning up instead of building.
The premature victory lap. After a few sessions of real progress, a fresh agent looks around, sees code that appears functional, and concludes the project is finished. Features that were outlined but never implemented quietly disappear from the plan. The agent marks the task complete and moves on.
If you've managed human engineers, both of these should sound familiar. The junior developer who tries to refactor an entire codebase in one pull request. The contractor who delivers a project that looks complete until you test the edge cases.
The solution for humans is well-understood: clear handoff documentation, incremental progress, code review, and testing before declaring victory. What Anthropic demonstrated is that the exact same solution works for AI agents.
The Human Engineering Analogy
Imagine a software project staffed by engineers working in shifts. Every eight hours, a new developer sits down at the keyboard. They've never seen this codebase before. They don't know what the previous developer was working on, what's broken, what's been tested, or what the priorities are.
This is exactly how most AI agents operate across context windows. Each new session is a fresh start with no memory of what came before.
Now imagine the same shift-based project, but with a proper engineering process:
- A progress log that every developer reads at the start of their shift and updates at the end
- A feature list with clear priorities and completion states
- Git history with descriptive commits so the incoming developer can trace what changed and why
- A startup checklist: run the dev server, verify existing features work, then — and only then — start on something new
- End-to-end tests that must pass before any feature is marked complete
The second team would outperform the first dramatically. Not because the developers are better. Because the process prevents the most common failure modes.
This is exactly what Anthropic built. Their "effective harness" isn't clever prompt engineering or a novel architecture. It's an engineering playbook — the same kind of structured workflow that any well-run development team follows. And it transformed their agents from unreliable to genuinely productive.
The Infrastructure Lesson
Here's where it gets interesting for us.
We've spent years making the case that reliable AI is an infrastructure problem, not a model problem. That the teams who invest in data foundations, evaluation frameworks, and ground truth engineering outperform the teams that chase the latest model release. That models are commodities, but infrastructure is a competitive moat.
Anthropic's work on agent harnesses is the same argument, applied to a different layer of the stack.
Their agents didn't fail because Claude wasn't smart enough. The model was perfectly capable. The agents failed because the surrounding infrastructure — the harness, the environment, the process — didn't support reliable execution over time. Once they fixed the infrastructure, the same model delivered dramatically better results.
Sound familiar? It's the same dynamic we see in every enterprise AI project. The model works fine in a demo. Then it meets real data, real scale, real edge cases — and the absence of proper infrastructure turns a capable model into an expensive liability.
"The model is the ceiling. The infrastructure is the floor. And most AI systems fail because nobody looked down."
Three Principles That Transfer
Anthropic's specific implementation was for autonomous coding agents, but the principles generalise to every AI system that needs to work reliably over time.
1. Make Progress Incremental and Observable
Anthropic's agents were instructed to work on one feature at a time and commit progress after each one. This prevented the one-shot trap and created a trail of observable, reversible changes.
The enterprise equivalent: break complex AI workflows into discrete, measurable steps. Don't build a monolithic pipeline that ingests raw data and produces a final answer. Build a chain of stages — ingestion, validation, enrichment, retrieval, generation, evaluation — where each stage produces observable output and can be independently monitored.
This is standard platform engineering. It's also the reason well-designed data pipelines are dramatically more reliable than monolithic ones. Each stage is a checkpoint. If something goes wrong, you know exactly where.
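A toy version of that staged design, in Python. The stage names mirror the text; the stage bodies are illustrative stand-ins for real components:

```python
def run_pipeline(raw: str, stages):
    """Run named stages in order, recording each stage's output.
    If something fails, the trace shows exactly which checkpoint broke."""
    trace = []
    data = raw
    for name, stage in stages:
        data = stage(data)
        trace.append((name, data))  # observable output at every checkpoint
    return data, trace

def validate(text: str) -> str:
    if not text:
        raise ValueError("validation failed: empty input")
    return text

stages = [
    ("ingestion",  str.strip),
    ("validation", validate),
    ("enrichment", lambda t: {"query": t, "source": "crm"}),
    ("retrieval",  lambda d: {**d, "context": ["doc-1", "doc-2"]}),
    ("generation", lambda d: f"answer for {d['query']!r} using {len(d['context'])} docs"),
]

answer, trace = run_pipeline("  renewal terms  ", stages)
```

A monolithic function that went straight from raw string to answer would produce the same output on the happy path, and no diagnostic trail the moment anything goes wrong.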
2. Define "Done" Before You Start
The feature list that Anthropic's initialiser agent creates at the start of a project isn't just task management. It's a contract. Each feature has a clear description and a binary state: passing or failing. Agents are explicitly forbidden from editing the feature definitions — they can only change the status.
This is ground truth for agent behaviour. The agent knows what "done" looks like before it writes a single line of code. It can't redefine success to match what it's already built.
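One way to encode that contract, sketched in Python (the class and field names are hypothetical): the definition is frozen, and the status is the only field that can change.

```python
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)
class FeatureSpec:
    """What 'done' means. Immutable: the agent cannot rewrite the goal."""
    id: str
    description: str

@dataclass
class FeatureState:
    """The only thing the agent may change is whether the spec passes."""
    spec: FeatureSpec
    passing: bool = False

feature = FeatureState(FeatureSpec("export-csv", "User can export results as CSV"))
feature.passing = True  # allowed: a status change

try:
    feature.spec.description = "User sees an export button"  # redefining success
except FrozenInstanceError:
    pass  # forbidden: the bar is defined externally and stays put
```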
We apply the same principle at the data layer. A golden dataset defines what "correct" looks like for your AI system. Your evaluation framework measures whether the system meets that standard. The system can't pass by lowering the bar — the bar is defined externally and maintained independently.
Without this, you get the premature victory lap. The agent (or the AI system, or the team) declares success based on what looks complete rather than what is complete.
3. Verify Before You Move On
The most telling finding in Anthropic's work: agents that were prompted to use browser automation for end-to-end testing caught bugs that unit tests and code inspection missed. The agent thought the feature worked. The test showed it didn't. Without that verification step, the agent would have moved on, and the bug would have compounded.
This maps directly to our evaluation maturity model. Most enterprise AI teams are at Level 0 — they "tried it a few times and it seemed fine." The teams that reach Level 2 and 3 — automated evaluation in CI/CD, continuous evaluation in production — are the ones whose systems actually work reliably.
Verification is expensive. It's slow. It's boring. It's also the difference between "the demo went well" and "the system works."
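In harness terms, the gate is small. A sketch, where `run_e2e_check` stands in for whatever end-to-end test actually exercises the feature:

```python
def mark_complete(feature: dict, run_e2e_check) -> bool:
    """A feature becomes 'passing' only after its end-to-end check passes.
    'The code looks right' is never sufficient evidence."""
    if not run_e2e_check(feature):
        return False  # don't move on: the bug stops here instead of compounding
    feature["status"] = "passing"
    return True
```

The logic is trivial. What matters is where it sits: between "the agent believes it's done" and "the system records it as done."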
Why This Matters Now
AI agents are about to become a standard part of the enterprise toolkit. Not someday — now. Every major model provider is shipping agent capabilities. Every enterprise is evaluating where agents can automate complex, multi-step workflows.
The question isn't whether enterprises will deploy agents. It's whether those agents will work reliably. And the answer, as Anthropic's research clearly demonstrates, depends almost entirely on the infrastructure around the agent — not on the agent itself.
The teams that treat agent deployment as a model selection problem will discover what every team discovers: the model is the easy part. The hard part is the playbook — the progress tracking, the state management, the evaluation checkpoints, the clean handoffs, the governance framework that ensures the agent is actually doing what you think it's doing.
We've seen this story before. It's the same story we've been telling about data pipelines, evaluation frameworks, and ground truth engineering. The technology is ready. The infrastructure usually isn't.
The model doesn't need a bigger context window. It needs a better engineering process. And building that process? That's infrastructure work. It's unglamorous. It doesn't demo well.
It's also the only thing that works.

