Jun 19, 2026

AI Agents Are Operators. They Need Ergonomics.

Benchmarks test models. Real work tests situated operators: model, harness, tools, permissions, evidence, recovery, and human belief.

Most agent failures are not spectacular.

They are small breaks in the work loop.

The agent says it will check the file, but never calls the tool. It calls a tool, but never verifies the post-state. It creates something somewhere, then gives the human an inaccessible path instead of the usable artifact. It retries the same invalid call because it does not understand the harness boundary. It says “done” while the observable state still says “not done”.

These failures are easy to misclassify as model weakness.

Sometimes they are. But often the model is only one part of the failing system.

In production, an AI agent is not just a model with tools.

It is a situated operator.

The Useful Unit Is Larger Than The Model

The practical unit of analysis is:

model
+ harness
+ tools
+ memory
+ permissions
+ runtime state
+ target environment
+ human obligation
+ evidence channel
+ recovery path

The model reasons inside that system. The harness defines what it can see and do. Tools define possible actions. Permissions define what is allowed. Runtime state defines what is currently true. The target environment defines what can be changed. The human obligation defines why the action matters. The evidence channel defines what counts as done. The recovery path defines what happens when the agent reaches the edge of its competence.
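One way to make that unit inspectable is to write it down as a record and ask which links are missing. The sketch below is a minimal illustration, not a standard schema: the class name, field names, and example values are all assumptions chosen to mirror the list above.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class SituatedOperator:
    """Hypothetical record of the ten links named above; field names are illustrative."""
    model: Optional[str] = None
    harness: Optional[str] = None
    tools: Optional[list[str]] = None
    memory: Optional[str] = None
    permissions: Optional[list[str]] = None
    runtime_state: Optional[str] = None
    target_environment: Optional[str] = None
    human_obligation: Optional[str] = None
    evidence_channel: Optional[str] = None
    recovery_path: Optional[str] = None


def missing_links(op: SituatedOperator) -> list[str]:
    """Return the names of links that are unset or empty, in declaration order."""
    return [f.name for f in fields(op) if not getattr(op, f.name)]


# Example values are placeholders, not real products or tools.
op = SituatedOperator(
    model="some-model",
    harness="cli-harness",
    tools=["read_file", "write_file"],
    permissions=["workspace:write"],
    human_obligation="update the changelog",
)
print(missing_links(op))
```

In this example the operator has a model, a harness, tools, permissions, and an obligation, but no declared evidence channel or recovery path. It can act. It cannot yet show that the work is done or say what happens when it is not.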

If any of those links break, the agent may still sound intelligent.

It may still fail as an operator.

Tool Use Is Not Work Completion

An agent can use the right tool and still leave the work unfinished.

It can read the right file but edit the wrong branch. It can call the right API but ignore the error. It can generate a plausible answer while never checking the actual state. It can narrate a plan so convincingly that the human believes action happened before any action has occurred.

This is why “tool use” is too weak as a success signal.

In real work, the question is not:

“Did the agent call a tool?”

The question is:

“Did the work state change in the way the human now believes it changed?”

That is the difference between action and evidence.
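The distinction can be shown in a few lines. This is a minimal sketch, assuming the work is "write a report file": `write_report` stands in for any state-changing tool call, and `verify_post_state` is a hypothetical check that re-reads the target instead of trusting the call.

```python
import hashlib
import tempfile
from pathlib import Path


def write_report(path: Path, content: str) -> None:
    """The action: a stand-in for any tool call that changes state."""
    path.write_bytes(content.encode("utf-8"))


def verify_post_state(path: Path, expected: str) -> bool:
    """The evidence: re-read the target and compare it with what was intended."""
    if not path.is_file():
        return False
    actual = hashlib.sha256(path.read_bytes()).hexdigest()
    wanted = hashlib.sha256(expected.encode("utf-8")).hexdigest()
    return actual == wanted


workdir = Path(tempfile.mkdtemp())
content = "quarterly numbers\n"

# Case 1: the tool was called and the post-state was checked.
target = workdir / "report.txt"
write_report(target, content)
evidence_ok = verify_post_state(target, content)

# Case 2: the agent only narrated the action; nothing was written.
narrated = workdir / "other.txt"
claimed_only = verify_post_state(narrated, content)

print(evidence_ok, claimed_only)
```

Only the first case supports the sentence "the file is ready". The second case may sound identical in the agent's final message, which is exactly the problem.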

The Dangerous Output Is Belief

An agent’s final answer is not just text.

In delegated work, it becomes an input into human belief.

If the agent says “the migration is complete”, the human may merge. If it says “the file is ready”, the human may forward. If it says “message sent”, the human may stop monitoring. If it says “the issue was fixed”, the human may close the ticket.

The final claim changes what the human does next.

So the claim must stay inside the evidence.

FINAL CLAIM <= OBSERVABLE EVIDENCE
HUMAN BELIEF <= VERIFIED POST-STATE
AUTONOMY <= EVIDENCE + BOUNDED PERMISSIONS + RECOVERY

These are not slogans. They are operating constraints: read each <= as “must not exceed”.
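The first constraint can be sketched as a gate on the final message. The claim ladder and evidence labels below are hypothetical, chosen only for illustration; the point is that the strongest allowed claim is computed from the evidence, not from the agent's confidence.

```python
# Hypothetical claim ladder, weakest to strongest, with the evidence each claim needs.
CLAIM_REQUIREMENTS = [
    ("attempted", set()),
    ("executed", {"tool_call_returned_ok"}),
    ("verified", {"tool_call_returned_ok", "post_state_checked"}),
]


def strongest_supported_claim(evidence: set[str]) -> str:
    """Return the strongest claim whose required evidence is all present."""
    supported = "attempted"
    for claim, required in CLAIM_REQUIREMENTS:
        if required <= evidence:  # set inclusion: required is a subset of evidence
            supported = claim
    return supported


print(strongest_supported_claim(set()))
print(strongest_supported_claim({"tool_call_returned_ok"}))
print(strongest_supported_claim({"tool_call_returned_ok", "post_state_checked"}))
```

Under these assumptions, an agent holding only a successful tool return may say "executed". It may not say "verified", and a harness could refuse to emit "done" until it can.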

When the claim outruns the evidence, the agent creates false operational belief.

That is one of the most dangerous agent failures because it suppresses recovery. A visible failure preserves the possibility of repair. A false success closes the loop too early.

Harness Awareness Is Part Of Agent Reliability

Human factors has always studied operators inside environments.

Pilots do not act from intelligence alone. They act from instruments, procedures, controls, alarms, crew communication, workload, training, and recovery protocols. Control-room operators do not merely “know” the system. They maintain situation awareness through displays, cues, constraints, feedback, and shared protocols.

Agents have a version of this problem.

Their cockpit is the harness.

The harness tells the agent which tools exist, which state is visible, what arguments are valid, where files live, which permissions apply, what the current workspace is, and how execution feedback returns.

An agent with poor harness awareness can produce good reasoning and poor action.

It may invent a command instead of using an existing tool. It may treat a path as a file delivery. It may retry invalid arguments. It may not understand that a tool’s return value is not the user’s final artifact. It may fail to distinguish “I can describe the action” from “I have executed the action”.

That is not only a prompt problem.

It is an ergonomics problem.
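Some of this can be handled at the harness boundary instead of in the prompt. The sketch below assumes a hypothetical tool registry and a hypothetical `HarnessGuard` class. It rejects unknown tools and missing arguments, and it blocks an identical retry of a call that was already rejected.

```python
# Hypothetical tool registry: tool name -> required argument names.
TOOL_SCHEMAS = {
    "read_file": {"path"},
    "send_message": {"recipient", "body"},
}


class HarnessGuard:
    """Rejects unknown tools, missing arguments, and repeats of a rejected call."""

    def __init__(self, schemas: dict[str, set[str]]):
        self.schemas = schemas
        self.rejected: set[tuple[str, str]] = set()

    def check(self, tool: str, args: dict) -> str:
        # repr of the sorted items gives a stable key even for unhashable values.
        key = (tool, repr(sorted(args.items())))
        if key in self.rejected:
            return "blocked: identical call already rejected, change the call or escalate"
        if tool not in self.schemas:
            self.rejected.add(key)
            return f"rejected: unknown tool {tool!r}"
        missing = self.schemas[tool] - set(args)
        if missing:
            self.rejected.add(key)
            return f"rejected: missing arguments {sorted(missing)}"
        return "ok"


guard = HarnessGuard(TOOL_SCHEMAS)
print(guard.check("read_file", {"path": "notes.txt"}))
print(guard.check("send_message", {"body": "hi"}))
print(guard.check("send_message", {"body": "hi"}))
print(guard.check("cat", {"path": "notes.txt"}))
```

The second identical call gets a different answer from the first. That is the ergonomic choice: the feedback tells the operator that repeating itself is not a recovery path.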

Agent Ergonomics

I call this frame Agent Ergonomics: human factors for AI operators.

It studies and designs how AI agents perceive, act, verify, recover, and communicate inside real work environments.

The integrity chain looks like this:

human obligation
-> agent interpretation
-> harness / tools
-> observed action
-> verified post-state
-> final claim
-> human belief
-> decision or recovery

This chain matters to labs and organizations for different reasons.

For labs, Agent Ergonomics turns field failures into product signal. A useful case is not a raw trace dump. It is a reduced artifact:

task
-> environment
-> observed action
-> verified post-state
-> failure family
-> oracle
-> product decision

That is how a real failure becomes an eval case, an oracle, a regression test, or a checkpoint delta.
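A reduced artifact of this kind can be small. The sketch below is one possible shape, assuming the oracle is a plain predicate over a post-state dictionary. The `FieldCase` class, its field names, and the example case are all hypothetical and do not come from a real trace.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FieldCase:
    """Hypothetical reduced case; fields mirror the chain above."""
    task: str
    environment: str
    observed_action: str
    verified_post_state: dict
    failure_family: str
    oracle: Callable[[dict], bool]
    product_decision: str


def replay(case: FieldCase, post_state: dict) -> bool:
    """Regression check: does a run's post-state satisfy the case's oracle?"""
    return case.oracle(post_state)


# Invented example for illustration only.
case = FieldCase(
    task="send the weekly summary to the team channel",
    environment="chat harness with a send_message tool",
    observed_action="agent described the message but made no tool call",
    verified_post_state={"messages_sent": 0},
    failure_family="claimed action without execution",
    oracle=lambda state: state.get("messages_sent", 0) >= 1,
    product_decision="add as regression test",
)

print(replay(case, case.verified_post_state))  # the original failing run
print(replay(case, {"messages_sent": 1}))      # a later run that actually sent
```

The oracle judges the post-state, not the transcript. A later run passes only if the message count changed, whatever the agent said about it.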

For organizations, Agent Ergonomics makes delegation supervisable. The question is not “how much autonomy can we give the agent?” The question is “which work loop can tolerate this agent, at this permission level, with this evidence and this recovery path?”

Autonomy without evidence only moves work onto humans.

The human still has to verify, correct, explain, supervise, and answer for the result. If that verification work is invisible, the deployment may look efficient while the organization absorbs hidden load.
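The delegation question above can be asked mechanically, at least as a first pass. This is a sketch under stated assumptions: the numeric permission scale, the `WorkLoop` and `AgentSetup` records, and the example loops are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WorkLoop:
    name: str
    max_permission: int  # highest level this loop tolerates (hypothetical scale)
    needs_verified_post_state: bool
    needs_recovery_path: bool


@dataclass
class AgentSetup:
    permission_level: int
    can_verify_post_state: bool
    has_recovery_path: bool


def delegation_blockers(loop: WorkLoop, setup: AgentSetup) -> list[str]:
    """Return the reasons this loop cannot tolerate this setup; empty means it can."""
    reasons = []
    if setup.permission_level > loop.max_permission:
        reasons.append("permission level exceeds what the loop tolerates")
    if loop.needs_verified_post_state and not setup.can_verify_post_state:
        reasons.append("no evidence channel for the post-state")
    if loop.needs_recovery_path and not setup.has_recovery_path:
        reasons.append("no recovery path")
    return reasons


drafting = WorkLoop("draft internal notes", 1, False, False)
migration = WorkLoop("run database migration", 2, True, True)
setup = AgentSetup(permission_level=2, can_verify_post_state=True, has_recovery_path=False)

print(delegation_blockers(drafting, setup))
print(delegation_blockers(migration, setup))
```

The same agent setup is blocked from both loops for different reasons: too much permission for the low-stakes loop, and no recovery path for the high-stakes one. The answer belongs to the pairing, not to the agent.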

Benchmarks Are Not Enough

Benchmarks often test decontextualized cognition.

One task. One prompt. One answer. Sometimes one patch.

Agent loops are situated cognition.

Many turns. Tools. Missing files. Bad schemas. Partial states. Human corrections. Runtime surprises. Recovery.

A benchmark can tell you whether a model can solve a task.

It rarely tells you whether the agent can remain a useful operator while the work system changes around it.

That is why a model can feel brilliant in analysis and brittle in production.

Deep reasoning is not the same as situated action.

The Design Target

The goal is not to make agents talk more like humans.

The goal is to make their operation explicit, their claims calibrated, their permissions bounded, and their failures treatable.

For labs, that means better field-derived evals.

For organizations, that means safer human-agent cooperation.

For both, it requires looking past the model and into the work system.

AI agents are operators.

We should design and evaluate them like situated operators.

That is the work of Agent Ergonomics.

This analysis comes from the field observatory — replayable cases, oracles, and failure families documented on Labs.