May 19, 2026

Tool Use Is Not Task Completion

Why agent reliability depends on preserving the boundary between intention, action, observable state, and truthful final claims.

One of the strangest things about AI agents is that they can sound operational before they are operational.

They can say:

“I am creating the file now.”
“I will check the logs.”
“I am running the tests.”
“I will send the message.”
“I am deploying the fix.”

In a chatbot, this kind of language is mostly harmless. It is a conversational bridge.

In an agent, it becomes something else.

It becomes an implied state transition.

The user hears:

The system has started acting.

But sometimes the observable trace says:

no file created;
no log inspection;
no test command;
no message send;
no deployment;
no durable state change.

That gap is one of the most important reliability problems in tool-using agents.

I call it Intention Without Execution.

The agent does not exactly claim “done.” It claims, or strongly implies, “I am doing it.” But no corresponding action happens.

That distinction matters.

False completion says:

I did it.

Intention without execution says:

I am doing it now.

Then nothing happens.

Both can break trust, but in slightly different ways. False completion creates a false belief that the world has changed. Intention without execution creates a false belief that the agent has entered an operational path.

Agent language has operational weight

When a human says “I am sending it now,” the sentence is not just a description.

It is a coordination signal.

It tells the other person:

stop preparing alternatives;
wait for the result;
assume ownership has transferred;
expect a state change soon.

In human teamwork, this is basic coordination. If a pilot says “gear down,” the gear state should change. If a surgeon says “clamping now,” the clamp should be moving. If an engineer says “deploying now,” the deployment should exist somewhere.

The same is true for agents.

Once an AI system has tools, files, terminals, browsers, schedulers, deployment surfaces, memory, APIs, or messaging capabilities, its language is no longer just text.

Some phrases become operational commitments.

“I will create the file now” should be followed by a file-create action.
“I am checking the logs” should be followed by log inspection.
“I sent the message” should be supported by a send event.
“I am running the tests” should be followed by a test run.

If the agent cannot act, it should say that.

If it has not acted yet, it should say that.

If it needs more information, it should ask.

What it should not do is cross the psychological boundary into “action is happening” while the runtime remains still.

Four states, not one

Many agent evaluations still blur four states that should be kept separate:

Intention
The model has selected, in language or internal plan, something it wants to do.

Dispatch
The model or runtime has emitted a concrete tool/action event.

Execution
The action has actually run without being blocked, ignored, malformed, or routed to the wrong environment.

Evidence
The resulting state proves the user obligation was satisfied.

Those are not the same thing.

An intention is not a dispatch. A dispatch is not execution. Execution is not evidence. Evidence is not a final claim until the agent reports it truthfully.

The reliable chain is:

USER OBLIGATION
-> CORRECT ACTION FAMILY
-> ACTION EVENT
-> OBSERVABLE POST-STATE
-> FINAL CLAIM NO STRONGER THAN EVIDENCE

Break any link and the agent may still look fluent. It may even look busy.

But it is not operationally reliable.
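
One way to keep the four states apart is to make them an ordered type, so a step can never be reported at a stage it has not reached. The sketch below is illustrative only: the Stage names follow the text, while furthest_stage and its three flags are hypothetical stand-ins for whatever a real runtime records.

```python
from enum import IntEnum


class Stage(IntEnum):
    """The four states from the text, ordered from weakest to strongest."""
    INTENTION = 1   # the model said or planned what it wants to do
    DISPATCH = 2    # a concrete tool/action event was emitted
    EXECUTION = 3   # the action actually ran in the intended environment
    EVIDENCE = 4    # the resulting state shows the obligation was satisfied


def furthest_stage(dispatched: bool, executed: bool, verified: bool) -> Stage:
    """Return the last stage reached without skipping an earlier one.

    The flags are hypothetical: an assumed runtime would set them from its
    own trace. A later flag never counts if an earlier one is missing.
    """
    if not dispatched:
        return Stage.INTENTION
    if not executed:
        return Stage.DISPATCH
    if not verified:
        return Stage.EXECUTION
    return Stage.EVIDENCE


if __name__ == "__main__":
    # "I am creating the file now." followed by no tool event at all.
    print(furthest_stage(dispatched=False, executed=False, verified=False).name)
```

Because the stages are ordered, any report can be compared against the stage the trace actually supports.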

Tool use is too weak as a success criterion

A lot of agent discourse still treats tool use as the hard boundary.

Did the model call a tool?

Useful question.

But not enough.

The better question is:

Did the right action, in the right environment, produce the expected observable state, and did the final answer stay inside that evidence?

Because a tool-dependent step can fail in many ways:

the tool was never called;
the wrong tool was called;
the right tool was called with invalid arguments;
the right tool ran against the wrong path, repo, account, profile, branch, client, or workspace;
the tool produced a partial state change;
the tool returned an error that the model ignored;
the model stopped before verifying the result;
the final answer overclaimed what the trace proves.

Tool use is not task completion. It is one event inside the task-completion chain.
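
The failure list above can be turned into a rough check that names the first broken link. This is a minimal sketch under stated assumptions: ToolEvent, its fields, and the tool names in the example are hypothetical, and a real trace format will differ. If the function returns a reason and the agent still says "done", that is the last failure on the list: the final answer overclaims what the trace proves.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ToolEvent:
    # Hypothetical trace record; field names are assumptions for this sketch.
    tool: str                     # e.g. "write_file"
    target: str                   # path, repo, account, branch, or workspace
    args_valid: bool = True
    error: Optional[str] = None   # error text returned by the tool, if any
    complete: bool = True         # False when only part of the change landed


def unsupported_success_reason(
    events: Sequence[ToolEvent],
    expected_tool: str,
    expected_target: str,
    verified: bool,
) -> Optional[str]:
    """Return why a success claim would be unsupported, or None if it holds."""
    if not events:
        return "tool never called"
    matching = [e for e in events if e.tool == expected_tool]
    if not matching:
        return "wrong tool"
    event = matching[-1]  # judge the most recent attempt with the right tool
    if not event.args_valid:
        return "invalid arguments"
    if event.target != expected_target:
        return "wrong target"
    if event.error is not None:
        return "tool returned an error"
    if not event.complete:
        return "partial state change"
    if not verified:
        return "result not verified"
    return None


if __name__ == "__main__":
    trace = [ToolEvent(tool="write_file", target="/tmp/other/notes.md")]
    print(unsupported_success_reason(trace, "write_file", "/tmp/work/notes.md", True))
```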

For coding agents, the difference is obvious:

“I fixed the bug” is not supported by a patch alone.
“I ran the tests” is not supported by saying the tests should pass.
“I created the file” is not supported by a plan to create it.
“I deployed it” is not supported by a local build.

In production, the unit of reliability is not:

model said useful words

or even:

model used a tool

It is:

human obligation transformed into verified external state.

The small failure that reveals the big one

The most interesting failures are often boring.

Not spectacular hallucinations. Not malicious behavior. Not dramatic jailbreaks.

Just this:

User asks for a durable action.
Agent says: “I am creating it now.”
The turn completes.
No creation action happened.
The user has to ask: “Where did you create it?”
Only then does the agent perform the action.

That may look minor.

But it is not.

This is not a dunk on one model or one product. It is a class of failure any agent stack can produce.

Sometimes the model only narrated intent. Sometimes the model emitted an action but the runtime did not dispatch it. Sometimes the action existed internally but the UI made the turn look complete before the user could see it. Sometimes the action finally happened only after the user challenged the agent.

Those are different engineering root causes.

But they create the same user-visible problem:

the human believes the agent crossed the action boundary, while the observable state does not prove it.

If this happens inside a trivial file-creation task, the same pattern can happen in higher-stakes workflows:

“I am sending the invoice.”
“I am updating the CRM.”
“I am scheduling the job.”
“I am patching the production config.”
“I am notifying the client.”
“I am rotating the key.”
“I am checking the incident logs.”
“I am deploying the fix.”

In each case, the words create a belief in the human:

the work is now underway.

If the action trace does not support that belief, the agent has already damaged the coordination loop.

The hidden cost is monitoring debt

People often say:

Just check the agent’s work.

That is fine for low-autonomy use.

But the whole point of agents is delegated work.

The economic promise is not that the user can supervise every micro-step forever. The promise is that the agent can take an obligation, act through a tool environment, handle routine failures, and report back truthfully.

If the user must manually verify every transition from:

intent -> action
action -> state
state -> claim

then the agent has pushed monitoring debt back onto the human.

It may still be useful.

But it is not yet a trustworthy delegate.

This is a cognitive ergonomics problem.

The human operator is maintaining a mental model of:

what the agent is doing;
what has been done;
what is blocked;
what still needs attention;
which claims can be trusted.

When the agent narrates intention without execution, that mental model becomes polluted. The human thinks the task is in progress. The runtime says it is not.

This is how automation becomes a source of stress instead of relief.

A better verifier

For agent evaluation, the basic scoring unit should look like this:

User obligation
What did the user actually ask the agent to make true?

Expected action family
Did the task require text advice, clarification, file mutation, message sending, test execution, browser action, API call, scheduling, deployment, or external inspection?

Actual trace
What did the agent actually do?

Observable post-state
What changed in the world, file system, service, log, browser, database, queue, scheduler, repo, or external surface?

Final claim
What did the agent tell the human?

Belief effect
What would a reasonable user now believe?

Verdict
Did the final belief match the observable state?

This is stricter than “did the answer look good?”

It is also stricter than “did the model call tools?”

But it is the standard production agents need.

Because users do not delegate tool calls.

They delegate obligations.
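
As a rough illustration, the scoring unit can be written as a record with a verdict function. Everything here is an assumption made for the sketch: the field names, the action-family strings, and the reduction of "belief effect" to a single boolean that a human rater or oracle would have to supply.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VerifierCase:
    # Field names mirror the checklist above; the types are assumptions.
    obligation: str                   # what the user asked to be made true
    expected_action_family: str       # e.g. "file_mutation" (hypothetical label)
    actual_trace: List[str] = field(default_factory=list)  # families observed
    post_state_satisfied: bool = False    # did the observable state change as required?
    final_claim: str = ""                 # what the agent told the human
    user_would_believe_done: bool = False  # belief effect, judged by a rater


def verdict(case: VerifierCase) -> str:
    """Compare the belief induced in the user with the observable state."""
    acted = case.expected_action_family in case.actual_trace
    actually_done = acted and case.post_state_satisfied
    if case.user_would_believe_done and not actually_done:
        return "overclaim"
    if actually_done and not case.user_would_believe_done:
        return "underclaim"
    return "match"


if __name__ == "__main__":
    case = VerifierCase(
        obligation="create notes.md in the workspace",
        expected_action_family="file_mutation",
        actual_trace=[],                      # no action event in the turn
        final_claim="I am creating it now.",
        user_would_believe_done=True,
    )
    print(verdict(case))
```

The verdict depends only on belief versus state, which is why a fluent answer with an empty trace scores as an overclaim.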

Three healthy endings

An agent does not always need to succeed.

In fact, one of the most important signs of a good agent is that it knows how not to fake success.

For an action-required task, there are three healthy endings.

Success with evidence
“I created the file at /path/to/file.md and verified it exists.”

Truthful blocker
“I could not create the file because the workspace is read-only. No file was written.”

Clarification before action
“I can create it, but I need the target path first.”

All three are reliable.

The bad endings are different:

“I am creating it now.” No action.
“Done.” No evidence.
“It should be fixed.” No test, no inspection, no state check.
“I cannot inspect that.” No attempt to inspect the runtime even though tools or files are available.

The goal is not an agent that always says yes.

The goal is an agent that preserves the boundary between:

I intend to act.
I attempted to act.
I acted.
I verified the result.
I am blocked.
I need more information.

Those distinctions are the skeleton of operational trust.
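
One way to enforce the three healthy endings is to give the final report a type that cannot express success without evidence. The sketch below is a hypothetical design, not an existing API: Success, Blocked, NeedsInput, and render are names invented for illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Success:
    summary: str
    evidence: str  # e.g. the output of a check that the file exists

    def __post_init__(self) -> None:
        # A success report with no evidence is rejected at construction time.
        if not self.evidence.strip():
            raise ValueError("success requires evidence")


@dataclass(frozen=True)
class Blocked:
    reason: str
    state_changed: bool = False  # say explicitly whether anything was written


@dataclass(frozen=True)
class NeedsInput:
    question: str


Ending = Union[Success, Blocked, NeedsInput]


def render(ending: Ending) -> str:
    """Turn a typed ending into the sentence shown to the user."""
    if isinstance(ending, Success):
        return f"{ending.summary} Evidence: {ending.evidence}"
    if isinstance(ending, Blocked):
        suffix = "" if ending.state_changed else " Nothing was changed."
        return f"Blocked: {ending.reason}.{suffix}"
    return f"Before acting, I need: {ending.question}"


if __name__ == "__main__":
    print(render(Blocked("the workspace is read-only")))
    print(render(NeedsInput("the target path")))
```

With this shape, "Done." with no evidence is not a sentence the agent can produce through the normal reporting path.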

A simple runtime guard

One practical guardrail is simple:

Action-intent language requires an action event.

If an assistant turn contains immediate action language for a durable operation, the same turn should contain a compatible action event before the turn completes.

Examples:

“I am creating…”
“I am updating…”
“I am editing…”
“I am checking…”
“I am running…”
“I am testing…”
“I am sending…”
“I am scheduling…”
“I am deploying…”

These phrases should be treated as operational commitments.

If there is no compatible action event, the runtime should do one of the following:

force the action path;
retry the turn;
rewrite the response to a truthful non-action state;
ask for the missing dependency;
expose a clear UI state: “not executed.”

This may sound pedantic.

It is not.

The difference between “about to act,” “acting,” and “acted” is exactly where many agent trust failures live.
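
A minimal version of this guard can be sketched in a few lines. It is deliberately naive and rests on assumptions: the event kinds ("write_file", "run_command", and so on) are hypothetical names, the verb list is just the examples above, and a regular expression is only a stand-in for a more careful detector of action-intent language.

```python
import re
from typing import List, Sequence

# Verb stems from the examples above, mapped to event kinds that would count
# as a compatible action. The event kind names are hypothetical.
COMPATIBLE = {
    "creat": {"write_file"},
    "updat": {"write_file", "api_call"},
    "edit": {"write_file"},
    "check": {"read_file", "run_command", "api_call"},
    "run": {"run_command"},
    "test": {"run_command"},
    "send": {"send_message"},
    "schedul": {"schedule_job"},
    "deploy": {"deploy"},
}

# Matches first-person immediate action language such as "I am creating"
# or "I will now deploy". Capability statements like "I can create" do not match.
ACTION_INTENT = re.compile(
    r"\bI\s+(?:am|will)\s+(?:now\s+)?(" + "|".join(COMPATIBLE) + r")\w*",
    re.IGNORECASE,
)


def unbacked_commitments(text: str, event_kinds: Sequence[str]) -> List[str]:
    """Return verb stems the turn committed to without a compatible event."""
    seen = set(event_kinds)
    missing: List[str] = []
    for match in ACTION_INTENT.finditer(text):
        stem = match.group(1).lower()
        if not (COMPATIBLE[stem] & seen) and stem not in missing:
            missing.append(stem)
    return missing


if __name__ == "__main__":
    turn_text = "I am creating the file now."
    if unbacked_commitments(turn_text, event_kinds=[]):
        # One of the responses listed above: surface a clear UI state.
        print("not executed")
```

A non-empty result is the trigger for one of the responses listed above: force the action path, retry, rewrite the response, ask for the missing dependency, or mark the turn as not executed.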

UI should not hide the boundary

Agent interfaces also need to make this distinction visible.

Today, many UIs make a turn feel complete because the assistant produced polished text.

But for tool-using agents, a turn should not feel complete merely because the prose ended.

The UI should help the user distinguish:

planning;
action dispatched;
action running;
action failed;
state verified;
unverified.

If the UI collapses all of that into a smooth assistant message, the user has to infer too much.

Good agent UX should reduce inference.

The user should not have to ask:

Did it actually do anything?

The trace should answer that.

What builders should instrument

If you are building agents, log and score at least these things:

user obligation;
expected action family;
action-intent language;
actual tool/action events;
tool errors;
retries and strategy changes;
target environment or workspace;
resulting observable state;
final claim;
evidence supporting the final claim;
whether the user belief induced by the final answer matches reality.

Then build evals around contrast pairs.

Bad:

I am running the tests now.

No test command.

Good:

The agent runs the test command, then summarizes the actual result.

Bad:

I deployed the fix.

Only local build completed.

Good:

The local build passed, but I have not deployed it yet.

This is how agents become trustworthy: not by sounding more confident, but by binding confidence to evidence.

Why this is not solved by bigger models alone

Better reasoning helps. It can reduce wrong tool choice, improve recovery, and make the agent better at understanding the user’s obligation.

But the intention-action boundary is also a training, runtime, UI, and evaluation problem.

If the reward signal treats fluent progress narration as good behavior, models will learn to narrate progress.

If the runtime allows a turn to complete after action-language with no action event, the boundary remains leaky.

If the UI makes plans look like operations, users will overtrust.

If evals stop at “tool call present,” they will overcount success.

A production agent needs a stronger invariant:

FINAL CLAIM <= OBSERVABLE EVIDENCE

And one more:

ACTION LANGUAGE <= ACTION TRACE

If the words imply that action is happening, the trace should show action.

If the trace does not show action, the words should be downgraded.
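
Both invariants become mechanical once language and evidence are placed on the same ordered scale. The sketch below assumes such a scale exists; the Level names and the violations and downgrade helpers are hypothetical, and assigning a level to a sentence or a trace is the hard part this example leaves out.

```python
from enum import IntEnum
from typing import List


class Level(IntEnum):
    """A shared, ordered scale for what was said and what the trace shows."""
    NOTHING = 0
    PLANNED = 1
    DISPATCHED = 2
    EXECUTED = 3
    VERIFIED = 4


def violations(
    action_language: Level,
    action_trace: Level,
    final_claim: Level,
    evidence: Level,
) -> List[str]:
    """List which of the two invariants a turn breaks."""
    found: List[str] = []
    if action_language > action_trace:
        found.append("ACTION LANGUAGE > ACTION TRACE")
    if final_claim > evidence:
        found.append("FINAL CLAIM > OBSERVABLE EVIDENCE")
    return found


def downgrade(claimed: Level, supported: Level) -> Level:
    """Return the strongest claim the evidence allows."""
    return min(claimed, supported)


if __name__ == "__main__":
    # "I deployed the fix." when only a local build ran and nothing was checked.
    print(violations(Level.EXECUTED, Level.EXECUTED, Level.VERIFIED, Level.EXECUTED))
    print(downgrade(Level.VERIFIED, Level.EXECUTED).name)
```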

Or, as @steverab puts it: reliability gains lag behind capability progress.

The shortest version

Intention is not action.

A plan is not a tool call.

A tool call is not execution.

Execution is not completion.

Completion is not trustworthy until observable state supports the final claim.

For agents, the question is not:

Did it sound helpful?

And not only:

Did it use tools?

The question is:

Did the user obligation become verified external state, and did the agent report that state truthfully?

Until agents preserve that chain, they will keep producing a strange kind of automation theater: fluent enough to earn trust, not disciplined enough to deserve it.

The next layer of agent reliability is not just better reasoning.

It is action discipline.

Words that imply action must be coupled to action.

Claims of completion must be coupled to evidence.

Everything else should be reported as a blocker, a plan, or an unverified state.

That is the difference between an assistant that talks about work and an agent that can be trusted with work.

— Julien Talbot

This analysis comes from the field observatory — replayable cases, oracles, and failure families documented on Labs.