May 29, 2026

Raw Traces Are Not Evals

The missing layer between real agent failures and measurable model progress: reducing messy traces into replayable eval seeds without laundering the signal.

The missing layer between real agent failures and measurable model progress

The dangerous agent failure is not when the model is obviously wrong. It is when the model sounds operational while the work state is still broken.

A user asks an agent to do something. The agent plans, calls tools, reads files, updates records, retries, gets partial evidence, and reports back. The prose looks confident. The trace looks busy. The human starts believing the work moved.

But sometimes nothing important changed. Or only half changed. Or changed in the wrong place. The agent says done while the environment quietly says no. That is the signal. And that is why raw traces are not evals.

A raw trace is where the failure was discovered. An eval is the reduced object that preserves the failure mechanism well enough to be replayed, measured, compared, and improved. If you confuse the two, reliability work becomes noise: private context, one-off runtime conditions, unrepeatable logs, vibes about model quality, and no way to tell whether the next checkpoint actually got better.

A raw trace is not the right unit

Agents no longer only answer questions. They change the state of work. They write files, edit repos, send messages, update records, route tickets, and make claims about what changed. That changes the evaluation unit.

The question is no longer only: Was the answer good? The question is: Did the delegated obligation become a verified state change, and did the final claim stay inside the evidence?

Raw traces are useful because they contain the mess: the user’s real obligation, the agent’s interpretation, the tools available, the missing context, the intermediate actions, the local environment, the stale evidence, and the human risk created by the final claim.

That mess matters. But it is a bad public or training object. It may contain private names, local paths, business context, files, messages, customer data, secrets, permissions, or runtime conditions that cannot be reproduced later.

A raw trace answers: what happened here? An eval answers: can a model handle this class of situation under controlled conditions? The useful move is neither dump traces nor ignore traces. The useful move is reduction.

Field failures are discovery instruments

Benchmarks are usually built from clean tasks: prompt, expected answer, reference solution, hidden test. That works for many problems. Agent reliability breaks differently.

The important failure is often not in the final answer. It is in the transition between delegation, action, observed state, evidence inspected, final claim, and what the human now believes.

Real work exposes failure modes clean benchmarks miss: false completion, intention without execution, partial updates reported as complete, stale evidence treated as current, tool success confused with task success, recovery from one error followed by a new unsupported claim.

See The Most Dangerous Agent Failure Is Not Hallucination to understand the different failure modes.

These are not just bad answers. They are broken operational transitions. The agent is producing beliefs about the state of work. If those beliefs are wrong, the human may stop checking, wait for work that is not happening, trust a state that does not exist, or carry responsibility for a system they could not realistically supervise.

That is high-value field signal. But it only becomes useful when it becomes an eval.

The missing transformation

The transformation looks like this:

field failure
-> reduced case
-> environment contract
-> observable oracle
-> replay
-> checkpoint or product delta

Privacy is not an afterthought

Privacy is not a compliance wrapper added at the end. It is part of the eval method. The goal is not to ship private work into someone else’s training loop. The goal is to preserve causal structure while removing private material.

Replace real people with neutral actors, real files with synthetic files, real messages with synthetic messages, customer data with structurally equivalent data, local paths with neutral paths, and business context with the minimum task context needed to reproduce the trap.

The names do not matter. The trap matters.

For example, the original failure might involve a real client record, a private document, a live file, and a final claim that overstates what changed. The eval seed does not need the client. It needs the causal pattern:

obligation: update external state
agent action: partial update or wrong target
evidence: tool result is incomplete or ambiguous
bad behavior: report full completion
good behavior: inspect state, report partial completion, request next step or repair
oracle: compare final environment state to target state

The oracle is the work state

For chatbots, an oracle can often be textual: did the answer contain the right fact, solve the math problem, or match the reference? For agents, text is not enough.

The question is not only: did the agent answer correctly? The question is: did the obligation become a verified state change?

If the agent was asked to create a file, inspect the filesystem. If it was asked to send a message, check that the message went to the right target. If it was asked to update a record, compare the final record to the target state. If it investigated a failure, verify it inspected the relevant evidence instead of inventing a plausible explanation.

And if it could not complete the task, the oracle should verify that it stopped truthfully, preserved state, and handed the human a safe next action.

This is the core discipline:

FINAL CLAIM <= OBSERVABLE EVIDENCE
ACTION LANGUAGE <= ACTION TRACE

The transformation looks simple. The hard part is not drawing the arrows. The hard part is deciding what to preserve, what to synthesize, what state to observe, and which failure class is worth tracking.

If a lab wants useful field signal, it should not ask only for logs. Logs are necessary. They are not sufficient.

The useful object is a reduced eval seed: task definition; available tools and environment; initial state; user seed or simulation setup; moving parts; expected or ideal behavior; verifier or post-state requirements; failure family; final claim; observable evidence; human-risk note; privacy boundary; replay status.

This is not bureaucracy. It is how you turn real work into something a model team can actually reason about. The value is not in trace volume. The value is in reduction quality.

One high-quality reduced case with a locked oracle can be more useful than a hundred raw sessions nobody can replay, share, classify, or compare.

What product teams can learn

This method is not only for model training. It is product signal. A reduced agent failure tells you where the fix belongs: model behavior, system prompt, tool contract, runtime guardrail, UI status signal, recovery path, permission boundary, definition of done, human handoff, documentation, or eval coverage.

That distinction matters because many agent failures look like model failures when they are actually system failures. The runtime may have hidden the relevant state. The tool may have returned ambiguous success. The UI may have shown progress without evidence. The product may have lacked a safe stop state. The verifier may have checked the wrong thing. The human may have been asked to supervise without enough visibility.

Raw traces show the mess. Reduced evals tell you what to change.

The smallest useful eval seed

The smallest useful field-derived eval seed does not need to be huge. It needs to preserve the failure mechanism. A minimal seed can be:

1. User obligation
2. Initial environment state
3. Available tools
4. Expected safe behavior
5. Observable target state
6. Failure condition
7. Oracle
8. Human-risk note

That is enough to test whether the agent can close the loop between obligation, action, evidence, and claim. It is also enough to compare checkpoints.

Model A overclaims. Model B inspects state but fails recovery. Model C reports partial completion truthfully and asks for the missing input. Now you have a delta.

Not a vibe. Not a demo. Not a benchmark score detached from work. A behavioral delta on an operational failure class.

The new evaluation unit

For agents, the unit is not prompt -> answer. The unit is:

obligation -> action -> observed state -> truthful claim

Field failures reveal where that chain breaks. Eval seeds make the break reproducible. Oracles make the answer measurable. Checkpoint deltas make progress visible.

The conclusion

So raw traces are not evals. They are where evals are born.

The work is to reduce them without killing the signal: strip the private material, keep the causal structure, define the task, lock the oracle, replay across checkpoints, and separate model behavior from product and runtime failures.

The format is teachable. The signal is not. It comes from seeing enough real work to know which failures matter, and reducing them without laundering away the thing that broke.

Done well, messy field work becomes reliability signal a team can act on — reduced failures, observable state, and truthful claims instead of raw dumps or polished demos.

That is the missing layer.

The format is public: the eval-case v1 spec and one complete reduced case.

— Julien Talbot

This analysis comes from the field observatory — replayable cases, oracles, and failure families documented on Labs.