Production blockers · eval-case v1
Field signal for real agents.
Agent Ergonomics = human factors for AI operators. Blockers observed in non-developer work, reduced into eval-ready repros: claim, action, evidence, oracle, boundary. I identify the moments where an agent makes a claim without usable evidence.
This signal comes from a fleet of agents operated in real conditions.
case/0012 · hermes-agent
A reduced, inspectable trace.
A real blocker: the agent prepares a branch and claims "done" with no test-execution evidence. The case is reduced into an eval-ready repro, public and downloadable.
- FCE family — false completion after a tool boundary
- Metadata-only capture: no client content, no secrets, no full paths
- Public JSONL, conformant to the eval-case v1 schema
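As a rough illustration of what one line of such a JSONL file might look like, the sketch below parses a single record into the five parts named earlier: claim, action, evidence, oracle, boundary. The published eval-case v1 schema is not reproduced here, so the exact key names, the `case_id` key, and every sample value are assumptions made for illustration only.

```python
import json
from dataclasses import dataclass
from typing import List

# The five parts named on this page. The exact key spelling is an assumption.
REQUIRED_FIELDS = ("claim", "action", "evidence", "oracle", "boundary")


@dataclass(frozen=True)
class EvalCase:
    case_id: str         # hypothetical key, not confirmed by the schema
    claim: str           # what the agent asserted
    action: str          # what the agent actually did
    evidence: List[str]  # evidence present at the moment of the claim
    oracle: str          # evidence required before the claim is acceptable
    boundary: str        # where the claim was made relative to a tool call


def parse_case(line: str) -> EvalCase:
    """Parse one JSONL line and reject records missing a required part."""
    record = json.loads(line)
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return EvalCase(
        case_id=record.get("case_id", ""),
        claim=record["claim"],
        action=record["action"],
        evidence=list(record["evidence"]),
        oracle=record["oracle"],
        boundary=record["boundary"],
    )


# Hypothetical sample values: metadata only, no content, secrets, or paths.
SAMPLE_LINE = json.dumps({
    "case_id": "case/0012",
    "claim": "done",
    "action": "prepare_branch",
    "evidence": [],
    "oracle": "test_execution",
    "boundary": "after_tool_call",
})

case = parse_case(SAMPLE_LINE)
```

Here `case.evidence` is empty while `case.oracle` names a requirement, which is the shape of the blocker described above: a "done" claim with nothing behind it.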
Reduced into an eval
Same case, three angles.
A field blocker is only useful once reduced. Trace, oracle, replay: the failure moment is made replayable and testable, without the raw session.
- Trace — the claim, the action, the observed state
- Oracle — the evidence required before any "done"
- Replay — the reproducible verdict and the gate it triggers
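The three angles can be sketched as one small, pure function: the trace supplies the claim and the observed evidence, the oracle names what is required, and replaying the same inputs yields the same verdict. This is a minimal sketch under stated assumptions, not the project's implementation. The names `Verdict`, `evaluate_claim`, the gate label `block_merge`, and the evidence label `test_execution` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Verdict:
    passed: bool
    gate: str                 # hypothetical gate label triggered by a failure
    missing: Tuple[str, ...]  # required evidence that was not observed


def evaluate_claim(claim: str,
                   required: Iterable[str],
                   observed: Iterable[str]) -> Verdict:
    """Oracle: a "done" claim passes only if all required evidence was observed."""
    if claim != "done":
        # Only completion claims are gated in this sketch.
        return Verdict(True, "none", ())
    missing = tuple(sorted(set(required) - set(observed)))
    if missing:
        return Verdict(False, "block_merge", missing)
    return Verdict(True, "none", ())


# Trace: the claim, the action, the observed state (hypothetical values).
trace = {"claim": "done", "action": "prepare_branch", "observed": []}
required = ["test_execution"]

# Replay: the function is pure, so a second run reproduces the verdict.
first = evaluate_claim(trace["claim"], required, trace["observed"])
replayed = evaluate_claim(trace["claim"], required, trace["observed"])
```

Keeping the oracle free of side effects is what makes the replay meaningful: the verdict depends only on the reduced trace, never on the raw session.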
Blocker families
Where the agent claims without proof.
The recurring work: catching the moment an agent crosses a tool boundary and claims a result that nothing attests. That moment is what becomes a case.
- Claim without usable evidence — the core of the signal
- Tool boundary crossed without verification
- Human risk: merging work called "done" but unverified
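One way to picture the catch is a single pass over a reduced event list, flagging any claim made after a tool call with no verification in between. The event kinds (`tool_call`, `verification`, `claim`) and the function name are assumptions for illustration. The page does not specify how events are encoded.

```python
from typing import Dict, List


def find_unattested_claims(events: List[Dict[str, str]]) -> List[int]:
    """Return indices of claims made after a tool call with no verification since."""
    flagged = []
    boundary_open = False  # True once a tool call has happened and nothing verified it
    for index, event in enumerate(events):
        kind = event["kind"]
        if kind == "tool_call":
            boundary_open = True
        elif kind == "verification":
            boundary_open = False
        elif kind == "claim" and boundary_open:
            flagged.append(index)
    return flagged


# Hypothetical reduced traces: metadata only, no content or paths.
unverified = [
    {"kind": "tool_call", "name": "prepare_branch"},
    {"kind": "claim", "name": "done"},
]
verified = [
    {"kind": "tool_call", "name": "prepare_branch"},
    {"kind": "verification", "name": "test_execution"},
    {"kind": "claim", "name": "done"},
]

flagged_unverified = find_unattested_claims(unverified)
flagged_verified = find_unattested_claims(verified)
```

In this sketch a claim made before any tool call is not flagged, since the signal described here is specifically about claims that follow a tool boundary.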
Record
Verifiable without trusting this page.
The fleet runs Hermes Agent, by Nous Research, in real work loops. Field findings can travel back into upstream issue and pull-request workflows without exposing private traces.
A field-observed tool-use signal was reduced into an upstream issue pattern, then adopted and merged by the hermes-agent maintainer.
Credit · Conversation-cache continuity: credited upstream. A field-tested cache-continuity patch was credited and cherry-picked into the hermes-agent mainline.
Corpus · Reduced traces, not raw sessions. The public framing behind the eval format and privacy boundary.
Work the field signal
A blocker to reduce?
Inspect a case, talk through a set of blockers, or scope an eval format. Scope, privacy boundary, and deliverable are agreed first.