Managed instances, promotion paths, smoke checks, target-level rollback, and separation between dogfood, beta, and controlled rerun surfaces.
AI labs · proof and method
Turn agent failures into useful signal.
Labs documents the field observatory built around the Hermes fleet: opt-in usage, reduced traces, replayable cases, oracles, and recovery criteria. Not a leaderboard. Not a dump.
Field observatory
A living fleet, not a laboratory benchmark.
The value is not log volume. It is the ability to observe agents inside real work loops, then reduce the signal without exposing private material.
Minimal metadata: logical surface, model, endpoint, UTC window, local session, and provider-side trace joins when explicitly available.
Sessions become cases: obligation, environment, observed action, post-state, oracle, CAE failure family, and model/runtime/tool attribution.
No client dumps, no private messages, no secrets, no full local paths, no raw traces in public or commercial artifacts.
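As a rough illustration of the case fields and publication limits listed above, a reduced case could be held as a small record and scanned before it reaches a public or commercial artifact. This is a minimal sketch, not the Labs implementation: the `Case` class, the `publication_violations` helper, and both regular expressions are hypothetical, and a real check would need far broader coverage than two patterns.

```python
import re
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Case:
    """Hypothetical reduced case; field names mirror the list in the text."""
    obligation: str
    environment: str
    observed_action: str
    post_state: str
    oracle: str
    failure_family: str
    attribution: str  # assumed to be one of "model", "runtime", "tool"


# Illustrative patterns only; these are assumptions, not an exhaustive policy.
_FORBIDDEN = {
    "full local path": re.compile(
        r"(?:/(?:home|Users)/[^\s/]+/|[A-Za-z]:\\Users\\)"
    ),
    "secret-like token": re.compile(
        r"(?i)\b(?:api[_-]?key|secret|token|password)\s*[:=]\s*\S+"
    ),
}


def publication_violations(case: Case) -> list[str]:
    """Return one 'field: reason' entry per forbidden pattern found."""
    violations = []
    for field_name, value in asdict(case).items():
        for label, pattern in _FORBIDDEN.items():
            if pattern.search(value):
                violations.append(f"{field_name}: {label}")
    return violations
```

A case would be published only when the returned list is empty. Anything else goes back to reduction.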
Method
What makes a trace usable.
A useful trace shows what happened, what was attempted, what remains verifiable, and which product decision follows.
Operated Hermes fleet, dogfood and opt-in beta surfaces, canary, runtime isolation, controlled promotion, and documented rollback.
Situation, tool, state evidence, attempted action, blocker, final assertion, and human-risk note.
Intent, target state, reproduction, oracle, failure family, and the belief risk created for the human.
The same cases separate local progress, silent regression, surface change, and false success.
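One way to read the four outcomes above is as a comparison between a baseline run and a rerun of the same case. The sketch below is an assumption about how they could be told apart, not the definitions Labs uses: `Run`, `Verdict`, `compare`, the `surface_id` fingerprint, and the order of the checks are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    LOCAL_PROGRESS = "local progress"
    SILENT_REGRESSION = "silent regression"
    SURFACE_CHANGE = "surface change"
    FALSE_SUCCESS = "false success"
    UNCHANGED = "unchanged"


@dataclass(frozen=True)
class Run:
    surface_id: str        # assumed fingerprint of tools, endpoint, and version
    oracle_passed: bool    # did the observable post-state satisfy the oracle?
    claimed_success: bool  # did the agent's final assertion claim success?


def compare(baseline: Run, rerun: Run) -> Verdict:
    """Classify a rerun against its baseline (assumed ordering of checks)."""
    # A different surface means the two runs are not comparable like for like.
    if rerun.surface_id != baseline.surface_id:
        return Verdict.SURFACE_CHANGE
    # The agent reports success but the oracle disagrees.
    if rerun.claimed_success and not rerun.oracle_passed:
        return Verdict.FALSE_SUCCESS
    # The case used to pass and no longer does.
    if baseline.oracle_passed and not rerun.oracle_passed:
        return Verdict.SILENT_REGRESSION
    # The case used to fail and now passes.
    if not baseline.oracle_passed and rerun.oracle_passed:
        return Verdict.LOCAL_PROGRESS
    return Verdict.UNCHANGED
```

The surface check comes first on purpose: a pass or fail on a changed surface says little about the model until the case is rerun on a matching one.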
Failure families
What a trace makes decidable.
Five reduced, anonymized cases, sorted by where the loop breaks. The last is a positive control: a truthful blocker, which is the behavior to reinforce.
Proof format
Field-Signal Pilot.
A short offer for turning privacy-bounded field signal into material a post-training, evals, or product team can actually use. Organization deployment stays on Ergonomia.
A short pilot optimizes for fresh, deduplicated, privacy-safe cases that are classified and tied to an observable oracle.
Each case makes the task, tools, initial state, expected behavior, verifiable post-state, and product decision explicit.
Useful failure signal says where to act: model behavior, tool contract, harness, permission, memory, supervision, or definition of done.
The pilot sells reduction and reproducibility. Raw data, client content, and private exchanges stay off the surface.
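The selection criteria above (fresh, deduplicated, classified, tied to an oracle) can be pictured as a simple filter over candidate cases. This is a sketch under stated assumptions: the `Candidate` record, the `fingerprint` and `select` helpers, the 30-day freshness window, and deduplication by a hash of task plus expected behavior are all hypothetical. The privacy check is out of scope here and would run separately.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    task: str
    expected_behavior: str
    observed_at: datetime          # timezone-aware, UTC
    failure_family: Optional[str]  # None means not yet classified
    oracle: Optional[str]          # None means no observable oracle


def fingerprint(c: Candidate) -> str:
    """Hash of the normalized task and expected behavior (assumed dedup key)."""
    normalized = " ".join(f"{c.task}|{c.expected_behavior}".lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def select(candidates: list[Candidate], now: datetime,
           max_age: timedelta = timedelta(days=30)) -> list[Candidate]:
    """Keep fresh, classified, oracle-backed cases; drop near-identical repeats."""
    seen: set[str] = set()
    kept = []
    for c in candidates:
        if now - c.observed_at > max_age:
            continue  # stale
        if not c.failure_family or not c.oracle:
            continue  # unclassified or not verifiable
        key = fingerprint(c)
        if key in seen:
            continue  # duplicate of a case already kept
        seen.add(key)
        kept.append(c)
    return kept
```

Only the first occurrence of each fingerprint is kept, so the order of `candidates` decides which duplicate survives.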
Contact
Email Julien
For an intervention, a talk, or a conversation about AI and real work, send the situation, the decision to make, and the visible constraint.