AI labs · proof and method

Turn agent failures into useful signal.

Labs documents the field observatory built around the Hermes fleet: opt-in usage, reduced traces, replayable cases, oracles, and recovery criteria. Not a leaderboard. Not a dump.

Field observatory

A living fleet, not a laboratory benchmark.

The value is not log volume. It is the ability to observe agents inside real work loops, then reduce the signal without exposing private material.

Substrate · Operated Hermes fleet.

Managed instances, promotion paths, smoke checks, target-level rollback, and separation between dogfood, beta, and controlled rerun surfaces.

Capture · Signal born reproducible.

Minimal metadata: logical surface, model, endpoint, UTC window, local session, and provider-side trace joins when explicitly available.

Reduction · Raw material stays local.

Sessions become cases: obligation, environment, observed action, post-state, oracle, CAE failure family, and model/runtime/tool attribution.
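
A minimal sketch of what such a case record could look like. The field names follow the list above; the class names, the types, and every example value are assumptions for illustration, not the Labs schema.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Attribution:
    # Free-text labels in this sketch; a real schema may constrain them.
    model: str
    runtime: str
    tool: str


@dataclass(frozen=True)
class Case:
    obligation: str        # what the agent was expected to do
    environment: str       # reduced description of the surface, no local paths
    observed_action: str   # what the agent actually did
    post_state: str        # state that can still be verified afterwards
    oracle: str            # how the post-state is checked
    failure_family: str    # CAE failure family label (hypothetical value below)
    attribution: Attribution


# Invented example, not a real field case.
example = Case(
    obligation="update the config and report the result",
    environment="beta surface, sandboxed workspace",
    observed_action="reported success without writing the file",
    post_state="config unchanged",
    oracle="file hash differs from the pre-run hash",
    failure_family="false-success",
    attribution=Attribution(model="model-x", runtime="runtime-y", tool="file-write"),
)

record = asdict(example)  # nested dict, ready for serialization
```

Freezing the dataclasses keeps a reduced case immutable once written, which suits a record meant to be replayed rather than edited.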

Boundary · Privacy-bound by design.

No client dumps, no private messages, no secrets, no full local paths, no raw traces in public or commercial artifacts.
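
One way to picture part of that boundary is a reduction pass that runs before anything leaves the local machine. The sketch below is an assumption, not the Labs pipeline: the two patterns are hypothetical, cover only POSIX-style absolute paths and simple `key=value` secrets, and would miss many real cases.

```python
import re

# Assumed pattern: absolute POSIX paths not preceded by a word char, '.', ':' or '/'
# (so URLs such as https://host/a/b are left alone).
ABS_PATH = re.compile(r"(?<![\w.:/])/(?:[\w.-]+/)+([\w.-]+)")
# Assumed pattern: a few common secret keys followed by '=' or ':' and a value.
SECRET = re.compile(r"(?i)\b(api[_-]?key|token|password)\s*[=:]\s*\S+")


def reduce_line(line: str) -> str:
    """Redact secret values, then shorten absolute paths to their last component."""
    line = SECRET.sub(lambda m: f"{m.group(1)}=<redacted>", line)
    return ABS_PATH.sub(lambda m: f"<path>/{m.group(1)}", line)


print(reduce_line("wrote /home/alice/project/out.json with token=abc123"))
# prints: wrote <path>/out.json with token=<redacted>
```

A pattern filter like this only reduces exposure; it does not replace keeping raw material local.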

Method

What makes a trace usable.

A useful trace shows what happened, what was attempted, what remains verifiable, and which product decision follows.

Fleet observatory · Observe agent loops in real work.

Operated Hermes fleet, dogfood and opt-in beta surfaces, canary, runtime isolation, controlled promotion, and documented rollback.

Trace index · Reduce traces without losing context.

Situation, tool, state evidence, attempted action, blocker, final assertion, and human-risk note.

Case cards · Turn failures into readable cases.

Intent, target state, reproduction, oracle, failure family, and the belief risk created for the human.

Checkpoint deltas · Compare without mistaking motion for progress.

The same cases separate local progress, silent regression, surface change, and false success.
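
The four labels can be sketched as a comparison of one case across two checkpoints. The rules below are assumptions made for illustration: the page does not define the labels, and `NO_CHANGE` is an extra bucket added here so the function is total.

```python
from enum import Enum


class Delta(Enum):
    LOCAL_PROGRESS = "local progress"
    SILENT_REGRESSION = "silent regression"
    SURFACE_CHANGE = "surface change"
    FALSE_SUCCESS = "false success"
    NO_CHANGE = "no change"  # not in the source; added for completeness


def classify(before_pass: bool, after_pass: bool,
             after_claims_success: bool, output_changed: bool) -> Delta:
    """Label one case across two checkpoints (assumed rules)."""
    # A claimed success that the oracle rejects outranks every other label.
    if after_claims_success and not after_pass:
        return Delta.FALSE_SUCCESS
    if not before_pass and after_pass:
        return Delta.LOCAL_PROGRESS
    if before_pass and not after_pass:
        return Delta.SILENT_REGRESSION
    # Same oracle verdict, different visible output: motion, not progress.
    if output_changed:
        return Delta.SURFACE_CHANGE
    return Delta.NO_CHANGE
```

Under these assumptions, `classify(True, True, True, True)` yields `Delta.SURFACE_CHANGE`: the output moved, the oracle verdict did not.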

Failure families

What a trace makes decidable.

Five reduced, anonymized cases, sorted by where the loop breaks. The last is a positive control: a truthful blocker, which is the behavior to reinforce.

Proof format

Field-Signal Pilot.

A short offer for turning privacy-bounded field signal into material a post-training, evals, or product team can actually use. Organization deployment stays on Ergonomia.

Scope · 3 to 5 high-quality cases.

A short pilot optimizes for fresh, deduplicated, privacy-safe cases that are classified and tied to an observable oracle.

Format · Case cards + JSONL + oracles.

Each case makes the task, tools, initial state, expected behavior, verifiable post-state, and product decision explicit.
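
As a rough illustration of the JSONL part, the sketch below writes one case per line and reads it back. The key names mirror the six items above, but the exact keys and the example values are assumptions, not the pilot's delivered format.

```python
import json

# Invented example record; keys follow the six items listed above.
cases = [
    {
        "task": "rename a branch and confirm the rename",
        "tools": ["shell", "git"],
        "initial_state": "branch 'old' exists, branch 'new' does not",
        "expected_behavior": "rename the branch, or report a blocker",
        "post_state": "branch 'new' exists, branch 'old' does not",
        "product_decision": "require a post-state check before reporting done",
    },
]


def to_jsonl(records: list[dict]) -> str:
    """Serialize one JSON object per line, the JSONL convention."""
    return "\n".join(json.dumps(r, ensure_ascii=False, sort_keys=True) for r in records)


def from_jsonl(text: str) -> list[dict]:
    """Parse JSONL back into records, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Round trip: what is written can be read back unchanged.
assert from_jsonl(to_jsonl(cases)) == cases
```

One object per line keeps each case independently parseable, so a consumer can stream, filter, or deduplicate cases without loading the whole file.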

Attribution · Separate model, tool, and runtime.

Useful failure signal says where to act: model behavior, tool contract, harness, permission, memory, supervision, or definition of done.
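
A small sketch of how those seven targets could be encoded and counted across a batch of cases. The enum name `ActOn`, the `tally` helper, and the sample labels are hypothetical.

```python
from collections import Counter
from enum import Enum


class ActOn(Enum):
    # The seven targets named above.
    MODEL_BEHAVIOR = "model behavior"
    TOOL_CONTRACT = "tool contract"
    HARNESS = "harness"
    PERMISSION = "permission"
    MEMORY = "memory"
    SUPERVISION = "supervision"
    DEFINITION_OF_DONE = "definition of done"


def tally(attributions: list[ActOn]) -> list[tuple[str, int]]:
    """Count cases per attribution target, most frequent first."""
    return [(target.value, n) for target, n in Counter(attributions).most_common()]


# Invented labels for three cases.
labels = [ActOn.TOOL_CONTRACT, ActOn.MODEL_BEHAVIOR, ActOn.TOOL_CONTRACT]
print(tally(labels))
# prints: [('tool contract', 2), ('model behavior', 1)]
```

A closed set of targets forces each case to name one place to act, which is what makes the tally readable as a list of priorities.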

Boundary · A format offer, not a data dump.

The pilot sells reduction and reproducibility. Raw data, client content, and private exchanges stay off the surface.

Contact

Email Julien

For an intervention, a talk, or a conversation about AI and real work, send the situation, the decision to make, and the visible constraint.