AI labs & software products

Turn agent and AI-software failures into useful signal.

I work with AI labs, product teams, and companies building software where AI changes delegation, verification, interpretation, or responsibility. The entry point is not software in general. It is the real activity the system transforms, supports, or makes more fragile.

What I bring to agent and product teams

Claim-action-evidence discipline

Most agent evaluations stop too early: at a tool call, a plausible answer, or output that merely looks like a finished task. I help teams evaluate whether the final claim is supported by a real action and by recent, relevant, scope-bound runtime evidence.

Runtime self-awareness

Agents often misrepresent their own runtime: tools, memory, permissions, files, profiles, logs, and boundaries. I help define cases and guardrails that force agents to inspect the actual execution surface instead of answering from priors.

Human belief risk

A final answer changes the human's operational model. The hard production risk is not ordinary failure; it is a false success, false blocker, wrong target state, or unsupported capability claim that makes the human stop monitoring too early.

Production-readiness envelopes

I frame readiness as a bounded envelope: task class, user context, toolset, runtime guardrails, fallback policy, autonomy tier, and recovery conditions. That makes checkpoint and product comparisons diagnostic rather than leaderboard-like.

Recent anonymized work

Turning real traces into replayable cases.

Recent work with a post-training team at an AI lab turned production-adjacent agent traces into failure families, replayable cases, oracles, and regression matrices. The goal is to read reliability as activity diagnosis, not as an abstract score.

100 structured candidate cases drawn from real agentic-use situations.
24 replay-ready, oracle-locked cases covering eight daily-use axes.
72 attempts analyzed in a repeated protocol to separate success, truthful blockers, and silent failure.

When to work together

This collaboration is useful when agent, copilot, or product traces show repeated failure families: false completion, runtime misreading, weak recovery after tool errors, wrong target state, or final claims unsupported by observed state.

The output is not another score. It is a diagnostic read: which failures persist, on which surfaces, under which reasoning modes, and which guardrails preserve operational truth.

Start with the work situation

For talks, strategic discussions, or collaboration on AI agents and real work systems, email me directly. If the topic is still fuzzy, the guided request helps you pin down the relevant context.