Making Grok Act: Notes From a Production System-Prompt Fix

Jun 12, 2026 · Agentic AI


At 11 PM, I was auditing a token leak on a client deployment. I handed the task to Grok running inside a Hermes Agent loop.

It narrated for ten minutes.

“I will check the logs now.” It didn’t check the logs. “Let me examine the cron configuration.” No tool call. I corrected it three times. Its final proposal was a budget_guard cron running every 3 minutes — a token-consuming solution to a token-consumption problem.

I’m a cognitive ergonomist; I’ve written elsewhere about why this failure mode is a gulf of execution (Norman, 1986): the operator formulates intent but never bridges to action. This note is about what happened when I stopped describing the problem and patched it — and what I can and cannot claim about the patch.

The patch

GROK_EXECUTION_GUIDANCE is an XML-tagged system block that forbids intent phrases and forces tool-first behavior. The shape (now upstream in hermes-agent’s prompt builder as tool-use enforcement guidance):

<tool_persistence>
- Use tools whenever they improve correctness, completeness, or grounding.
- Do not stop early when another tool call would materially improve the result.
- If a tool returns empty or partial results, retry with a different query or
  strategy before giving up.
- Keep calling tools until: (1) the task is complete, AND (2) you have verified
  the result.
</tool_persistence>

<mandatory_tool_use>
NEVER answer these from memory or mental computation — ALWAYS use a tool:
- Arithmetic, math, calculations → terminal or execute_code
- Hashes, encodings, checksums → terminal
- Current time, date, timezone → terminal
- System state: OS, CPU, memory, disk, ports, processes → terminal
- File contents, sizes, line counts → read_file, search_files, or terminal
- Git history, branches, diffs → terminal
- Current facts (weather, news, versions) → web_search
</mandatory_tool_use>

The principle is older than LLMs: don’t let the operator trust working memory when the actual system state is one observation away (Reason, 1990 — rule-based errors; Baddeley, 1986 — working memory limits).
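The failure described at the top (announcing an action, then issuing no tool call) can be detected mechanically in a transcript. The following is a minimal sketch, not part of hermes-agent: the `Turn` structure, the function names, and the phrase list are all assumptions made for illustration, since the post does not publish the exact phrases the block forbids.

```python
import re
from dataclasses import dataclass, field

# Hypothetical phrase list; the actual forbidden phrases are not given in the post.
INTENT_PATTERN = re.compile(
    r"\b(i will|i'll|let me|i am going to|i'm going to)\b",
    re.IGNORECASE,
)


@dataclass
class Turn:
    """One assistant turn: visible text plus the names of any tools it called."""
    text: str
    tool_calls: list[str] = field(default_factory=list)


def is_narration_only(turn: Turn) -> bool:
    """True when the turn announces an action but issues no tool call."""
    normalized = turn.text.replace("\u2019", "'")  # fold curly apostrophes
    return bool(INTENT_PATTERN.search(normalized)) and not turn.tool_calls


def count_narration_turns(turns: list[Turn]) -> int:
    """Number of turns that state intent without acting on it."""
    return sum(is_narration_only(t) for t in turns)
```

A turn such as "Let me examine the cron configuration." with an empty `tool_calls` list is flagged, while the same sentence accompanied by a `read_file` call is not. A phrase match alone is a crude proxy, so a count like this is better read as a screening signal than as a measurement.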

What I measured

Same task, same session, same model, with the block injected:

| Metric | Before | After |
| --- | --- | --- |
| Wall clock to fix | ~10 min of narrated planning | 33.9 s |
| User corrections | 3 | 0 |
| API calls | many (narration loop) | 2 |

One case. Not a benchmark — a repro.

Two corpus-level measures frame why this matters beyond one session:

Corpus: 150+ sessions, 12+ managed Hermes Agent instances (April 2026), metadata-only capture, no client content.

The A/B I designed — and the run I refused to count

A single repro plus corpus correlations is not causal evidence. So I set up a paired A/B on the current runtime:

| Condition | Prompt block | Runtime claim guard |
| --- | --- | --- |
| C0_current_stock | no | no |
| G1_prompt_caed | yes | no |

Same entrypoint, same model and reasoning level, 2 conditions × 14 cases × 3 repeats = 84 attempts.
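The arithmetic of that design can be made explicit by enumerating the attempt grid. This is a sketch under stated assumptions: the condition names are copied verbatim from the post, while the case IDs and the dictionary layout are hypothetical placeholders, not the real harness format.

```python
from itertools import product

CONDITIONS = ["C0_current_stock", "G1_prompt_caed"]  # names as given in the post
CASE_IDS = [f"case_{i:02d}" for i in range(1, 15)]   # hypothetical IDs for the 14 cases
REPEATS = 3


def build_attempts():
    """Enumerate every (case, repeat, condition) cell of the paired design."""
    # Condition varies fastest, so the two arms of each pair sit next to each other.
    return [
        {"case": case, "repeat": rep, "condition": cond}
        for case, rep, cond in product(CASE_IDS, range(1, REPEATS + 1), CONDITIONS)
    ]


attempts = build_attempts()
print(len(attempts))  # 2 conditions x 14 cases x 3 repeats
```

Keeping the two conditions adjacent for each (case, repeat) pair is one way to make the pairing visible in the run log; the actual execution order of the real harness is not stated in the post.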

The first launch produced 42 attempts that all failed with hermes_session_not_found — a harness regression, not model behavior. That run is marked invalid in my notes with the sentence: “must not be interpreted as model behavior.” If you’ve ever read a benchmark paper and wondered how many invalid runs silently became data points, this is how it happens — the failure looked superficially like the model timing out.
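One way to keep such a run out of the results is a validity gate that separates harness faults from model outcomes before anything is scored. The sketch below is an assumption about how that could look: only the error code `hermes_session_not_found` comes from the post, and the record layout and function names are hypothetical.

```python
# Error codes treated as harness faults rather than model behavior.
# Only hermes_session_not_found is taken from the post; the set is meant to be extended.
HARNESS_ERRORS = {"hermes_session_not_found"}


def split_attempts(attempts):
    """Separate attempts the model actually ran from attempts the harness broke."""
    harness = [a for a in attempts if a.get("error") in HARNESS_ERRORS]
    usable = [a for a in attempts if a.get("error") not in HARNESS_ERRORS]
    return usable, harness


def run_verdict(attempts):
    """Label a run before any of its attempts are scored."""
    usable, harness = split_attempts(attempts)
    if not usable:
        # Nothing reached the model (this also covers an empty run).
        return "invalid: must not be interpreted as model behavior"
    if harness:
        return f"partial: {len(harness)} harness failures excluded"
    return "valid"
```

A run in which every attempt carries the harness error is labeled invalid as a whole, which mirrors the note quoted above; a mixed run is kept, but the harness failures are counted and excluded rather than scored as model failures.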

The corrected paired run is documented and launched; consolidated deltas are not yet public. So the honest claim hierarchy today is:

  1. Proven: one before/after repro (33.9 s, 0 corrections) with the exact block published above.
  2. Measured: corpus-level feedforward gap (20.7% vs 6.9%) and 124 passing tests around the block.
  3. Designed, not yet claimable: the causal A/B delta. When it lands, it lands with the numbers — not before.

That ordering is the whole discipline: final claim ≤ observable evidence.

Where it does not work

The block fixes the gulf of execution. It does not fix:

Upstream trail

Everything here is checkable without trusting me:

The failure isn’t in the code. It’s in the cognition — and the fix is measurable.

— Julien Talbot

This analysis comes from the field observatory — replayable cases, oracles, and failure families documented on Labs.

Author

Julien Talbot, activity ergonomist, founder of Ergonomia. Public talks, field work, writing, AI labs — grounded in real work.
