Jun 12, 2026

Making Grok Act: Notes From a Production System-Prompt Fix

An XML system block that forbids intent narration: one repro (10 min → 33.9 s), corpus measures, an honest A/B, and what the patch cannot fix.

At 11 PM, I was auditing a token leak on a client deployment. I handed the task to Grok running inside a Hermes Agent loop.

It narrated for ten minutes.

“I will check the logs now.” It didn’t check the logs. “Let me examine the cron configuration.” No tool call. I corrected it three times. Its final proposal was a budget_guard cron running every 3 minutes — a token-consuming solution to a token-consumption problem.

I’m a cognitive ergonomist; I’ve written elsewhere about why this failure mode is a gulf of execution (Norman, 1986): the operator formulates intent but never bridges to action. This note is about what happened when I stopped describing the problem and patched it — and what I can and cannot claim about the patch.

The patch

GROK_EXECUTION_GUIDANCE is an XML-tagged system block that forbids intent phrases and forces tool-first behavior. The shape (now upstream in hermes-agent’s prompt builder as tool-use enforcement guidance):

<tool_persistence>
- Use tools whenever they improve correctness, completeness, or grounding.
- Do not stop early when another tool call would materially improve the result.
- If a tool returns empty or partial results, retry with a different query or
  strategy before giving up.
- Keep calling tools until: (1) the task is complete, AND (2) you have verified
  the result.
</tool_persistence>

<mandatory_tool_use>
NEVER answer these from memory or mental computation — ALWAYS use a tool:
- Arithmetic, math, calculations → terminal or execute_code
- Hashes, encodings, checksums → terminal
- Current time, date, timezone → terminal
- System state: OS, CPU, memory, disk, ports, processes → terminal
- File contents, sizes, line counts → read_file, search_files, or terminal
- Git history, branches, diffs → terminal
- Current facts (weather, news, versions) → web_search
</mandatory_tool_use>

The principle is older than LLMs: don’t let the operator trust working memory when the actual system state is one observation away (Reason, 1990 — rule-based errors; Baddeley, 1986 — working memory limits).
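As a rough illustration of how a block like this can be wired in, here is a minimal sketch that appends the guidance to a base system prompt only for Grok-family models. The function name `build_system_prompt`, the substring check on the model name, and the abbreviated constant are assumptions for illustration; this is not hermes-agent's actual prompt builder.

```python
# Hypothetical sketch: these names are illustrative, not hermes-agent's API.

# Abbreviated for the example; in practice this would hold the full block above.
GROK_EXECUTION_GUIDANCE = """<tool_persistence>
- Keep calling tools until: (1) the task is complete, AND (2) you have verified
  the result.
</tool_persistence>"""


def build_system_prompt(base_prompt: str, model_name: str) -> str:
    """Append the guidance block only for model families assumed to need it."""
    # Assumption: a substring match on the model name is enough to select Grok.
    if "grok" not in model_name.lower():
        return base_prompt
    return f"{base_prompt.rstrip()}\n\n{GROK_EXECUTION_GUIDANCE}"
```

Keeping the block in a single constant and gating it on the model family means other models see an unchanged prompt, which makes before/after comparisons easier to reason about.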

What I measured

Same task, same session, same model, with the block injected:

| Metric | Before | After |
| --- | ---: | ---: |
| Wall clock to fix | ~10 min of narrated planning | 33.9 s |
| User corrections | 3 | 0 |
| API calls | many (narration loop) | 2 |

One case. Not a benchmark — a repro.

Two corpus-level measures frame why this matters beyond one session:

- Rate of “text + tool calls in the same turn” across my session corpus: Claude ~20.7 %, Grok ~6.9 %. Grok returns pure reasoning where Claude chains read → run → patch in one response.
- The block has since run in 47 sessions, with 124 tests passing in the harness that wraps it.

Corpus: 150+ sessions, 12+ managed Hermes Agent instances (April 2026), metadata-only capture, no client content.
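For readers who want to compute a comparable same-turn rate on their own logs, here is a minimal sketch over metadata-only turn records. The `TurnMeta` fields and the choice of denominator (all assistant turns for a model) are assumptions; the exact definition behind the figures above is not specified here.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class TurnMeta:
    """Metadata-only record of one assistant turn (hypothetical schema)."""
    model: str
    has_text: bool
    tool_call_count: int


def mixed_turn_rate(turns: Iterable[TurnMeta], model: str) -> float:
    """Share of a model's turns that carry both text and at least one tool call."""
    own = [t for t in turns if t.model == model]
    if not own:
        return 0.0  # no turns for this model: report 0 rather than divide by zero
    mixed = sum(1 for t in own if t.has_text and t.tool_call_count > 0)
    return mixed / len(own)
```

Because the record holds only flags and counts, a measure like this can be computed without storing any message content.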

The A/B I designed — and the run I refused to count

A single repro plus corpus correlations is not causal evidence. So I set up a paired A/B on the current runtime:

| Condition | Prompt block | Runtime claim guard |
| --- | ---: | ---: |
| C0_current_stock | no | no |
| G1_prompt_caed | yes | no |

Same entrypoint, same model and reasoning level, 2 conditions × 14 cases × 3 repeats = 84 attempts.
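The attempt matrix can be enumerated mechanically, which makes the count easy to check. A minimal sketch follows; the condition names come from the table, while the case identifiers and the dictionary layout are placeholders I made up for illustration.

```python
from itertools import product

# Condition flags taken from the table above.
CONDITIONS = {
    "C0_current_stock": {"prompt_block": False, "claim_guard": False},
    "G1_prompt_caed": {"prompt_block": True, "claim_guard": False},
}
CASES = [f"case_{i:02d}" for i in range(1, 15)]  # 14 placeholder case ids
REPEATS = 3


def build_attempts() -> list:
    """Return one record per (condition, case, repeat) combination."""
    return [
        {"condition": name, "case": case, "repeat": rep, **CONDITIONS[name]}
        for name, case, rep in product(CONDITIONS, CASES, range(1, REPEATS + 1))
    ]


attempts = build_attempts()
print(len(attempts))  # 2 conditions x 14 cases x 3 repeats = 84
```

Generating the matrix up front also gives each attempt a stable identity, so a missing or duplicated attempt is visible before any result is interpreted.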

The first launch produced 42 attempts that all failed with hermes_session_not_found — a harness regression, not model behavior. That run is marked invalid in my notes with the sentence: “must not be interpreted as model behavior.” If you’ve ever read a benchmark paper and wondered how many invalid runs silently became data points, this is how it happens — the failure looked superficially like the model timing out.
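One way to keep such a run out of the data is to classify harness failures before any scoring happens. The sketch below assumes each attempt record carries an `error` field (`None` when the attempt completed); that schema and the three labels are my own illustration, and only the error string `hermes_session_not_found` comes from the run described above.

```python
# Errors that originate in the harness, not in the model (assumed list).
HARNESS_ERRORS = {"hermes_session_not_found"}


def run_validity(attempts: list) -> str:
    """Label a run before scoring: 'invalid', 'partial', or 'valid'."""
    if not attempts:
        return "invalid"  # nothing was observed, so nothing can be claimed
    harness_failures = [a for a in attempts if a.get("error") in HARNESS_ERRORS]
    if len(harness_failures) == len(attempts):
        return "invalid"  # every attempt died in the harness
    if harness_failures:
        return "partial"  # score only the attempts that reached the model
    return "valid"


def scorable(attempts: list) -> list:
    """Drop attempts that failed in the harness so they never become data points."""
    return [a for a in attempts if a.get("error") not in HARNESS_ERRORS]
```

The point of the explicit label is that an invalid run has to be discarded by name instead of being averaged in as slow or failed model behavior.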

The corrected paired run is documented and launched; consolidated deltas are not yet public. So the honest claim hierarchy today is:

- Proven: one before/after repro (33.9 s, 0 corrections) with the exact block published above.
- Measured: the corpus-level feedforward gap, meaning the rate of text + tool calls in the same turn (Claude ~20.7 % vs Grok ~6.9 %), and 124 passing tests around the block.
- Designed, not yet claimable: the causal A/B delta. When it lands, it lands with the numbers, not before.

That ordering is the whole discipline: final claim ≤ observable evidence.

Where it does not work

The block fixes the gulf of execution. It does not fix:

- Empty-response hangs: Grok emitting reasoning tokens with zero text. That needs runtime detection and a one-shot nudge, not prompt text.
- Plan continuation errors: retrying the exact same failed call. Forcing tool use doesn’t teach adaptation; it can make the same wrong call faster.
- Over-tooling: mandatory tool use means the model shells out for arithmetic a human would do in their head. On trivial queries this costs latency and tokens. It’s the right trade in production loops, the wrong one in chat.
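The first two limits are runtime concerns, so here is a minimal sketch of what a runtime guard could look like: it detects an empty turn, allows exactly one nudge, and flags a tool call that repeats an earlier failed call. The class name `LoopGuard`, the action strings, and the signature scheme are all hypothetical; this is not how Hermes Agent implements it.

```python
import json
from typing import Any, Dict, List, Optional


def is_empty_response(text: Optional[str], tool_calls: List[Any]) -> bool:
    """True when a turn carries neither visible text nor any tool call."""
    return not (text or "").strip() and not tool_calls


def call_signature(name: str, args: Dict[str, Any]) -> str:
    """Stable key for a tool call, independent of argument order."""
    return f"{name}:{json.dumps(args, sort_keys=True)}"


class LoopGuard:
    """Tracks one session: a single allowed nudge plus known failed calls."""

    def __init__(self) -> None:
        self.nudged = False
        self.failed = set()

    def on_empty_response(self) -> str:
        """Return 'nudge' the first time, 'abort' on any later empty turn."""
        if self.nudged:
            return "abort"
        self.nudged = True
        return "nudge"

    def record_failure(self, name: str, args: Dict[str, Any]) -> None:
        self.failed.add(call_signature(name, args))

    def repeats_failure(self, name: str, args: Dict[str, Any]) -> bool:
        """True when the model is about to retry an identical failed call."""
        return call_signature(name, args) in self.failed
```

A guard like this only detects the two conditions; what the loop does next (inject a nudge message, ask for a different strategy, or stop) is a separate policy decision.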

Upstream trail

Everything here is checkable without trusting me:

- Tool-use enforcement for Grok: proposed in #5531, adopted and merged by the hermes-agent maintainer in #5595 (“Closes #5531”).
- xAI prompt caching: proposed in #5548, cherry-picked upstream in #5604; the PR body credits “Cherry-picked from #5548 by @Julientalbot”.
- The case format these observations reduce into: eval-case v1 spec and one complete reduced case.

The failure isn’t in the code. It’s in the cognition — and the fix is measurable.

— Julien Talbot

This analysis comes from the field observatory — replayable cases, oracles, and failure families documented on Labs.