AI Engineering · March 2026 · 8 min read

Defeating Model Stubbornness: How Anti-Rationalization Gates Make AI Agents Actually Follow Instructions

Every developer who has built seriously with an AI coding agent has hit the same wall. You give clear instructions. The agent acknowledges them. Then it does something slightly different anyway — and explains why that was actually fine. Here's how to stop it.

It's one of the most frustrating experiences in AI-assisted development. You write a clear instruction: "Do not use mocks in tests — use real components." The agent says "understood." Then it writes a test with mocks, adds a comment explaining that "using mocks here is standard practice for this type of unit test," and moves on.

Or you specify: "Only change the lines I've asked you to change. Do not refactor anything else." The agent fixes the bug, then helpfully restructures three adjacent functions because they "could be improved while we're here."

This behaviour has a name: rationalization. The model is not ignoring your instructions — it's finding reasons why a deviation from them is acceptable. And because large language models are trained to be helpful and to produce coherent, well-reasoned output, they're quite good at this.

Why models take shortcuts

The statistical center of a language model's training is "average, reasonable code" — the kind of code that appears most frequently across the internet. When you give the model a task without absolute constraints, it tends toward this average. Mocks are more common than real components in most test suites. Refactoring adjacent code while fixing a bug is common practice. Skipping formula documentation is standard.

This creates what researchers call "distributional convergence" — the model produces code that's defensible, common, and forgettable rather than the specific thing you asked for. The model isn't being lazy; it's being statistically typical.

"The model isn't ignoring your instructions — it's finding coherent reasons why a deviation from them is the right call. That's the harder problem to solve."

The standard advice is to be more explicit. But experience shows this has limits. "Really make sure not to use mocks" is not meaningfully different from "do not use mocks" from the model's perspective. What's needed is a structural approach — not better phrasing, but a different kind of instruction that removes the model's ability to rationalize.

Anti-rationalization gates: the pattern

Anti-rationalization gates are a context engineering pattern that uses absolute language and bright-line rules to eliminate the decision surface where rationalization occurs. Instead of expressing a preference or a guideline, you express a binary: compliance or failure.

The distinction matters because guidelines invite interpretation. A guideline like "prefer real components over mocks" gives the model room to decide that this particular situation is an exception. A bright-line rule like "NO MOCKS — compliance is 100% or the test fails" removes that room entirely.

Rationalization patterns and their gates

Scope creep
Rationalization: "While I'm here, let me also refactor..."
Gate: "Surgical Changes Principle: Change ONLY the requested lines. Touch nothing else."

Test evasion
Rationalization: "This calculation is standard — no documentation needed."
Gate: "SHOW the formula. DOCUMENT the source. Nothing is standard. No exceptions."

Complexity avoidance
Rationalization: "This seems complex, so I'll use mocks to simplify the test."
Gate: "NO MOCKS. Compliance is 100% or fail. Use real components regardless of complexity."

Pre-existing issues
Rationalization: "This bug was already there before my change."
Gate: "If you touch a file, fix any bugs you find. Pre-existing is not an excuse."

Out of scope
Rationalization: "That edge case is outside the scope of this task."
Gate: "If you encounter it, handle it. Scope is not a reason to leave broken code."
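Because gates are fixed, pre-written text rather than ad-hoc phrasing, they can be attached to prompts mechanically. The sketch below (the `GATES` table and `with_gates` helper are illustrative, not from any specific framework) shows one way to append the matching bright-line rules from the table above to a task prompt:

```python
# Bright-line gates keyed by the rationalization pattern they block.
# The wording mirrors the table above; the helper name is hypothetical.
GATES = {
    "scope creep": "Surgical Changes Principle: Change ONLY the requested lines. Touch nothing else.",
    "test evasion": "SHOW the formula. DOCUMENT the source. Nothing is standard. No exceptions.",
    "complexity avoidance": "NO MOCKS. Compliance is 100% or fail. Use real components regardless of complexity.",
    "pre-existing issues": "If you touch a file, fix any bugs you find. Pre-existing is not an excuse.",
    "out of scope": "If you encounter it, handle it. Scope is not a reason to leave broken code.",
}

def with_gates(task: str, patterns: list[str]) -> str:
    """Append the gates for the named patterns as non-negotiable rules."""
    rules = [GATES[p] for p in patterns if p in GATES]
    return task + "\n\nNon-negotiable rules:\n" + "\n".join(f"- {r}" for r in rules)
```

Selecting only the gates relevant to the task keeps the prompt short while still making each included rule binary rather than advisory.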

The gatekeeper model pattern

In production frameworks like Shannon and Legion — engineering systems used at scale for AI-assisted development — teams implement a secondary layer: a fast, inexpensive model that reviews the primary agent's implementation plan before it executes.

The gatekeeper model (typically Claude Haiku or a similar fast model) receives a checklist of specific rationalization patterns to look for. It doesn't evaluate the quality of the code — it evaluates whether the implementation plan contains any of the identified "cop-outs."

Gatekeeper prompt pattern

You are a compliance reviewer, not a code reviewer.

Review the following implementation plan for these specific red flags:

- Any mention of "pre-existing issues"
- Any use of "out of scope" 
- Any reference to "standard practice" without citation
- Any plan to use mocks instead of real components
- Any changes beyond the specific lines requested

If you find any of these: REJECT with exact location.
If you find none: APPROVE.

Do not evaluate code quality. Only check compliance.

Implementation plan to review:
{AGENT_PLAN}

The cost of running a Haiku-tier review pass is negligible compared to the cost of a Sonnet-tier implementation that gets it wrong. The architectural insight is that verification is cheap and correction is expensive — so you verify before you execute.
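The checklist logic itself can be approximated locally. The sketch below (function name and phrase list are illustrative) scans a plan for literal red-flag phrases; a real gatekeeper uses a fast LLM precisely because it can also catch paraphrased cop-outs that a keyword match would miss:

```python
# Red-flag phrases from the gatekeeper checklist. A plain-text scan only
# catches literal matches; an LLM pass also catches paraphrases.
RED_FLAGS = [
    "pre-existing issue",
    "out of scope",
    "standard practice",
    "mock",
]

def review_plan(plan: str) -> str:
    """Reject a plan containing any red flag, reporting its exact location."""
    for lineno, line in enumerate(plan.splitlines(), start=1):
        for flag in RED_FLAGS:
            if flag in line.lower():
                return f"REJECT: '{flag}' found on line {lineno}"
    return "APPROVE"
```

Note that, like the prompt above, this checks compliance only — it says nothing about whether the plan is good, just whether it contains a known cop-out.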

How to write anti-rationalization gates into skill files

The most durable way to implement this pattern is in your SKILL.md files rather than in session prompts. Gates in skill files persist across every session, apply consistently to the relevant domain, and don't rely on you remembering to include them each time.

The structure that works best has three components in the "Key Behaviours" section:

1. The absolute rule. State what must always happen (or never happen) in unambiguous language. Avoid "prefer," "try to," or "where possible." Use "ALWAYS," "NEVER," or "100% compliance required."

2. The specific exception carve-out. Explicitly state any legitimate exceptions. If there are none, say so. "There are no exceptions to this rule" removes the rationalization surface entirely.

3. The failure definition. Define what counts as non-compliance. "If this rule is violated, the implementation is considered failed — do not proceed." This makes the gate binary rather than advisory.
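Put together, a gate in a skill file might look like the following sketch (the section layout and wording are illustrative, not a prescribed SKILL.md schema):

```markdown
## Key Behaviours

### Testing

- NO MOCKS. ALWAYS use real components, regardless of complexity.
- There are no exceptions to this rule.
- If a mock appears anywhere in the test suite, the implementation is
  considered failed. Do not proceed.
```

The first bullet is the absolute rule, the second is the exception carve-out, and the third is the failure definition.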

What this transforms in practice

The before/after of using anti-rationalization gates is stark. Without them, a testing skill might instruct Claude to "prefer real components where possible." With them, it says "NO MOCKS — compliance is 100% or the test is rejected. There are no exceptions."

The difference in output quality isn't subtle. Teams using structured gates report that the pattern eliminates entire categories of review comments — the ones about scope creep, shortcuts, and undocumented assumptions. The agent stops being a helpful assistant who occasionally cuts corners and starts being a disciplined engineer following a procedural contract.

That's the goal. Not an AI that always does what feels right — an AI that always does what you specified, even when something else would feel more natural to generate.

Valid8it

Skill files built for disciplined execution

Valid8it generates Claude Code skill packages that include domain-specific behavioural constraints — not generic advice. Start your build with context that keeps Claude on track.

Validate your idea free →


© 2026 Valid8it