Rule Adherence — Why the AI skips its own rules, and what actually helped #945
@jlacour-git Ha! Thanks, I meant your general findings/algorithm/principles coalesced into something concrete and shareable. If you've already distilled the "here's what seems to work" into something we can apply, that beats each of us simply repeating your method. Unless, and this is me interpreting, what you mean is that each of us fundamentally needs to run the whole council/team process ourselves, with similar prompts, to arrive at our own unique set of steering rules? And that it can't be generalized past the process itself (run Science, run Council, run Red Team)?
@Drizzt321 Good question! The process (Science → Council → RedTeam) is how we found the root causes, but the results are absolutely shareable as concrete patterns. I put together a gist with the three specific interventions we shipped: *Rule Adherence Interventions for PAI*. The short version:

The gist has before/after examples for each. You don't need to run Science/Council/RedTeam yourself to apply these — though the process is useful if you want to find issues specific to your setup.

One thing worth noting: we also disabled extended thinking (`alwaysThinkingEnabled: false`), which noticeably improved rule adherence on its own. Since shipping these changes I continue to see markedly improved behavior — the difference is noticeable across sessions. Would love to hear if others try similar interventions and what their experience is!
## Update: XML Constraint Tags — A Fourth Intervention

Following up on the three interventions from the original post. After shipping those, the next experiment addressed why the model treats all rules equally even when some are absolute constraints.

### The Problem

Steering rules written as markdown prose all look the same to the model. Whether it's "never soften negative information" or "use descriptive variable names" — they're both bold text in a list. The model has to infer which rules are absolute constraints vs. style preferences, and at high density it stops distinguishing reliably.

@Nyrok nailed this in #908 — "instruction fatigue." Rules as prose all look the same. The model re-infers what kind of rule each one is on every response.

### The Experiment

Wrap the highest-priority rules (Trust tier — the ones that must never yield) in typed XML tags:

```xml
<trust-constraints>
  <constraint id="T-verify-before-classifying">
    Check before classifying — before typing a classification verb
    (is, isn't, can, can't, does, doesn't) about anything outside
    the conversation, STOP and verify with one command.
    Bad: "File isn't upstream." (No check.)
    Correct: "isn't" triggers → find ~/PAI -name file* → verified.
  </constraint>
  <constraint id="T-chestertons-fence">
    Before proposing to remove/simplify any element, state in one
    sentence why it was likely put there.
  </constraint>
</trust-constraints>
```

Each constraint gets an `id`.

### Why XML

Claude was trained on XML-heavy data. The hypothesis: XML boundaries are parsed more reliably than markdown formatting as structural delimiters. This aligns with Anthropic's own documentation recommending XML tags for structured prompts.

### Results So Far

Too early for rigorous measurement — we're still inside the 20-session measurement window from the original three interventions. But subjectively: the constraint-tagged rules (especially

Not declaring victory. Sharing because the approach is cheap to try and @Nyrok's typed blocks idea in #908 was thinking along the same lines.

### How to Try It
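One cheap sanity check before shipping a tagged rules file: parse the block and confirm every `<constraint>` carries a unique `id`. A minimal sketch in Python (the tag and attribute names match the example above; the checker itself is illustrative, not part of PAI):

```python
import xml.etree.ElementTree as ET

def check_constraints(xml_text: str) -> list[str]:
    """Parse a <trust-constraints> block and return its constraint ids.

    Raises ValueError if any <constraint> lacks an id or ids repeat.
    """
    root = ET.fromstring(xml_text)
    ids = []
    for node in root.iter("constraint"):
        cid = node.get("id")
        if not cid:
            raise ValueError("constraint without id attribute")
        if cid in ids:
            raise ValueError(f"duplicate constraint id: {cid}")
        ids.append(cid)
    return ids

block = """<trust-constraints>
  <constraint id="T-verify-before-classifying">...</constraint>
  <constraint id="T-chestertons-fence">...</constraint>
</trust-constraints>"""

print(check_constraints(block))
# → ['T-verify-before-classifying', 'T-chestertons-fence']
```

Running this as a pre-commit step catches malformed tags before the model ever sees them.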
Full current steering rules file (with constraint tags) updated in the gist from the original post: https://gist.github.com/jlacour-git/6dcd917a8434b47663b87ab2fb69b962
Hey everyone!
Follow-up to my rule system reform post (#908). The reform cleaned up token weight and structure, but it didn't solve the core problem: the AI still systematically skipped rules, underused skills, and rationalized shortcuts. So I ran a proper investigation.
### The problem, quantified
Across 43 tracked Algorithm sessions:
The frustrating part: the model could cite its own rules verbatim after violating them. It's not a retrieval problem. The rules are in context. The model just... doesn't follow them.
### The investigation
I ran a full Science → Council → RedTeam pipeline on this.
Science produced five hypotheses with pre-registered success criteria. Council (4-agent, 3-round debate) reached unanimous convergence on the two most impactful. RedTeam (32-agent parallel analysis) stress-tested the proposed fixes and identified residual vulnerabilities.
The full investigation files are in the gist linked below.
### The two hypotheses that mattered most
**H1: The Efficiency section is an active rationalization permission.**

My steering rules had an `## EFFICIENCY` section that said things like "don't waste tokens" and "minimum-viable skill tier." It was meant to prevent bloat. Instead, it gave the model explicit textual permission to shortcut — and it used that permission to override higher-priority Trust and Quality rules.

The priority hierarchy (Trust > Correctness > Quality > Efficiency) was stated twice. Didn't matter. The model followed whatever gave it the shortest path, and Efficiency provided the language to justify it.
**H2: "Standard = DEFAULT" anchors all effort estimation downward.**

The Algorithm's effort table labeled Standard as `DEFAULT`. This meant the model started from "do the minimum" and looked for reasons to go higher — instead of assessing on merits. Combined with H1, the model had both the anchor (start low) and the permission (stay low).

### What we shipped
Three changes, all reversible:
**A. Delete the Efficiency section entirely.** Replace it with named prohibitions in the Quality tier:
Key insight from Council: named prohibitions ("do not use X as justification") outperform implicit hierarchies ("X yields to Y"). The model follows concrete negatives more reliably than abstract priority orderings.
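To make the contrast concrete, here is a hypothetical before/after rendering of the same rule (wording invented for illustration; the actual shipped prohibitions are in the gist):

```markdown
<!-- Before: implicit hierarchy -->
Efficiency yields to Quality.

<!-- After: named prohibition in the Quality tier -->
**Do not use token efficiency as justification** for skipping a
verification step, a skill invocation, or a plan phase.
```

The second form names the exact rationalization to refuse, rather than asking the model to resolve a priority ordering on the fly.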
**B. Move DEFAULT from Standard to Extended.**
Standard is now clearly narrow. Extended is the new normal.
**C. PlanMode at Extended+ (was Advanced+).**
The Algorithm's PLAN phase now enters Claude's actual plan mode for all Extended+ tasks. This creates a hard gate — the model can't edit files until the user approves the plan. Covers ~65% of sessions.
### The thinking-off finding
Separate from the text changes: disabling Claude's extended thinking mode (`alwaysThinkingEnabled: false` in settings) noticeably improved rule adherence.

The hypothesis: the model uses the private thinking space to organize and rationalize, not to reason about rules. Without thinking, it follows instructions more directly — there's no private space to convince itself that skipping a step is fine.
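For reference, the toggle is a single settings key. The key name is taken from the post itself; treat the exact file location (e.g. `~/.claude/settings.json`) as an assumption to verify against the Claude Code documentation:

```json
{
  "alwaysThinkingEnabled": false
}
```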
This wasn't just our observation. Two users on anthropics/claude-code#31841 independently found the same thing:
### Early results
First session after shipping all changes: dramatically different behavior. The AI activated the Algorithm unprompted, used plan mode, invoked skills, and thought before acting — all without me insisting.
This is a first impression, not validated data. I have pre-registered success criteria to measure over 20 Algorithm sessions:
Stopping rules: if 10 sessions show no improvement, the approach is wrong. If session times double without quality improvement, we overcorrected.
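The stopping rules are mechanical enough to encode. A hedged sketch (the thresholds come from the post; the session-record shape is invented for illustration, not the real log format):

```python
def evaluate_stopping_rule(sessions: list[dict],
                           baseline_minutes: float) -> str:
    """Apply the two pre-registered stopping rules to session records.

    Each record is a dict like {"improved": bool, "minutes": float,
    "quality_up": bool} -- a hypothetical shape for this sketch.
    """
    # Rule 1: 10 sessions with no improvement means the approach is wrong.
    if len(sessions) >= 10 and not any(s["improved"] for s in sessions[:10]):
        return "stop: no improvement in 10 sessions -- approach is wrong"
    # Rule 2: doubled session time without quality gain means overcorrection.
    avg = sum(s["minutes"] for s in sessions) / len(sessions)
    if avg >= 2 * baseline_minutes and not any(s["quality_up"] for s in sessions):
        return "stop: session time doubled without quality gain -- overcorrected"
    if len(sessions) >= 20:
        return "done: full 20-session window measured"
    return "continue"

# Example: 5 sessions so far, improvements seen, normal duration.
records = [{"improved": True, "minutes": 30, "quality_up": True}] * 5
print(evaluate_stopping_rule(records, baseline_minutes=25))
# → continue
```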
I'll post the measurement results here once I have them.
### What the RedTeam found
The RedTeam didn't reject the interventions — it rejected the claim that they're sufficient. Key weaknesses:
These are tracked as a separate project (hook-enforced capability minimums).
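As one way to picture hook-enforced minimums: a pre-tool-use hook could refuse file edits until an approved-plan marker exists. The sketch below is purely illustrative -- the payload shape, the tool names, the marker file, and the exit-code convention are all assumptions for this sketch, not Claude Code's actual hook contract:

```python
import json
import os

PLAN_MARKER = "/tmp/pai-plan-approved"  # hypothetical marker file

def should_block(payload: dict, marker_exists: bool) -> bool:
    """Block edit-type tool calls when no approved plan marker exists.

    `payload` is assumed to carry a 'tool_name' field (an assumption).
    """
    return payload.get("tool_name") in {"Edit", "Write"} and not marker_exists

def run_hook(stdin_text: str) -> int:
    """Hook entry point: read the tool-call payload, return an exit code.

    A nonzero exit is assumed to block the tool call in this sketch.
    """
    payload = json.loads(stdin_text)
    if should_block(payload, os.path.exists(PLAN_MARKER)):
        return 2
    return 0

print(should_block({"tool_name": "Edit"}, marker_exists=False))
# → True
```

The point of the hard gate is that it runs outside the model's context, so no rationalization in the prompt can talk its way past it.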
### Connection to @Nyrok's prompt work
The XML constraint wrapping from my reply on #908 is part of this story. Trust-tier rules are now wrapped in `<constraint>` tags instead of markdown bold. The hypothesis is that Claude respects XML boundaries more reliably since they're structurally distinct from prose. Still collecting data on that specific change.

### Files
The full investigation files are in this gist: https://gist.github.com/jlacour-git/6dcd917a8434b47663b87ab2fb69b962
- `science-output.md` — 5 hypotheses, baseline measurements, pre-registered success criteria
- `council-redteam-findings.md` — Council convergence, RedTeam attacks, agreed implementation plan
- `AISTEERINGRULES-USER.md` — current steering rules showing the shipped changes
- `algorithm-effort-table.md` — before/after effort level defaults

### For @Drizzt321
You asked about turning this into a `/command` skill. The investigation methodology (Science → Council → RedTeam) already exists as separate PAI skills. But the findings are more general — anyone with a rule system can apply the three interventions (delete efficiency escape hatches, anchor effort defaults higher, wire plan mode into the workflow). No special tooling needed.

Happy to discuss the methodology or findings. Will post measurement updates as data comes in.