Rule Adherence — Why the AI skips its own rules, and what actually helped #945
@jlacour-git Ha! Thanks, I meant your general findings/algorithm/principles coalesced into something concrete and shareable. If you've already distilled the "here's what seems to work" into something we can apply, that beats each of us simply repeating your method. Unless, and this is me interpreting, what you mean is that each of us fundamentally needs to run the whole council/team process ourselves, with similar prompts, to arrive at our own unique set of steering rules? And that it can't be generalized past the process itself (run Science, run Council, run Red Team)?
@Drizzt321 Good question! The process (Science → Council → RedTeam) is how we found the root causes, but the results are absolutely shareable as concrete patterns. I put together a gist with the three specific interventions we shipped: *Rule Adherence Interventions for PAI*. The short version:

The gist has before/after examples for each. You don't need to run Science/Council/RedTeam yourself to apply these — though the process is useful if you want to find issues specific to your setup.

One thing worth noting: we also disabled extended thinking (`alwaysThinkingEnabled: false`), which noticeably improved rule adherence on its own. Since shipping these changes I continue to see markedly improved behavior — the difference is noticeable across sessions. Would love to hear if others try similar interventions and what their experience is!
## Update: XML Constraint Tags — A Fourth Intervention

Following up on the three interventions from the original post. After shipping those, the next experiment addressed why the model treats all rules equally even when some are absolute constraints.

### The Problem

Steering rules written as markdown prose all look the same to the model. Whether it's "never soften negative information" or "use descriptive variable names" — they're both bold text in a list. The model has to infer which rules are absolute constraints vs. style preferences, and at high density it stops distinguishing reliably.

@Nyrok nailed this in #908 — "instruction fatigue." Rules as prose all look the same. The model re-infers what kind of rule each one is on every response.

### The Experiment

Wrap the highest-priority rules (Trust tier — the ones that must never yield) in typed XML tags:

```xml
<trust-constraints>
  <constraint id="T-verify-before-classifying">
    Check before classifying — before typing a classification verb
    (is, isn't, can, can't, does, doesn't) about anything outside
    the conversation, STOP and verify with one command.
    Bad: "File isn't upstream." (No check.)
    Correct: "isn't" triggers → find ~/PAI -name file* → verified.
  </constraint>
  <constraint id="T-chestertons-fence">
    Before proposing to remove/simplify any element, state in one
    sentence why it was likely put there.
  </constraint>
</trust-constraints>
```

Each constraint gets an `id`.

### Why XML

Claude was trained on XML-heavy data. The hypothesis: XML boundaries are parsed more reliably than markdown formatting as structural delimiters. This aligns with Anthropic's own documentation recommending XML tags for structured prompts.

### Results So Far

Too early for rigorous measurement — we're still inside the 20-session measurement window from the original three interventions. But subjectively: the constraint-tagged rules (especially

Not declaring victory. Sharing because the approach is cheap to try and @Nyrok's typed blocks idea in #908 was thinking along the same lines.

### How to Try It
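One cheap sanity check before shipping a tagged rules file: parse the block and confirm every `<constraint>` carries a unique `id`. A minimal sketch in Python (the tag and attribute names match the example above; the checker itself is illustrative, not part of PAI):

```python
import xml.etree.ElementTree as ET

def check_constraints(xml_text: str) -> list[str]:
    """Parse a <trust-constraints> block and return its constraint ids.

    Raises ValueError if any <constraint> lacks an id or ids repeat.
    """
    root = ET.fromstring(xml_text)
    ids = []
    for node in root.iter("constraint"):
        cid = node.get("id")
        if not cid:
            raise ValueError("constraint without id attribute")
        if cid in ids:
            raise ValueError(f"duplicate constraint id: {cid}")
        ids.append(cid)
    return ids

block = """<trust-constraints>
  <constraint id="T-verify-before-classifying">...</constraint>
  <constraint id="T-chestertons-fence">...</constraint>
</trust-constraints>"""

print(check_constraints(block))
# → ['T-verify-before-classifying', 'T-chestertons-fence']
```

Running this as a pre-commit step catches malformed tags before the model ever sees them.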
Full current steering rules file (with constraint tags) updated in the gist from the original post: https://gist.github.com/jlacour-git/6dcd917a8434b47663b87ab2fb69b962
Hey everyone!
Follow-up to my rule system reform post (#908). The reform cleaned up token weight and structure, but it didn't solve the core problem: the AI still systematically skipped rules, underused skills, and rationalized shortcuts. So I ran a proper investigation.
### The problem, quantified
Across 43 tracked Algorithm sessions:
The frustrating part: the model could cite its own rules verbatim after violating them. It's not a retrieval problem. The rules are in context. The model just... doesn't follow them.
### The investigation
I ran a full Science → Council → RedTeam pipeline on this.
Science produced five hypotheses with pre-registered success criteria. Council (4-agent, 3-round debate) reached unanimous convergence on the two most impactful. RedTeam (32-agent parallel analysis) stress-tested the proposed fixes and identified residual vulnerabilities.
The full investigation files are in the gist linked below.
### The two hypotheses that mattered most
**H1: The Efficiency section is an active rationalization permission.**

My steering rules had an `## EFFICIENCY` section that said things like "don't waste tokens" and "minimum-viable skill tier." It was meant to prevent bloat. Instead, it gave the model explicit textual permission to shortcut — and it used that permission to override higher-priority Trust and Quality rules.

The priority hierarchy (Trust > Correctness > Quality > Efficiency) was stated twice. Didn't matter. The model followed whatever gave it the shortest path, and Efficiency provided the language to justify it.
**H2: "Standard = DEFAULT" anchors all effort estimation downward.**

The Algorithm's effort table labeled Standard as `DEFAULT`. This meant the model started from "do the minimum" and looked for reasons to go higher — instead of assessing on merits. Combined with H1, the model had both the anchor (start low) and the permission (stay low).

### What we shipped
Three changes, all reversible:
**A. Delete the Efficiency section entirely.** Replace it with named prohibitions in the Quality tier:
Key insight from Council: named prohibitions ("do not use X as justification") outperform implicit hierarchies ("X yields to Y"). The model follows concrete negatives more reliably than abstract priority orderings.
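To make the contrast concrete, here is a hypothetical before/after rendering of the same rule (wording invented for illustration; the actual shipped prohibitions are in the gist):

```markdown
<!-- Before: implicit hierarchy -->
Efficiency yields to Quality.

<!-- After: named prohibition in the Quality tier -->
**Do not use token efficiency as justification** for skipping a
verification step, a skill invocation, or a plan phase.
```

The second form names the exact rationalization to refuse, rather than asking the model to resolve a priority ordering on the fly.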
**B. Move DEFAULT from Standard to Extended.**
Standard is now clearly narrow. Extended is the new normal.
**C. PlanMode at Extended+ (was Advanced+).**
The Algorithm's PLAN phase now enters Claude's actual plan mode for all Extended+ tasks. This creates a hard gate — the model can't edit files until the user approves the plan. Covers ~65% of sessions.
### The thinking-off finding
Separate from the text changes: disabling Claude's extended thinking mode (`alwaysThinkingEnabled: false` in settings) noticeably improved rule adherence.

The hypothesis: the model uses the private thinking space to organize and rationalize, not to reason about rules. Without thinking, it follows instructions more directly — there's no private space to convince itself that skipping a step is fine.
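For reference, the toggle is a single settings key. The key name is taken from the post itself; treat the exact file location (e.g. `~/.claude/settings.json`) as an assumption to verify against the Claude Code documentation:

```json
{
  "alwaysThinkingEnabled": false
}
```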
This wasn't just our observation. Two users on anthropics/claude-code#31841 independently found the same thing:
### Early results
First session after shipping all changes: dramatically different behavior. The AI activated the Algorithm unprompted, used plan mode, invoked skills, and thought before acting — all without me insisting.
This is a first impression, not validated data. I have pre-registered success criteria to measure over 20 Algorithm sessions:
Stopping rules: if 10 sessions show no improvement, the approach is wrong. If session times double without quality improvement, we overcorrected.
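The stopping rules are mechanical enough to encode. A hedged sketch (the thresholds come from the post; the session-record shape is invented for illustration, not the real log format):

```python
def evaluate_stopping_rule(sessions: list[dict],
                           baseline_minutes: float) -> str:
    """Apply the two pre-registered stopping rules to session records.

    Each record is a dict like {"improved": bool, "minutes": float,
    "quality_up": bool} -- a hypothetical shape for this sketch.
    """
    # Rule 1: 10 sessions with no improvement means the approach is wrong.
    if len(sessions) >= 10 and not any(s["improved"] for s in sessions[:10]):
        return "stop: no improvement in 10 sessions -- approach is wrong"
    # Rule 2: doubled session time without quality gain means overcorrection.
    avg = sum(s["minutes"] for s in sessions) / len(sessions)
    if avg >= 2 * baseline_minutes and not any(s["quality_up"] for s in sessions):
        return "stop: session time doubled without quality gain -- overcorrected"
    if len(sessions) >= 20:
        return "done: full 20-session window measured"
    return "continue"

# Example: 5 sessions so far, improvements seen, normal duration.
records = [{"improved": True, "minutes": 30, "quality_up": True}] * 5
print(evaluate_stopping_rule(records, baseline_minutes=25))
# → continue
```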
I'll post the measurement results here once I have them.
### What the RedTeam found
The RedTeam didn't reject the interventions — it rejected the claim that they're sufficient. Key weaknesses:
These are tracked as a separate project (hook-enforced capability minimums).
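As one way to picture hook-enforced minimums: a pre-tool-use hook could refuse file edits until an approved-plan marker exists. The sketch below is purely illustrative -- the payload shape, the tool names, the marker file, and the exit-code convention are all assumptions for this sketch, not Claude Code's actual hook contract:

```python
import json
import os

PLAN_MARKER = "/tmp/pai-plan-approved"  # hypothetical marker file

def should_block(payload: dict, marker_exists: bool) -> bool:
    """Block edit-type tool calls when no approved plan marker exists.

    `payload` is assumed to carry a 'tool_name' field (an assumption).
    """
    return payload.get("tool_name") in {"Edit", "Write"} and not marker_exists

def run_hook(stdin_text: str) -> int:
    """Hook entry point: read the tool-call payload, return an exit code.

    A nonzero exit is assumed to block the tool call in this sketch.
    """
    payload = json.loads(stdin_text)
    if should_block(payload, os.path.exists(PLAN_MARKER)):
        return 2
    return 0

print(should_block({"tool_name": "Edit"}, marker_exists=False))
# → True
```

The point of the hard gate is that it runs outside the model's context, so no rationalization in the prompt can talk its way past it.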
### Connection to @Nyrok's prompt work
The XML constraint wrapping from my reply on #908 is part of this story. Trust-tier rules are now wrapped in `<constraint>` tags instead of markdown bold. The hypothesis is that Claude respects XML boundaries more reliably since they're structurally distinct from prose. Still collecting data on that specific change.

### Files
The full investigation files are in this gist: https://gist.github.com/jlacour-git/6dcd917a8434b47663b87ab2fb69b962
- `science-output.md` — 5 hypotheses, baseline measurements, pre-registered success criteria
- `council-redteam-findings.md` — Council convergence, RedTeam attacks, agreed implementation plan
- `AISTEERINGRULES-USER.md` — current steering rules showing the shipped changes
- `algorithm-effort-table.md` — before/after effort level defaults

### For @Drizzt321
You asked about turning this into a `/command` skill. The investigation methodology (Science → Council → RedTeam) already exists as separate PAI skills. But the findings are more general — anyone with a rule system can apply the three interventions (delete efficiency escape hatches, anchor effort defaults higher, wire plan mode into the workflow). No special tooling needed.

Happy to discuss the methodology or findings. Will post measurement updates as data comes in.