The five-point harness is PAWL's roundtrip validation object: it is how PAWL validates itself. It does so by checking source and roundtrip artifacts, and the tools that produce them, against one another across structural, semantic, and, where possible, behavioral checks. When those checks disagree, the pattern of agreement and loss gives directional hints about which layer likely failed. In short, the harness turns agreement and disagreement across the roundtrip into evidence about where correctness holds or breaks.
PAWL applies this roundtrip validation pattern to two different kinds of artifacts: full sandbox profiles and regex artifacts. A profile is a whole sandbox policy whose meaning is distributed across many operations, predicate structures, compiler effects, and runtime decisions, so it has no single clean semantic center. A regex artifact defines a regular language and therefore has one. That difference in subject is why profiles and regexes need different roundtrip validation shapes. The two harnesses are:
- the profile harness in ir/profile/five_point_harness.py
- the regex harness in ir/profile/regex/five_point_harness.py
They share a role and a common result model. They do not share one unified mechanism.
      1   ◄══ normalized eq ══►   3
    Source                     Reversed
      │               ↗           │
      │      reverse              │ compile
      ▼             ╱             ▼
      2 ───────────╯              4
    Source ◄═ structural eq ═► Roundtrip
     blob                        blob
      │                           │
      └   ◄═ behavioral eq ═►    ─┘
Read the diagram first as a sequence of artifacts to be produced:
1. a source surface
2. a compiled source blob
3. a reversed or decoded surface
4. a roundtrip compiled blob
Given these four artifacts, we can make the following kinds of comparisons unconditionally:
- surface comparison between the source form and the reversed or decoded form
- structural comparison between the source blob and the roundtrip blob
The following may be made under some conditions:
- semantic comparison when the subject admits a clean equivalence check
- behavioral comparison when runtime evidence is available
The numbered list is the artifact generation sequence and the bulleted lists are the comparison bands. A given harness family may realize one band through several concrete checks, and some bands are only available when the subject and evidence support them. That interaction is why the five-point harness produces a localizing result rather than one equivalence claim.
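The split between unconditional and conditional bands can be sketched as follows. This is a minimal illustration, not the harness code: `RoundtripArtifacts`, `compare_bands`, the whitespace normalizer, and the use of byte equality as the structural check are all hypothetical stand-ins chosen to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class RoundtripArtifacts:
    """Hypothetical container for the four artifacts in the diagram."""
    source_surface: str    # 1: authored source
    source_blob: bytes     # 2: compiled source
    reversed_surface: str  # 3: reversed or decoded surface
    roundtrip_blob: bytes  # 4: recompiled roundtrip


def _collapse_whitespace(text: str) -> str:
    # Toy normalizer (assumption); the real normalization is far richer.
    return " ".join(text.split())


def compare_bands(
    artifacts: RoundtripArtifacts,
    semantic_check: Optional[Callable[[str, str], bool]] = None,
    behavioral_check: Optional[Callable[[bytes, bytes], bool]] = None,
) -> Dict[str, Optional[bool]]:
    """Return one verdict per band; None means the band was not active."""
    return {
        # Unconditional bands: always computable from the four artifacts.
        "surface": _collapse_whitespace(artifacts.source_surface)
        == _collapse_whitespace(artifacts.reversed_surface),
        "structural": artifacts.source_blob == artifacts.roundtrip_blob,
        # Conditional bands: only active when a checker is supplied.
        "semantic": (
            semantic_check(artifacts.source_surface, artifacts.reversed_surface)
            if semantic_check is not None
            else None
        ),
        "behavioral": (
            behavioral_check(artifacts.source_blob, artifacts.roundtrip_blob)
            if behavioral_check is not None
            else None
        ),
    }
```

Keeping `None` distinct from `False` is the point of the sketch: an inactive band is missing evidence, not a failure.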
At the conceptual level, both harness families return three things:
- checks: the concrete active comparisons that realize the comparison bands described above
- status: a derived summary using the shared status vocabulary in ../pawl/contract/status.py
- distinguishing: the payload that explains why a result is not clean, or what special conditions shaped the interpretation
checks is the primary object: it tells you which comparisons were active and
what each one found. status is a compressed readout of that evidence, and
distinguishing is the explanation layer for non-clean or special outcomes.
Typical distinguishing payloads include canonical diff summaries, first
runtime divergence details, and special-case flags such as decode_only or
timeout context.
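As a rough picture of that three-part shape, the sketch below models a result as a small dataclass. The class name `HarnessResult`, its helper methods, and the `canonical_diff_summary` key are hypothetical and chosen for illustration; the real result types live in the owner files.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class HarnessResult:
    """Hypothetical model of the conceptual result shape."""
    # Primary evidence: check name -> True / False / None (not active).
    checks: Dict[str, Optional[bool]]
    # Compressed readout of the evidence, when a harness derives one.
    status: Optional[str] = None
    # Explanation layer for non-clean or special outcomes.
    distinguishing: Dict[str, Any] = field(default_factory=dict)

    def failing(self) -> List[str]:
        """Names of active checks that found a disagreement."""
        return sorted(name for name, verdict in self.checks.items() if verdict is False)

    def inactive(self) -> List[str]:
        """Names of checks that produced no verdict at all."""
        return sorted(name for name, verdict in self.checks.items() if verdict is None)


result = HarnessResult(
    checks={"sbpl_eq": True, "canonical_eq": False, "runtime_eq": None},
    status="mismatch",
    distinguishing={"canonical_diff_summary": "1 rule regrouped"},  # illustrative payload
)
```

Reading `result.failing()` and `result.inactive()` before `result.status` mirrors the priority described above: the check vector first, the summary second.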
The checks do not form a strict ladder, and the harness should not be read as "highest passing level wins."
On the profile surface, observed outcomes already show why:
- runtime_eq can pass while canonical_eq fails
- semantic_eq can pass while canonical_eq fails
- op_table_eq can pass while semantic_eq fails
- sbpl_eq can pass while op_table_eq fails
That means there is no safe transitive story such as "if a later check passed, the earlier ones are implied." The harness is intentionally a vector because different surfaces expose different kinds of loss.
The regex harness has a different check structure, so these exact pass/fail pairings should not be projected onto it automatically. The safe shared claim is narrower: neither harness should be mentally collapsed into a single ordered ladder.
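The difference between a vector reading and a ladder reading can be made concrete with a small sketch. The pass/fail pairs below are the ones listed above for the profile surface; the function names and the check ordering used in the usage example are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# (passing check, failing check) pairs reported above for the profile surface.
OBSERVED_PAIRS: List[Tuple[str, str]] = [
    ("runtime_eq", "canonical_eq"),
    ("semantic_eq", "canonical_eq"),
    ("op_table_eq", "semantic_eq"),
    ("sbpl_eq", "op_table_eq"),
]


def pairs_present(checks: Dict[str, Optional[bool]]) -> List[Tuple[str, str]]:
    """Return the observed pass/fail pairs that occur in this check vector."""
    return [
        (passing, failing)
        for passing, failing in OBSERVED_PAIRS
        if checks.get(passing) is True and checks.get(failing) is False
    ]


def ladder_reading(checks: Dict[str, Optional[bool]], order: List[str]) -> Optional[str]:
    """The misreading to avoid: report only the last passing check in some order."""
    highest = None
    for name in order:
        if checks.get(name) is True:
            highest = name
    return highest


vector = {
    "sbpl_eq": True,
    "op_table_eq": False,
    "canonical_eq": False,
    "semantic_eq": True,
    "runtime_eq": True,
}
```

With this vector, `ladder_reading` under any ordering that ends in `runtime_eq` reports only `runtime_eq` and hides both failures, while `pairs_present` keeps each disagreement visible.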
In the profile harness, the comparison bands described above are realized through several concrete checks rather than one check per band.
- The surface band is sbpl_eq. It compares source and reversed SBPL after source-aware reconciliation and IR-informed filtering. Before normalization, a reconciliation stage (ir/profile/sbpl_reconcile.py) calibrates the reversed text using the known source: it restores source regex forms where the compiler rewrote them to subpath/literal, strips compiler-merged require-not structure when the source has separate allow+deny rules, and normalizes trailing-slash and string-append artifacts. The reconciliation sidecar records sbpl_eq_before (native reverse surface) and sbpl_eq_after (reconciled), so consumers can distinguish reverser fidelity from reconciliation coverage. The check result in outcome.json tracks sbpl_eq_after.
- The structural band is split across ir_eq, op_table_eq, and graph_eq. Together these ask whether the roundtrip preserved the important compiled shape of the policy: per-operation subgraph identity, slot classification, and reachable decision structure. The owner code lives in ir/profile/ir_builder.py and ../pawl/contract/.
- The semantic band is canonical_eq and semantic_eq. These are compare-only normalization surfaces from ../pawl/normalize/reversed.py and ../pawl/normalize/policy.py: canonical_eq asks whether normalized rule structures still agree, while semantic_eq asks whether the same leaf predicates remain even when grouping changes.
- The behavioral band is runtime_eq. For profiles it combines PolicyWitness in ir/profile/runtime_compare.py with the file-op oracle in ../runtime/oracle/profile.py.
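One way a consumer might read the sbpl_eq_before and sbpl_eq_after pair from the reconciliation sidecar is sketched below. The function name and the four labels are hypothetical; only the two field names come from the description above.

```python
def classify_sbpl_sidecar(sbpl_eq_before: bool, sbpl_eq_after: bool) -> str:
    """Separate reverser fidelity from reconciliation coverage (illustrative labels)."""
    if sbpl_eq_before and sbpl_eq_after:
        # The native reverse surface already matched the source.
        return "native_reverse_clean"
    if not sbpl_eq_before and sbpl_eq_after:
        # Reconciliation accounted for every surface difference.
        return "reconciliation_closed_gap"
    if not sbpl_eq_before and not sbpl_eq_after:
        # Some difference is outside current reconciliation coverage.
        return "gap_remains_after_reconciliation"
    # Matched before but not after: reconciliation itself changed the outcome.
    return "reconciliation_regressed"
```

Only the second and third cases say anything about reconciliation coverage; the first is a statement about the reverser alone.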
One profile check deserves separate treatment: compiler_loss_adjusted_eq.
This is not another peer comparison surface. It is a derived-view flag used by
downstream classifiers when the remaining structural difference is attributed
to compiler loss or optimization. Its job is to keep "known compiler loss"
distinct from "unknown reverser defect" without changing the raw check vector.
The profile harness can also stop before the full comparison story is available. Two important special cases are built into the owner code:
- decode_only: The harness can decode the source artifact but cannot run a true roundtrip comparison. Message-filter profiles are the main example.
- timeout: The harness exceeded its allowed evaluation budget before producing a normal result.
Those are still harness outcomes, but the profile harness does not collapse them into a single flat result label.
The regex harness in ir/profile/regex/five_point_harness.py
realizes those same comparison bands differently because its subject is a
regular language rather than a whole policy.
- The surface band is sbpl_eq_raw and sbpl_eq. These compare the authored and decoded regex surfaces at raw and canonicalized levels, so syntax preservation and compare-only normalization both matter.
- The structural band is ir_eq_raw and ir_eq. These compare the regex IR recovered from source and roundtrip sides after decode, asking whether the decoded structural representation stayed aligned even when exact surface text changed.
- The semantic band is carried by DFA parity, because DFA equivalence answers whether two patterns accept the same language. dfa_eq_source and dfa_eq_roundtrip compare decoder-derived and IR-derived automata, and the decoder-vs-direct drift checks (decoder_eq_direct_source, decoder_eq_direct_roundtrip) catch cases where the decoder's DFA and the recovered pattern text silently define different languages.
- The behavioral band is optional runtime_eq. When present, it compares concrete runtime behavior on file-read subjects.
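To show why DFA parity gives the regex harness a clean semantic center, here is a minimal sketch of the textbook product-automaton equivalence check. It is not the harness's implementation: the `Dfa` class, the convention that a missing transition means rejection, and `find_distinguishing_word` are assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass
from typing import Dict, FrozenSet, Hashable, Optional, Tuple

State = Hashable


@dataclass
class Dfa:
    """Toy DFA; a missing transition is treated as a move to a dead state."""
    start: State
    accepting: FrozenSet[State]
    transitions: Dict[Tuple[State, str], State]

    def step(self, state: Optional[State], symbol: str) -> Optional[State]:
        if state is None:  # already in the dead state
            return None
        return self.transitions.get((state, symbol))

    def is_accepting(self, state: Optional[State]) -> bool:
        return state is not None and state in self.accepting


def find_distinguishing_word(left: Dfa, right: Dfa, alphabet: str) -> Optional[str]:
    """Breadth-first walk of the product automaton.

    Returns None when both DFAs accept the same language over `alphabet`,
    otherwise a shortest word accepted by exactly one of them.
    """
    start = (left.start, right.start)
    seen = {start}
    queue = deque([(start, "")])
    while queue:
        (left_state, right_state), word = queue.popleft()
        if left.is_accepting(left_state) != right.is_accepting(right_state):
            return word
        for symbol in alphabet:
            pair = (left.step(left_state, symbol), right.step(right_state, symbol))
            if pair not in seen:
                seen.add(pair)
                queue.append((pair, word + symbol))
    return None


# Two structurally different automata for the language a*.
one_state = Dfa(start=0, accepting=frozenset({0}), transitions={(0, "a"): 0})
two_state = Dfa(
    start="even",
    accepting=frozenset({"even", "odd"}),
    transitions={("even", "a"): "odd", ("odd", "a"): "even"},
)
```

The two automata above differ in shape but not in language, which is the kind of case where a structural check can disagree while a semantic check stays clean. A returned word also doubles as a concrete witness that could be recorded as distinguishing evidence.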
runtime_eq means the runtime witnesses did not find a behavioral disagreement
between source and roundtrip on the runtime targets they could actually use. It
does not mean "the artifact is correct in all cases," and it does not replace
the static comparison surfaces.
For profiles, the runtime story is split:
- PolicyWitness, owned by ir/profile/runtime_compare.py, is the instrumented sandbox_check witness. It exercises the decision function and gathers supporting evidence, but it is still a decision-function witness, not a full behavior-path execution.
- the binary file-op oracle in ../runtime/oracle/profile.py exercises a narrower set of real file operations under sandbox_init
Those witnesses are combined into one runtime_eq result by the profile
harness. A missing runtime signal can still leave a structurally clean result
as ok. A passing runtime result can still coexist with structural
differences, which is why runtime_eq=True does not collapse the whole harness
into a runtime-only truth model.
For regex, runtime is simpler and narrower. The harness can compare actual behavior on concrete file-read subjects. That makes runtime a direct witness for those subjects, but still not a total substitute for the other checks.
Runtime availability is limited in concrete ways:
- PolicyWitness only covers operation families that have mappings in _OP_ATTEMPT_MAP inside ir/profile/runtime_compare.py
- the file-op oracle only participates for explicit file operations and self-skips when the IR yields no explicit file ops ("skipped": "no_file_ops" in ../runtime/oracle/profile.py)
- decode_only profile outcomes never enter the full runtime-comparison path
So runtime_eq=None should be read as "no usable runtime signal was produced
here," not as one specific failure mode. It may reflect witness coverage
limits, unavailable execution prerequisites, or other runtime-side constraints.
runtime_eq=True also needs interpretation. Sometimes it reflects direct
agreement on concrete runtime targets, and sometimes it reflects a thinner
witness path such as PolicyWitness vacuity (vacuous_no_steps) or a skipped
file-op oracle (no_file_ops); the surrounding payload tells you which kind of
runtime support you actually had.
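A consumer that wants to keep these cases apart might do something like the sketch below. The payload layout is an assumption made for illustration: the keys `policy_witness`, `file_op_oracle`, and `vacuity` are hypothetical, while the values `vacuous_no_steps` and `no_file_ops` and the `skipped` key are taken from the text above.

```python
from typing import Any, Dict, List, Optional


def describe_runtime_support(runtime_eq: Optional[bool], payload: Dict[str, Any]) -> str:
    """Label the kind of runtime evidence behind a runtime_eq value."""
    if runtime_eq is None:
        # Covers several causes; not one specific failure mode.
        return "no_usable_runtime_signal"
    if runtime_eq is False:
        return "runtime_disagreement"

    # runtime_eq is True: check whether the agreement rests on thin evidence.
    thin: List[str] = []
    witness = payload.get("policy_witness", {})  # hypothetical key
    oracle = payload.get("file_op_oracle", {})   # hypothetical key
    if witness.get("vacuity") == "vacuous_no_steps":
        thin.append("vacuous_no_steps")
    if oracle.get("skipped") == "no_file_ops":
        thin.append("no_file_ops")
    if thin:
        return "thin_agreement:" + ",".join(thin)
    return "direct_agreement"
```

For example, `describe_runtime_support(True, {"file_op_oracle": {"skipped": "no_file_ops"}})` yields a thin-agreement label rather than a plain pass, which keeps a skipped oracle from being read as direct behavioral confirmation.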
The safe mental model is: runtime is confirmatory evidence. It is not the only kind of evidence the harness trusts, and it is not universally available.
The shared vocabulary comes from ../pawl/contract/status.py, but the profile
five-point harness no longer publishes a flat status field. It emits the check
vector plus distinguishing evidence directly, and downstream consumers derive
whatever view they need from those facts.
For example:
- casefile generation maps the harness output into outcome.json's three axes (circuit, equality, runtime)
- diagnostic and d-ring tooling derive flat views such as ok, degraded, or mismatch from those axes only when a summary label is actually useful
That separation matters. The harness is the evidence-producing layer; flat labels are interpretations.
The regex harness still derives status in its result summary inside
ir/profile/regex/five_point_harness.py.
In prose, the regex rules are:
- ok means the active regex comparison surfaces are clean
- degraded means the harness still ran, but decode quality, fallback paths, or DFA-equivalence limits weakened the strength of the claim
- unsupported means the needed capability boundary was not available for the stronger comparison the harness wanted to make
- mismatch means at least one active non-raw comparison surface found a real disagreement
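Those prose rules can be approximated in a few lines. This is a sketch under stated assumptions, not the logic in ir/profile/regex/five_point_harness.py: the precedence (mismatch, then unsupported, then degraded, then ok), the function name, and the two boolean inputs are all hypothetical. The raw check names are the ones listed for the regex surface and structural bands.

```python
from typing import Dict, Optional

# Raw checks do not drive mismatch, per the rule about non-raw surfaces.
RAW_CHECKS = frozenset({"sbpl_eq_raw", "ir_eq_raw"})


def derive_regex_status(
    checks: Dict[str, Optional[bool]],
    *,
    capability_missing: bool = False,  # hypothetical input: needed capability unavailable
    claim_weakened: bool = False,      # hypothetical input: fallback or decode-quality limits
) -> str:
    """Derive a flat status from a regex check vector (assumed precedence)."""
    active_non_raw = [
        verdict
        for name, verdict in checks.items()
        if name not in RAW_CHECKS and verdict is not None
    ]
    if any(verdict is False for verdict in active_non_raw):
        return "mismatch"
    if capability_missing:
        return "unsupported"
    if claim_weakened:
        return "degraded"
    return "ok"
```

Under these assumptions a failing `sbpl_eq_raw` alone leaves the status at ok, because a raw surface difference that canonicalization absorbs is not treated as a real disagreement.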
The regex harness does not produce blocked. Its strongest disagreement
signal is mismatch. When downstream profile readers still derive a flat
summary, blocked remains the profile-side label for concrete runtime-backed
non-equivalence.
The important point is not that profile and regex use the same words. The important point is that regex still emits a flat status directly, while profile pushes callers toward the underlying evidence surfaces.
ir/profile/diagnostic_analysis.py is not the five-point harness.
It is a consumer of five-point outputs.
It reads the harness results and adds its own interpretive layer:
- monotonicity flags
- loss type
- mechanism surface
- solution-layer guesses
- directional suggestions
That diagnostic overlay is useful, but it is a different object. The harness produces validation signals. The diagnostic analysis interprets them.
This document is about understanding the five-point harness as a validation
object. It is not the owner document for every subsystem it touches.
Operational guidance about what to do with a given result belongs in working
guidance such as AGENTS.md, not here.
For deeper reading, these are the owner files behind the main surfaces described above:
- ir/profile/five_point_harness.py: profile harness computation and raw comparison surfaces
- ir/profile/sbpl_reconcile.py: source-aware reconciliation for sbpl_eq (regex restoration, deny-merge inversion, trailing-slash cleanup)
- ir/profile/regex/five_point_harness.py: regex harness computation and status derivation
- ir/profile/runtime_compare.py: PolicyWitness-side runtime equivalence for profiles
- ../runtime/oracle/profile.py: profile file-op runtime oracle
- ../pawl/normalize/reversed.py: canonical comparison pipeline for source vs reversed SBPL
- ../pawl/normalize/policy.py: canonical policy and semantic-equality surfaces
- ../pawl/contract/status.py: shared status vocabulary
- ../pawl/contract/policy_dag.py: contract DAG types being compared
- ../pawl/contract/compare.py: lane-neutral DAG comparison views
- ../pawl/contract/envelope.py: envelope and structural-hash layer