
Five-Point Harness

The five-point harness is PAWL's roundtrip validation object: it is how PAWL validates itself. It pits source and roundtrip artifacts, and the tools that produce them, against one another across structural, semantic, and, where possible, behavioral checks. When those checks disagree, the pattern of agreement and loss gives directional hints about which layer likely failed. That is what makes the harness a validation object: it turns agreement and disagreement across the roundtrip into evidence about where correctness holds or breaks.

PAWL applies this roundtrip validation pattern to two kinds of artifacts: full sandbox profiles and regex artifacts. A profile is a whole sandbox policy whose meaning is distributed across many operations, predicate structures, compiler effects, and runtime decisions, so it has no single clean semantic center; a regex artifact defines a regular language, so it has exactly one. That difference in subject is why profiles and regexes need different roundtrip validation shapes, and why there are two harnesses rather than one.

They share a role and a common result model. They do not share one unified mechanism.

What The Harness Is Comparing

     1 ◄══ normalized eq ══►  3
  Source                   Reversed
     │                   ↗    │
     │ compile    reverse     │ compile
     ▼             ╱          ▼
     2 ───────────╯           4
  Source ◄═ structural eq ═► Roundtrip
   blob                       blob
     │                         │
     └───◄═ behavioral eq ═►───┘

Read the diagram first as a sequence of artifacts to be produced:

  1. a source surface
  2. a compiled source blob
  3. a reversed or decoded surface
  4. a roundtrip compiled blob

Given these four artifacts, we can make the following kinds of comparisons unconditionally:

  • surface comparison between the source form and the reversed or decoded form
  • structural comparison between the source blob and the roundtrip blob

The following comparisons are available only when certain conditions hold:

  • semantic comparison when the subject admits a clean equivalence check
  • behavioral comparison when runtime evidence is available

The numbered list is the artifact generation sequence and the bulleted lists are the comparison bands. A given harness family may realize one band through several concrete checks, and some bands are only available when the subject and evidence support them. That interaction is why the five-point harness produces a localizing result rather than one equivalence claim.
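The artifact sequence and the two unconditional bands can be sketched in miniature. All names here are hypothetical illustrations, not PAWL's actual API, and real surface comparison is normalized rather than raw string equality:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical artifact container; field names are illustrative, not PAWL's API.
@dataclass
class RoundtripArtifacts:
    source: str                      # 1: source surface
    source_blob: bytes               # 2: compiled source blob
    reversed_surface: str            # 3: reversed or decoded surface
    roundtrip_blob: Optional[bytes]  # 4: roundtrip blob (absent in decode_only)

def unconditional_checks(a: RoundtripArtifacts) -> dict:
    """The two bands that always run; None marks a comparison that could not."""
    return {
        # surface band (the real comparison is normalized, not raw equality)
        "surface_eq": a.source == a.reversed_surface,
        # structural band, skipped when no roundtrip blob exists
        "structural_eq": (a.source_blob == a.roundtrip_blob
                          if a.roundtrip_blob is not None else None),
    }
```

The semantic and behavioral bands would attach here conditionally, which is exactly why the result is a vector of checks rather than one boolean.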

What The Harness Returns

At the conceptual level, both harness families return three things:

  • checks: the concrete active comparisons that realize the comparison bands described above
  • status: a derived summary using the shared status vocabulary in ../pawl/contract/status.py
  • distinguishing: the payload that explains why a result is not clean, or what special conditions shaped the interpretation

checks is the primary object: it tells you which comparisons were active and what each one found. status is a compressed readout of that evidence, and distinguishing is the explanation layer for non-clean or special outcomes. Typical distinguishing payloads include canonical diff summaries, first runtime divergence details, and special-case flags such as decode_only or timeout context.
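An illustrative shape for that result model, with field names mirroring the prose above rather than PAWL's code:

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative result shape; field names mirror the prose, not PAWL's code.
@dataclass
class HarnessResult:
    checks: dict                   # check name -> True / False / None (inactive)
    status: Optional[str] = None   # flat summary (regex harness only)
    distinguishing: dict = field(default_factory=dict)  # why-not-clean payload

# A hypothetical non-clean outcome: the surface held, canonical form did not,
# and no runtime signal was produced.
result = HarnessResult(
    checks={"sbpl_eq": True, "canonical_eq": False, "runtime_eq": None},
    distinguishing={"canonical_diff": "grouping changed under require-any"},
)
```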

No Strict Ladder

The checks do not form a strict ladder, and the harness should not be read as "highest passing level wins."

On the profile surface, observed outcomes already show why:

  • runtime_eq can pass while canonical_eq fails
  • semantic_eq can pass while canonical_eq fails
  • op_table_eq can pass while semantic_eq fails
  • sbpl_eq can pass while op_table_eq fails

That means there is no safe transitive story such as "if a later check passed, the earlier ones are implied." The harness is intentionally a vector because different surfaces expose different kinds of loss.

The regex harness has a different check structure, so the same exact monotonicity claims should not be projected onto it automatically. The safe shared claim is narrower: neither harness should be mentally collapsed into a single ordered ladder.
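A minimal sketch of the safe reading, over hypothetical check data: the vector is evaluated as a whole, and no passing check overrides a failing one:

```python
# Hypothetical observed outcome: runtime passes while canonical fails.
checks = {"canonical_eq": False, "semantic_eq": True,
          "op_table_eq": True, "runtime_eq": True}

def clean(checks: dict) -> bool:
    """Clean only if every ACTIVE check passed; there is no
    'highest passing level wins' shortcut and no implied transitivity."""
    return all(v for v in checks.values() if v is not None)
```

Here `clean(checks)` is False: the canonical_eq failure stands on its own even though runtime_eq passed.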

Profile Harness

In the profile harness, the comparison bands described above are realized through several concrete checks rather than one check per band.

  • The surface band is sbpl_eq. It compares source and reversed SBPL after source-aware reconciliation and IR-informed filtering. Before normalization, a reconciliation stage (ir/profile/sbpl_reconcile.py) calibrates the reversed text using the known source: it restores source regex forms where the compiler rewrote them to subpath/literal, strips compiler-merged require-not structure when the source has separate allow+deny rules, and normalizes trailing-slash and string-append artifacts. The reconciliation sidecar records sbpl_eq_before (native reverse surface) and sbpl_eq_after (reconciled), so consumers can distinguish reverser fidelity from reconciliation coverage. The check result in outcome.json tracks sbpl_eq_after.

  • The structural band is split across ir_eq, op_table_eq, and graph_eq. Together these ask whether the roundtrip preserved the important compiled shape of the policy: per-operation subgraph identity, slot classification, and reachable decision structure. The owner code lives in ir/profile/ir_builder.py and ../pawl/contract/.

  • The semantic band is canonical_eq and semantic_eq. These are compare-only normalization surfaces from ../pawl/normalize/reversed.py and ../pawl/normalize/policy.py: canonical_eq asks whether normalized rule structures still agree, while semantic_eq asks whether the same leaf predicates remain even when grouping changes.

  • The behavioral band is runtime_eq. For profiles it combines PolicyWitness in ir/profile/runtime_compare.py with the file-op oracle in ../runtime/oracle/profile.py.

One profile check deserves separate treatment: compiler_loss_adjusted_eq. This is not another peer comparison surface. It is a derived-view flag used by downstream classifiers when the remaining structural difference is attributed to compiler loss or optimization. Its job is to keep "known compiler loss" distinct from "unknown reverser defect" without changing the raw check vector.
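The band-to-check mapping described above can be written down directly. Band names come from this document and check names from the profile harness output; the lookup helper is illustrative:

```python
from typing import Optional

# How the profile harness realizes each comparison band through
# several concrete checks rather than one check per band.
PROFILE_BANDS = {
    "surface":    ["sbpl_eq"],
    "structural": ["ir_eq", "op_table_eq", "graph_eq"],
    "semantic":   ["canonical_eq", "semantic_eq"],
    "behavioral": ["runtime_eq"],
}

# compiler_loss_adjusted_eq is deliberately absent: it is a derived view
# consumed downstream, not a peer comparison surface.
def band_of(check: str) -> Optional[str]:
    return next((band for band, cs in PROFILE_BANDS.items() if check in cs), None)
```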

The profile harness can also stop before the full comparison story is available. Two important special cases are built into the owner code:

  • decode_only: the harness can decode the source artifact but cannot run a true roundtrip comparison. Message-filter profiles are the main example.

  • timeout: the harness exceeded its allowed evaluation budget before producing a normal result.

Those are still harness outcomes, but the profile harness does not collapse them into a single flat result label.

Regex Harness

The regex harness in ir/profile/regex/five_point_harness.py realizes those same comparison bands differently because its subject is a regular language rather than a whole policy.

  • The surface band is sbpl_eq_raw and sbpl_eq. These compare the authored and decoded regex surfaces at raw and canonicalized levels, so syntax preservation and compare-only normalization both matter.

  • The structural band is ir_eq_raw and ir_eq. These compare the regex IR recovered from source and roundtrip sides after decode, asking whether the decoded structural representation stayed aligned even when exact surface text changed.

  • The semantic band is carried by DFA parity, because DFA equivalence answers whether two patterns accept the same language. dfa_eq_source and dfa_eq_roundtrip compare decoder-derived and IR-derived automata, and the decoder-vs-direct drift checks (decoder_eq_direct_source, decoder_eq_direct_roundtrip) catch cases where the decoder's DFA and the recovered pattern text silently define different languages.

  • The behavioral band is optional runtime_eq. When present, it compares concrete runtime behavior on file-read subjects.
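True DFA parity requires building and comparing automata. As a rough illustration of what language equivalence asks, here is a bounded stand-in using Python's re module that compares full-match behavior on every short string over a small alphabet. This samples a finite prefix of the language and is not the harness's actual check:

```python
import re
from itertools import product

def bounded_language_eq(p1: str, p2: str,
                        alphabet: str = "ab", max_len: int = 6) -> bool:
    """Bounded stand-in for DFA parity: compare full-match behavior on
    every string over `alphabet` up to length `max_len`. True DFA
    equivalence decides the whole language; this only samples a prefix."""
    r1, r2 = re.compile(p1), re.compile(p2)
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(r1.fullmatch(s)) != bool(r2.fullmatch(s)):
                return False
    return True
```

For example, `a|b` and `[ab]` agree on every sampled string, while `a*` and `a+` diverge immediately on the empty string.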

Runtime Comparison

A passing runtime_eq means the runtime witnesses did not find a behavioral disagreement between source and roundtrip on the runtime targets they could actually use. It does not mean "the artifact is correct in all cases," and it does not replace the static comparison surfaces.

For profiles, the runtime story is split:

  • PolicyWitness, owned by ir/profile/runtime_compare.py, is the instrumented sandbox_check witness. It exercises the decision function and gathers supporting evidence, but it is still a decision-function witness, not a full behavior-path execution.
  • The binary file-op oracle in ../runtime/oracle/profile.py exercises a narrower set of real file operations under sandbox_init.

Those witnesses are combined into one runtime_eq result by the profile harness. A missing runtime signal can still leave a structurally clean result as ok. A passing runtime result can still coexist with structural differences, which is why runtime_eq=True does not collapse the whole harness into a runtime-only truth model.

For regex, runtime is simpler and narrower. The harness can compare actual behavior on concrete file-read subjects. That makes runtime a direct witness for those subjects, but still not a total substitute for the other checks.

Runtime availability is limited in concrete ways:

  • PolicyWitness only covers operation families that have mappings in _OP_ATTEMPT_MAP inside ir/profile/runtime_compare.py
  • the file-op oracle only participates for explicit file operations and self-skips when the IR yields no explicit file ops ("skipped": "no_file_ops" in ../runtime/oracle/profile.py)
  • decode_only profile outcomes never enter the full runtime-comparison path

So runtime_eq=None should be read as "no usable runtime signal was produced here," not as one specific failure mode. It may reflect witness coverage limits, unavailable execution prerequisites, or other runtime-side constraints.

runtime_eq=True also needs interpretation. Sometimes it reflects direct agreement on concrete runtime targets, and sometimes it reflects a thinner witness path such as PolicyWitness vacuity (vacuous_no_steps) or a skipped file-op oracle (no_file_ops); the surrounding payload tells you which kind of runtime support you actually had.

The safe mental model is: runtime is confirmatory evidence. It is not the only kind of evidence the harness trusts, and it is not universally available.
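One plausible way to combine the two profile witnesses into a tri-state runtime_eq, as a sketch. combine_runtime is hypothetical; the real combination logic lives in the harness:

```python
from typing import Optional

def combine_runtime(witness: Optional[bool],
                    oracle: Optional[bool]) -> Optional[bool]:
    """Hypothetical combination of the PolicyWitness and file-op oracle
    signals into one runtime_eq. None means 'no usable runtime signal',
    not a failure; any concrete disagreement wins over agreement."""
    signals = [s for s in (witness, oracle) if s is not None]
    if not signals:
        return None       # e.g. no witness coverage and a no_file_ops skip
    return all(signals)   # False as soon as either active witness disagreed
```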

Status Semantics

The shared vocabulary comes from ../pawl/contract/status.py, but the profile five-point harness no longer publishes a flat status field. It emits the check vector plus distinguishing evidence directly, and downstream consumers derive whatever view they need from those facts.

For example:

  • casefile generation maps the harness output into outcome.json's three axes (circuit, equality, runtime)
  • diagnostic and d-ring tooling derive flat views such as ok, degraded, or mismatch from those axes only when a summary label is actually useful

That separation matters. The harness is the evidence-producing layer; flat labels are interpretations.
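A downstream-style derivation of a flat label from the three axes might look like the following. The axis values and precedence here are illustrative assumptions, not the tooling's exact rules:

```python
def flat_view(circuit: str, equality: str, runtime: str) -> str:
    """Sketch of a downstream summary over outcome.json's three axes.
    Axis values and precedence are illustrative assumptions."""
    axes = (circuit, equality, runtime)
    if any(a == "mismatch" for a in axes):
        return "mismatch"   # any axis disagreement dominates
    if any(a == "degraded" for a in axes):
        return "degraded"   # weakened but not contradicted
    return "ok"
```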

Regex Status

The regex harness still derives status in its result summary inside ir/profile/regex/five_point_harness.py.

In prose, the regex rules are:

  • ok means the active regex comparison surfaces are clean
  • degraded means the harness still ran, but decode quality, fallback paths, or DFA-equivalence limits weakened the strength of the claim
  • unsupported means the needed capability boundary was not available for the stronger comparison the harness wanted to make
  • mismatch means at least one active non-raw comparison surface found a real disagreement
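Those prose rules can be sketched as a precedence function. The ordering among the non-ok labels is an assumption here; the harness derives status from its concrete checks:

```python
def regex_status(mismatch: bool, unsupported: bool, degraded: bool) -> str:
    """The regex status rules as a precedence sketch (ordering assumed)."""
    if mismatch:        # an active non-raw surface found a real disagreement
        return "mismatch"
    if unsupported:     # a needed capability boundary was unavailable
        return "unsupported"
    if degraded:        # decode quality or fallback paths weakened the claim
        return "degraded"
    return "ok"         # the regex harness never emits 'blocked'
```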

The regex harness does not produce blocked. Its strongest disagreement signal is mismatch. When downstream profile readers still derive a flat summary, blocked remains the profile-side label for concrete runtime-backed non-equivalence.

The important point is not that profile and regex use the same words. The important point is that regex still emits a flat status directly, while profile pushes callers toward the underlying evidence surfaces.

Diagnostic Analysis Is Downstream

ir/profile/diagnostic_analysis.py is not the five-point harness. It is a consumer of five-point outputs.

It reads the harness results and adds its own interpretive layer:

  • monotonicity flags
  • loss type
  • mechanism surface
  • solution-layer guesses
  • directional suggestions

That diagnostic overlay is useful, but it is a different object. The harness produces validation signals. The diagnostic analysis interprets them.

Boundaries and Deeper Reading

This document is about understanding the five-point harness as a validation object. It is not the owner document for every subsystem it touches. Operational guidance about what to do with a given result belongs in working guidance such as AGENTS.md, not here.

For deeper reading, these are the owner files behind the main surfaces described above: