harrier

Harrier is a high-performance Rust hashmap project focused on:

  • SIMD-accelerated control-byte probing
  • two-choice cuckoo placement
  • fast bounded insertion search (BFS displacement)
  • robust rare-path fallback for pathological collision patterns

Current implementations:

  • HarrierMap<K, V>: generic map with indirect SIMD control-byte probing.
  • HarrierU64Map<V>: specialized integer-key map under active optimization.

Status

This repository is currently in active performance iteration.

  • v0.1 goal: minimal but usable map/set API with strong correctness.
  • v0.2 goal: extensive tuning and adaptive policies for state-of-the-art benchmark results.

Current implementation notes

  • HarrierMap uses:
    • indirect SIMD control-byte probing (16-byte groups)
    • 2-choice cuckoo placement
    • BFS relocation on insert pressure
    • rare-path overflow stash fallback
  • Optional reseed hook is available via ReseedableBuildHasher for collision recovery paths.
  • Lookup fast path includes deletion-aware early miss short-circuiting.
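
As an illustration of the control-byte probing listed above, here is a minimal scalar sketch of matching a tag against one 16-byte group. It is not Harrier's implementation: the EMPTY marker, the 7-bit tag derivation in tag_of, and the helper names match_tag and candidate_slots are assumptions made for this example. A real SIMD path would replace the loop with a vector compare plus movemask.

```rust
/// Number of control bytes examined per probe (the 16-byte groups noted above).
const GROUP_WIDTH: usize = 16;
/// Hypothetical marker for an empty slot; tags keep the high bit clear.
const EMPTY: u8 = 0x80;

/// Derive a 7-bit tag from the top bits of a 64-bit hash (assumed scheme).
fn tag_of(hash: u64) -> u8 {
    (hash >> 57) as u8
}

/// Scalar stand-in for a SIMD compare + movemask: bit i of the result is set
/// when control byte i equals `tag`.
fn match_tag(group: &[u8; GROUP_WIDTH], tag: u8) -> u16 {
    let mut mask = 0u16;
    for (i, &ctrl) in group.iter().enumerate() {
        if ctrl == tag {
            mask |= 1 << i;
        }
    }
    mask
}

/// List the slot indices named by a match mask, lowest index first.
fn candidate_slots(mut mask: u16) -> Vec<usize> {
    let mut slots = Vec::new();
    while mask != 0 {
        slots.push(mask.trailing_zeros() as usize);
        mask &= mask - 1; // clear the lowest set bit
    }
    slots
}
```

Only the slots returned by candidate_slots would need a full key comparison, which is why a tag mismatch across the whole group is a cheap miss.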

Quick start

use harrier::HarrierMap;

let mut map = HarrierMap::new();
map.insert("k", 42);
assert_eq!(map.get(&"k"), Some(&42));
*map.get_or_insert("count", 0) += 1;

For deterministic u64 hashing (e.g. reproducible tests/benchmarks), use the seeded constructor on HarrierU64Map:

# use harrier::HarrierU64Map;
let mut map = HarrierU64Map::with_capacity_and_seed(1024, 0x1234_5678_9abc_def0);
map.insert(1, 10);
assert_eq!(map.get(1), Some(&10));

For build-mostly workloads with guaranteed-unique keys, advanced users can use the unsafe fast path:

# use harrier::HarrierMap;
# let mut map = HarrierMap::with_capacity(128);
for i in 1..=100u64 {
    // SAFETY: keys are unique in this loop.
    unsafe { map.insert_unique_unchecked(i, i * 2); }
}

If capacity is already reserved and you want to skip per-insert growth checks, use insert_unique_unchecked_no_grow:

# use harrier::HarrierMap;
# let mut map = HarrierMap::with_capacity(256);
for i in 1..=200u64 {
    // SAFETY: keys are unique and capacity is pre-reserved.
    unsafe { map.insert_unique_unchecked_no_grow(i, i * 2); }
}

Run tests

cargo test

Run benchmark harness

cargo run --release --bin bench

Optionally tune iterations:

HARRIER_BENCH_ITERS=1000000 cargo run --release --bin bench

You can also tune sample count for median-based reporting:

HARRIER_BENCH_ITERS=200000 HARRIER_BENCH_RUNS=3 cargo run --release --bin bench

To take the median across repeated benchmark processes (useful for taming process-level jitter), use the helper script knob:

HARRIER_BENCH_PROCESS_REPEATS=3 bash scripts/bench_iter.sh

When process repeats are enabled, the script prints a per-row repeat-stability table (CV/min/max/median across repeats) before writing the median-aggregated TSV. You can tune the table length with:

HARRIER_BENCH_PROCESS_REPEATS=3 HARRIER_BENCH_PROCESS_REPEAT_STABILITY_LIMIT=20 bash scripts/bench_iter.sh
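
For readers unfamiliar with the columns in the repeat-stability table, the following sketch shows one way to compute CV/min/max/median for a single row across repeats. The RepeatSummary type and summarize function are hypothetical names, and the use of population variance for the CV is an assumption; the helper script may compute these differently.

```rust
/// Summary of one benchmark row across process repeats (hypothetical type).
#[derive(Debug, PartialEq)]
struct RepeatSummary {
    min: f64,
    max: f64,
    median: f64,
    /// Coefficient of variation: standard deviation divided by mean.
    cv: f64,
}

/// Median of a non-empty slice (mean of the two middle values for even lengths).
fn median(values: &[f64]) -> f64 {
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 1 {
        sorted[mid]
    } else {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    }
}

/// Summarize ns/op samples from repeated runs; returns None for empty input.
fn summarize(samples: &[f64]) -> Option<RepeatSummary> {
    if samples.is_empty() {
        return None;
    }
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    // Assumption: population variance (divide by n, not n - 1).
    let variance = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    let cv = if mean != 0.0 { variance.sqrt() / mean } else { 0.0 };
    Some(RepeatSummary {
        min: samples.iter().copied().fold(f64::INFINITY, f64::min),
        max: samples.iter().copied().fold(f64::NEG_INFINITY, f64::max),
        median: median(samples),
        cv,
    })
}
```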

If lookup-path diagnostics are enabled (HARRIER_BENCH_LOOKUP_STATS=1), process-repeat runs also print a Lookup path stability table from stderr logs showing median/CV for key path-mix rates:

HARRIER_BENCH_PROCESS_REPEATS=3 \
HARRIER_BENCH_LOOKUP_STATS=1 \
HARRIER_BENCH_PROCESS_REPEAT_LOOKUP_STABILITY_LIMIT=20 \
bash scripts/bench_iter.sh

If insert phase timing is enabled, process-repeat runs also print an Insert phase stability table based on stderr phase logs:

HARRIER_BENCH_PROCESS_REPEATS=3 \
HARRIER_BENCH_INSERT_PHASE_TIMING=1 \
HARRIER_BENCH_PROCESS_REPEAT_PHASE_STABILITY_LIMIT=20 \
bash scripts/bench_iter.sh

If insert checkpoint timing is also enabled, process-repeat runs print an Insert checkpoint stability table summarizing tail_head_ratio and per-segment CVs across repeats:

HARRIER_BENCH_PROCESS_REPEATS=3 \
HARRIER_BENCH_INSERT_PHASE_TIMING=1 \
HARRIER_BENCH_INSERT_PHASE_CHECKPOINTS=3 \
HARRIER_BENCH_PROCESS_REPEAT_CHECKPOINT_STABILITY_LIMIT=20 \
bash scripts/bench_iter.sh

Process-repeat runs also print a repeat-synchrony table to help identify whether outlier repeats are shared across implementations (environmental) or impl-local. The default outlier mode is robust MAD (HARRIER_BENCH_PROCESS_REPEAT_EVENT_MODE=mad). Tune synchrony output with:

HARRIER_BENCH_PROCESS_REPEATS=3 \
HARRIER_BENCH_PROCESS_REPEAT_SYNC_LIMIT=20 \
HARRIER_BENCH_PROCESS_REPEAT_EVENT_MODE=mad \
HARRIER_BENCH_PROCESS_REPEAT_MAD_Z=3.5 \
HARRIER_BENCH_PROCESS_REPEAT_SPIKE_RATIO=1.15 \
bash scripts/bench_iter.sh

Supported event modes:

  • mad (default): two-sided robust outlier detection around median.
  • ratio: high-side threshold using PROCESS_REPEAT_SPIKE_RATIO.
  • quantile: two-sided tail detection using PROCESS_REPEAT_QUANTILE_CUTOFF.
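
The mad and ratio modes can be sketched as follows. This is an illustration under stated assumptions, not the script's code: the MAD variant uses the common modified z-score (0.6745 × deviation / MAD) compared against the MAD_Z cutoff, and the ratio variant flags samples above spike_ratio × median. The function names are hypothetical, and the quantile mode is omitted because its cutoff semantics are not described here.

```rust
/// Median of a non-empty slice (mean of the two middle values for even lengths).
fn median(values: &[f64]) -> f64 {
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 1 {
        sorted[mid]
    } else {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    }
}

/// Two-sided MAD detection: indices whose modified z-score exceeds `z_cutoff`
/// in absolute value (assumed formula: 0.6745 * (x - median) / MAD).
fn mad_events(samples: &[f64], z_cutoff: f64) -> Vec<usize> {
    if samples.is_empty() {
        return Vec::new();
    }
    let med = median(samples);
    let deviations: Vec<f64> = samples.iter().map(|x| (x - med).abs()).collect();
    let mad = median(&deviations);
    if mad == 0.0 {
        return Vec::new(); // no spread around the median: flag nothing
    }
    samples
        .iter()
        .enumerate()
        .filter(|&(_, &x)| (0.6745 * (x - med) / mad).abs() > z_cutoff)
        .map(|(i, _)| i)
        .collect()
}

/// High-side ratio detection: indices where the sample exceeds
/// `spike_ratio` times the median.
fn ratio_events(samples: &[f64], spike_ratio: f64) -> Vec<usize> {
    if samples.is_empty() {
        return Vec::new();
    }
    let med = median(samples);
    samples
        .iter()
        .enumerate()
        .filter(|&(_, &x)| x > spike_ratio * med)
        .map(|(i, _)| i)
        .collect()
}
```

With these definitions a slow repeat is flagged by both modes, while an unusually fast repeat is flagged only by the two-sided MAD mode.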

When multiple comparable process-repeat runs exist, the script also prints a cross-run persistence summary of shared event indices (which repeat positions keep co-spiking across runs):

HARRIER_BENCH_PROCESS_REPEATS=5 \
HARRIER_BENCH_PROCESS_REPEAT_PERSIST_LIMIT=20 \
bash scripts/bench_iter.sh

If phase timing is enabled, it also prints a cross-run phase-alignment persistence table that joins repeat JSON + stderr phase logs and reports how often shared total-cost events align with setup (clear+pretouch) vs measured insert events. When checkpoint timing is enabled, the same table also reports tail_head_ratio shared-rate/overlap columns and:

  • alignment_class (setup_dominant, measured_uniform, measured_tail_skew, tail_skew_only, mixed, none)
  • measured_outlier_shape (setup_dominant, whole_loop_scaling, segment_skew, tail_skew_without_measured_alignment, measured_inconclusive, none), which helps quickly distinguish segment-skew outliers from whole-loop scaling.

When insert rows are present, the same report also appends # insert_phase_verdict comment lines with a coarse insert algorithm gate: blocked_noisy_shared_signal, blocked_environmental_scaling, allow_segment_targeting, or inconclusive_alignment. For example:

HARRIER_BENCH_PROCESS_REPEATS=5 \
HARRIER_BENCH_INSERT_PHASE_TIMING=1 \
HARRIER_BENCH_PROCESS_REPEAT_PHASE_PERSIST_LIMIT=20 \
bash scripts/bench_iter.sh

To keep each raw repeat TSV (instead of deleting temporary repeat files), add:

HARRIER_BENCH_PROCESS_REPEATS=3 HARRIER_BENCH_KEEP_REPEAT_FILES=1 bash scripts/bench_iter.sh

By default, process-repeat runs also write a consolidated JSON sidecar (<stamp>-<mode>.repeats.json) containing per-row repeat series and summary stats. Disable it with:

HARRIER_BENCH_PROCESS_REPEATS=3 HARRIER_BENCH_REPEAT_JSON=0 bash scripts/bench_iter.sh

Optional warmup loop control:

HARRIER_BENCH_WARMUP_ITERS=50000 cargo run --release --bin bench

Optional timed-insert warmup runs (helps reduce first-allocation noise in insert_new / insert_update benchmarks):

HARRIER_BENCH_INSERT_WARMUP_RUNS=1 cargo run --release --bin bench

Optional per-case internal repeats for insert benchmarks (each reported sample is the median of its internal repeats):

HARRIER_BENCH_INSERT_CASE_REPEATS=3 cargo run --release --bin bench

Optional map reuse mode for insert benchmarks (reuses one pre-allocated map per measured run and clears between case repeats):

HARRIER_BENCH_INSERT_REUSE_MAP=1 cargo run --release --bin bench

Optional insert pre-touch mode (runs an untimed fill+clear before each measured insert_new sample to reduce allocator/page-fault spikes):

HARRIER_BENCH_INSERT_PRETOUCH=1 cargo run --release --bin bench

Optional insert-path diagnostics (prints BFS displacement counters to stderr for harrier/harrier_u64 insert_new benchmarks):

HARRIER_BENCH_INSERT_STATS=1 cargo run --release --bin bench

Optional lookup-path diagnostics (prints primary/secondary probe and key-compare counters to stderr for find_hit/find_miss on harrier and harrier_u64):

HARRIER_BENCH_LOOKUP_STATS=1 cargo run --release --bin bench

Lookup diagnostic fields:

  • primary_group_probes / secondary_group_probes: main and alternate-group probe counts
  • tag_matches: total control-tag candidate matches across probed groups
  • key_comparisons: total key equality checks performed for matched tags
  • secondary_hits: successful lookups resolved in alternate groups
  • overflow_lookups: fallback overflow scans performed after main-table miss
  • overflow_matches: successful lookups resolved from overflow fallback
  • normalized rates:
    • secondary_probe_rate
    • tag_matches_per_lookup
    • key_comparisons_per_lookup
    • key_comparisons_per_primary
    • secondary_hit_rate
    • overflow_lookup_rate
    • overflow_match_rate.
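
The normalized rates are ratios of the raw counters. The sketch below shows one plausible derivation. The denominators are assumptions: the lookups field is a hypothetical total-lookup count, and only key_comparisons_per_primary has a denominator implied by its name. The struct and method names mirror the diagnostic field names for readability only.

```rust
/// Raw lookup counters, named after the diagnostic fields above.
/// `lookups` is a hypothetical total used as the assumed denominator.
#[derive(Default)]
struct LookupCounters {
    lookups: u64,
    primary_group_probes: u64,
    secondary_group_probes: u64,
    tag_matches: u64,
    key_comparisons: u64,
    secondary_hits: u64,
    overflow_lookups: u64,
    overflow_matches: u64,
}

/// Divide two counters, returning 0.0 when the denominator is zero.
fn ratio(numerator: u64, denominator: u64) -> f64 {
    if denominator == 0 {
        0.0
    } else {
        numerator as f64 / denominator as f64
    }
}

impl LookupCounters {
    fn secondary_probe_rate(&self) -> f64 {
        ratio(self.secondary_group_probes, self.lookups)
    }
    fn tag_matches_per_lookup(&self) -> f64 {
        ratio(self.tag_matches, self.lookups)
    }
    fn key_comparisons_per_lookup(&self) -> f64 {
        ratio(self.key_comparisons, self.lookups)
    }
    fn key_comparisons_per_primary(&self) -> f64 {
        ratio(self.key_comparisons, self.primary_group_probes)
    }
    fn secondary_hit_rate(&self) -> f64 {
        ratio(self.secondary_hits, self.lookups)
    }
    fn overflow_lookup_rate(&self) -> f64 {
        ratio(self.overflow_lookups, self.lookups)
    }
    fn overflow_match_rate(&self) -> f64 {
        ratio(self.overflow_matches, self.lookups)
    }
}
```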

When running through scripts/bench_iter.sh, lookup diagnostics also power:

  • Lookup path stability (repeat-window median/CV over lookup path rates)
  • Lookup stats drift vs previous run (metric deltas vs previous comparable run)
  • Lookup drift signal persistence (rolling counts of stable speedups/regressions across recent comparable lookup-drift sidecars)

These tables are produced using scripts/bench_lookup_path_stability.py and scripts/bench_lookup_diff.py. The drift output includes ns_delta_pct and a path_mix_stable flag (set when lookup-path rates are unchanged), plus ns_trend and stable_speedup / stable_regression / signal_class classifications. It also emits relative-vs-hashbrown drift fields (old_rel_vs_hashbrown, new_rel_vs_hashbrown, rel_delta_pct, rel_trend, stable_relative_speedup, stable_relative_regression, rel_signal_class) plus a short summary line counting absolute and relative stable speedups/regressions. Path-mix deltas include overflow fallback rates (overflow_lookup_rate_delta, overflow_match_rate_delta) so overflow-driven lookup regressions are explicitly classified as path shifts.

When previous comparable lookup diagnostics are available, bench_iter.sh also persists this drift table as a sidecar:

  • <stamp>-<mode>.lookup_diff.tsv

Tune drift stability sensitivity with:

  • HARRIER_BENCH_LOOKUP_PATH_EPS (default 1e-9)

Tune ns/op trend deadzone with:

  • HARRIER_BENCH_LOOKUP_NS_EPS_PCT (default 0.0)

Tune persistence table length with:

  • HARRIER_BENCH_LOOKUP_SIGNAL_PERSIST_LIMIT (default 12)

The persistence table (scripts/bench_lookup_signal_persistence.py) reports both absolute and relative-vs-hashbrown stable signal counts (net_stable_score and net_relative_score), and sorts by relative regression persistence first. It also appends # op_relative_summary comment lines that aggregate relative drift persistence by operation (useful for tracking find_hit keep/revert pressure across windows).

When at least three lookup drift sidecars are available, bench_iter.sh also prints a Lookup relative verdict table (via scripts/bench_lookup_relative_verdict.py) that summarizes per-impl and combined Harrier relative signal counts for one operation and emits a keep/reject/inconclusive verdict. Tune that verdict with:

  • HARRIER_BENCH_LOOKUP_VERDICT_OP (default find_hit)
  • HARRIER_BENCH_LOOKUP_VERDICT_MIN_RUNS (default 3)
  • HARRIER_BENCH_LOOKUP_VERDICT_MIN_NET (default 1)
  • HARRIER_BENCH_LOOKUP_VERDICT_MAX_REGRESSIONS (default 1)
  • HARRIER_BENCH_LOOKUP_VERDICT_MAX_PATH_SHIFT (default 1)

Optional contains-operation benchmarks (contains_hit / contains_miss) can be enabled for lookup-path analysis without value-load costs:

HARRIER_BENCH_INCLUDE_CONTAINS=1 cargo run --release --bin bench

Hit-style operations (find_hit, contains_hit, and find_hit_prehashed) cycle through keys 1..=n, so they remain true-hit workloads even when n is not a power of two.
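
A minimal sketch of that key cycling, assuming a zero-based iteration counter (the function name hit_key is hypothetical and the harness may structure its loop differently):

```rust
/// Key for the i-th lookup in a hit-style benchmark over keys 1..=n.
/// The modulo keeps every generated key inside the inserted range,
/// whether or not n is a power of two.
fn hit_key(i: u64, n: u64) -> u64 {
    assert!(n > 0, "key range must be non-empty");
    (i % n) + 1
}
```

Because the key is always in 1..=n, every lookup targets a key that was inserted, and every inserted key is visited once per n iterations.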

Optional prehashed lookup benchmarks (find_hit_prehashed / find_miss_prehashed) can be enabled to run lookups with caller-supplied precomputed hashes (for both Harrier and hashbrown):

HARRIER_BENCH_INCLUDE_PREHASHED=1 cargo run --release --bin bench

When run via scripts/bench_iter.sh, prehashed mode also prints a find_* vs find_*_prehashed delta table (via scripts/bench_lookup_decompose.py) so you can quickly inspect how each implementation changes under prehashed lookup mode.

Insert diagnostic fields (printed when HARRIER_BENCH_INSERT_STATS=1):

  • direct_primary_inserts: inserts placed directly in primary group (g0)
  • direct_secondary_inserts: inserts placed directly in alternate group (g1)
  • direct_primary_rate / direct_secondary_rate: normalized direct-placement ratios
  • one_step_searches: insertions that attempted the one-step displacement fast path
  • one_step_slots_scanned: candidate slots inspected while searching one-step moves
  • one_step_duplicate_child_groups: duplicate one-step child alternates skipped within a root-group scan
  • one_step_hits: insertions resolved directly via one-step displacement
  • one_step_slots_per_search and one_step_hit_rate: normalized one-step fast-path efficiency ratios
  • bfs_searches: number of insertions that entered the BFS displacement path
  • bfs_groups_scanned: total BFS parent groups expanded
  • bfs_duplicate_child_groups: duplicate child alternates skipped per parent expansion
  • bfs_displacements: number of moved entries along discovered displacement paths
  • bfs_rate: fraction of inserts that entered BFS search
  • groups_per_search and displacements_per_insert: normalized ratios helpful for correlating runtime outliers with algorithmic behavior.
  • overflow_stashes / overflow_rate: rare fallback count and fraction.

Optional insert phase timing diagnostics (prints clear / pretouch-fill / timed-insert phase costs to stderr for all insert_new implementations):

HARRIER_BENCH_INSERT_PHASE_TIMING=1 cargo run --release --bin bench

Optional finer measured-insert checkpoint timing (split measured insert loop into checkpoints+1 timed segments; emits insert_phase_ckpt diagnostics):

HARRIER_BENCH_INSERT_PHASE_TIMING=1 \
HARRIER_BENCH_INSERT_PHASE_CHECKPOINTS=3 \
cargo run --release --bin bench

insert_phase_ckpt output includes per-segment segN_ns_per_insert plus head_ns_per_insert, tail_ns_per_insert, and tail_head_ratio to help localize whether late-loop segments dominate unstable runs.
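
To make the checkpoint fields concrete, here is a sketch of how they could be derived from per-segment timings. It assumes that head refers to the first segment and tail to the last, and that segments are split as evenly as possible; the Segment and CheckpointSummary types and both function names are hypothetical, not the harness's own.

```rust
/// One timed slice of the measured insert loop (hypothetical type).
struct Segment {
    elapsed_ns: u64,
    inserts: u64,
}

/// Derived checkpoint metrics, named after the diagnostic fields above.
#[derive(Debug)]
struct CheckpointSummary {
    seg_ns_per_insert: Vec<f64>,
    head_ns_per_insert: f64,
    tail_ns_per_insert: f64,
    tail_head_ratio: f64,
}

/// Split `total` inserts into `checkpoints + 1` near-equal segment lengths,
/// giving any remainder to the earliest segments.
fn segment_lengths(total: u64, checkpoints: u64) -> Vec<u64> {
    let segments = checkpoints + 1;
    let base = total / segments;
    let extra = total % segments;
    (0..segments).map(|i| base + u64::from(i < extra)).collect()
}

/// Compute per-segment cost and the tail/head ratio.
/// Assumption: head is the first segment and tail is the last one.
fn summarize_segments(segments: &[Segment]) -> Option<CheckpointSummary> {
    let per_segment: Vec<f64> = segments
        .iter()
        .map(|s| {
            if s.inserts == 0 {
                0.0
            } else {
                s.elapsed_ns as f64 / s.inserts as f64
            }
        })
        .collect();
    let head = *per_segment.first()?;
    let tail = *per_segment.last()?;
    if head == 0.0 {
        return None; // ratio undefined without a measurable head segment
    }
    Some(CheckpointSummary {
        seg_ns_per_insert: per_segment,
        head_ns_per_insert: head,
        tail_ns_per_insert: tail,
        tail_head_ratio: tail / head,
    })
}
```

A tail_head_ratio well above 1.0 under these assumptions would indicate that late-loop inserts cost more than early ones.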

When running through scripts/bench_iter.sh, stderr is persisted to benchmarks/results/<stamp>-<mode>.stderr.log.

Optional deterministic generic-hasher mode (reduces run-to-run seed variance for HarrierMap and hashbrown comparisons):

HARRIER_BENCH_FIXED_HASHER=1 cargo run --release --bin bench
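
How the harness implements fixed-hasher mode is not described here. As a general illustration of the idea, the standard library alone can produce a deterministic BuildHasher: BuildHasherDefault<DefaultHasher> constructs every hasher with the same default state, unlike RandomState, which is randomly seeded. The FixedState alias and the two helper functions are names chosen for this sketch.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{BuildHasher, BuildHasherDefault};

/// A BuildHasher whose hashers always start from the same state.
type FixedState = BuildHasherDefault<DefaultHasher>;

/// Hash a key with a freshly built fixed-state hasher.
fn fixed_hash(key: &str) -> u64 {
    FixedState::default().hash_one(key)
}

/// Build a std HashMap that uses the fixed-state hasher.
fn fixed_map() -> HashMap<&'static str, u32, FixedState> {
    HashMap::with_hasher(FixedState::default())
}
```

Two independently constructed FixedState values hash the same key to the same value within a given toolchain, which removes seed variance as a source of run-to-run noise.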

Iterative benchmark tracking

Use the helper script to capture timestamped benchmark files and diff against the previous run:

bash scripts/bench_iter.sh

bench_iter.sh takes an exclusive lock in benchmarks/results/.bench_iter.lock when flock is available, so concurrent helper invocations are serialized instead of contaminating each other’s measurements. It also sets PYTHONDONTWRITEBYTECODE=1 for helper Python scripts to avoid generating transient __pycache__ files during benchmark runs.

This writes results to benchmarks/results/*.tsv and prints per-case percent delta vs the previous run in the same hasher mode (-default.tsv vs -fixed.tsv). It also prints a rolling median summary and instability table over the most recent 5 runs with matching metadata (falling back to same-mode runs when no matching metadata is found).

Each run writes a sidecar *.meta.json capturing benchmark knob settings (plus a benchmark schema version) for traceability. When metadata is present, the script prefers comparing against the most recent prior run with matching benchmark knobs. If you explicitly want legacy fallback behavior (compare to the previous timestamp when no metadata match exists), set HARRIER_BENCH_ALLOW_FALLBACK_PREV=1.

The knob set includes:

  • insert-specific knobs such as HARRIER_BENCH_INSERT_CASE_REPEATS, reuse-map, pre-touch settings, insert stats mode, and insert phase timing mode
  • process-repeats mode, including repeat stability/synchrony knobs, event mode/threshold knobs, the repeat persistence knob, repeat JSON export mode, repeat phase-stability/persistence limit knobs, the repeat lookup-stability limit knob, the repeat checkpoint-stability limit knob, the repeat insert checkpoint timing knob, the lookup-stats + lookup-path-eps + lookup-ns-eps-pct + lookup-signal-persist-limit + include-contains + include-prehashed knobs, and repeat-file retention mode
  • targeted filters like HARRIER_BENCH_ONLY_OP / HARRIER_BENCH_ONLY_IMPL

For less cross-case interference while benchmarking, you can isolate each (n, load) case into a separate process:

HARRIER_BENCH_ISOLATE_CASES=1 bash scripts/bench_iter.sh

For maximum isolation (each operation in a separate process per (n, load)), enable op isolation too:

HARRIER_BENCH_ISOLATE_CASES=1 HARRIER_BENCH_ISOLATE_OPS=1 bash scripts/bench_iter.sh

To isolate each implementation (harrier, harrier_u64, hashbrown) into its own process as well, add:

HARRIER_BENCH_ISOLATE_CASES=1 HARRIER_BENCH_ISOLATE_OPS=1 HARRIER_BENCH_ISOLATE_IMPLS=1 bash scripts/bench_iter.sh

To reduce thermal/time-order bias in isolated mode, shuffle (n, load) case order:

HARRIER_BENCH_ISOLATE_CASES=1 HARRIER_BENCH_SHUFFLE_CASES=1 bash scripts/bench_iter.sh

For reproducible shuffled order across runs, set a shuffle seed:

HARRIER_BENCH_ISOLATE_CASES=1 HARRIER_BENCH_SHUFFLE_CASES=1 HARRIER_BENCH_SHUFFLE_SEED=123 bash scripts/bench_iter.sh
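
The helper script's own shuffle mechanism is not shown here. The sketch below only illustrates why a seed makes the order reproducible: a Fisher-Yates shuffle driven by a small deterministic generator (SplitMix64 in this example) always yields the same permutation for the same seed. The function names and the (n, load) tuple layout are assumptions.

```rust
/// One step of the SplitMix64 generator: advances `state` and returns
/// the next pseudo-random value.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Fisher-Yates shuffle with a seeded generator. The same seed always
/// produces the same order. The modulo introduces a negligible bias
/// for case lists of this size.
fn shuffle_cases<T>(cases: &mut [T], seed: u64) {
    let mut state = seed;
    for i in (1..cases.len()).rev() {
        let j = (splitmix64(&mut state) % (i as u64 + 1)) as usize;
        cases.swap(i, j);
    }
}
```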

To reduce scheduler noise further, optionally pin benchmark processes to one CPU and/or adjust priority:

HARRIER_BENCH_TASKSET_CPU=2 HARRIER_BENCH_NICE=-5 bash scripts/bench_iter.sh

For targeted insert stability studies, CPU pinning is especially useful (in our isolated insert_new runs it reduced CV from double-digit percentages to sub-1% in repeated samples).

For a standardized pinned lookup diagnostics gate (fixed hasher, process repeats, lookup stats + contains + prehashed ops on the large-case tuple), use:

bash scripts/bench_lookup_gate.sh

All knobs in this helper remain overridable via environment variables. By default it also sets HARRIER_BENCH_LOOKUP_NS_EPS_PCT=0.2 so drift classification treats sub-0.2% ns/op movement as flat, and HARRIER_BENCH_LOOKUP_PATH_EPS=0.002 so tiny path-rate jitter is classified as path-mix stable. It also sets default relative verdict knobs (HARRIER_BENCH_LOOKUP_VERDICT_OP=find_hit, HARRIER_BENCH_LOOKUP_VERDICT_MIN_RUNS=3, HARRIER_BENCH_LOOKUP_VERDICT_MIN_NET=1, HARRIER_BENCH_LOOKUP_VERDICT_MAX_REGRESSIONS=1, HARRIER_BENCH_LOOKUP_VERDICT_MAX_PATH_SHIFT=1).
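
The two epsilon knobs act as deadzones. A rough sketch of that classification follows, assuming ns_delta_pct is the percent change relative to the old value and that movement at or below the epsilon counts as flat; the Trend enum, the function names, and the boundary handling are assumptions rather than the scripts' actual logic.

```rust
/// Direction of an ns/op change after applying the deadzone (hypothetical type).
#[derive(Debug, PartialEq)]
enum Trend {
    Faster,
    Slower,
    Flat,
}

/// Percent change of `new_ns` relative to `old_ns`.
fn ns_delta_pct(old_ns: f64, new_ns: f64) -> f64 {
    (new_ns - old_ns) / old_ns * 100.0
}

/// Classify an ns/op change, treating movement within `eps_pct` as flat.
fn classify_ns_trend(old_ns: f64, new_ns: f64, eps_pct: f64) -> Trend {
    let delta = ns_delta_pct(old_ns, new_ns);
    if delta.abs() <= eps_pct {
        Trend::Flat
    } else if delta < 0.0 {
        Trend::Faster
    } else {
        Trend::Slower
    }
}

/// Path mix is stable when every path rate moved by at most `eps`.
fn path_mix_stable(old_rates: &[f64], new_rates: &[f64], eps: f64) -> bool {
    old_rates.len() == new_rates.len()
        && old_rates
            .iter()
            .zip(new_rates)
            .all(|(old, new)| (old - new).abs() <= eps)
}
```

Under these assumptions, the gate defaults (0.2 and 0.002) absorb small jitter that the bench_iter.sh defaults (0.0 and 1e-9) would report as movement.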

For an automated multi-run lookup decision gate that repeatedly executes the lookup helper and then prints a final relative verdict (keep/reject/inconclusive) for one operation, use:

bash scripts/bench_lookup_verdict_gate.sh

Useful knobs:

  • HARRIER_BENCH_VERDICT_GATE_RUNS (default 3)
  • HARRIER_BENCH_LOOKUP_VERDICT_LIMIT (default 5; rolling diff files used)
  • HARRIER_BENCH_LOOKUP_VERDICT_SCOPE (default current_window; set to matching_history to compute verdict from the rolling comparable history instead of only the diff sidecars generated by the current gate invocation)
  • HARRIER_BENCH_LOOKUP_VERDICT_OP (default find_hit)
  • HARRIER_BENCH_LOOKUP_VERDICT_MIN_RUNS (default 3)
  • HARRIER_BENCH_LOOKUP_VERDICT_MIN_NET (default 1)
  • HARRIER_BENCH_LOOKUP_VERDICT_MAX_REGRESSIONS (default 1)
  • HARRIER_BENCH_LOOKUP_VERDICT_MAX_PATH_SHIFT (default 1)
  • HARRIER_BENCH_VERDICT_FAIL_ON_REJECT (default 0; set to 1 to exit non-zero on a combined reject verdict).
  • HARRIER_BENCH_VERDICT_FAIL_ON_INSUFFICIENT_DATA (default 0; set to 1 to fail when combined verdict is insufficient_data/unknown)
  • HARRIER_BENCH_VERDICT_FAIL_ON_INCONCLUSIVE (default 0; set to 1 to fail when combined verdict is inconclusive)
  • HARRIER_BENCH_VERDICT_FAIL_ON_VERDICT_DEGRADE (default 0; set to 1 to fail when the combined verdict rank worsens versus the previous matching verdict sidecar).
  • HARRIER_BENCH_VERDICT_FAIL_ON_NET_DROP (default 0; set to 1 to fail when the combined net relative score drops versus the previous matching verdict sidecar).
  • HARRIER_BENCH_VERDICT_NET_DROP_MIN (default 1; minimum drop threshold used by ...FAIL_ON_NET_DROP).
  • HARRIER_BENCH_VERDICT_FAIL_ON_PATH_SHIFT (default 0; set to 1 to fail when combined relative path-shift runs exceed the configured max)
  • HARRIER_BENCH_VERDICT_PATH_SHIFT_MAX (default 0; maximum allowed combined relative path-shift runs when fail-on-path-shift is enabled)
  • HARRIER_BENCH_VERDICT_EARLY_STOP_ON_NON_KEEP (default 1; after each gate run once HARRIER_BENCH_LOOKUP_VERDICT_MIN_RUNS is reached, compute a provisional combined verdict on currently selected lookup-diff sidecars and stop early when that verdict is already terminal under active fail flags (reject/inconclusive/insufficient-data/unknown))
  • HARRIER_BENCH_CANDIDATE_LABEL (default empty; optional label recorded in verdict sidecars/log output for experiment traceability)

The helper also writes a sidecar verdict report next to the latest fixed TSV as <stamp>-fixed.lookup_verdict.tsv. If a previous matching verdict sidecar exists, it also prints and records the combined verdict delta (previous -> current). The verdict sidecar also records the active verdict_scope and the count of selected lookup-diff sidecars used to compute the verdict, plus git_commit, candidate_label, verdict_gate_runs_requested, verdict_gate_runs_completed, verdict_gate_early_stopped, verdict_gate_early_stop_verdict, and combined_path_shift_runs for run traceability. If no lookup-diff sidecars are available/selected, it writes an insufficient_data verdict sidecar and honors HARRIER_BENCH_VERDICT_FAIL_ON_INSUFFICIENT_DATA.

For strict candidate experimentation (auto-fail on reject, inconclusive, insufficient data, verdict degradation, or net-score drops across scoped windows), use:

bash scripts/bench_lookup_candidate_gate.sh

This wrapper defaults to:

  • HARRIER_BENCH_LOOKUP_VERDICT_SCOPE=current_window
  • HARRIER_BENCH_VERDICT_GATE_RUNS=3
  • HARRIER_BENCH_LOOKUP_VERDICT_LIMIT=$HARRIER_BENCH_VERDICT_GATE_RUNS
  • HARRIER_BENCH_LOOKUP_VERDICT_MIN_RUNS=$HARRIER_BENCH_VERDICT_GATE_RUNS
  • HARRIER_BENCH_ONLY_OP=find_hit (focused single-op candidate loop)
  • HARRIER_BENCH_INCLUDE_CONTAINS=0
  • HARRIER_BENCH_INCLUDE_PREHASHED=0
  • HARRIER_BENCH_VERDICT_FAIL_ON_REJECT=1
  • HARRIER_BENCH_VERDICT_FAIL_ON_INSUFFICIENT_DATA=1
  • HARRIER_BENCH_VERDICT_FAIL_ON_INCONCLUSIVE=1
  • HARRIER_BENCH_VERDICT_FAIL_ON_VERDICT_DEGRADE=1
  • HARRIER_BENCH_VERDICT_FAIL_ON_NET_DROP=1
  • HARRIER_BENCH_VERDICT_NET_DROP_MIN=1
  • HARRIER_BENCH_VERDICT_FAIL_ON_PATH_SHIFT=1
  • HARRIER_BENCH_VERDICT_PATH_SHIFT_MAX=0
  • HARRIER_BENCH_CANDIDATE_LABEL=lookup-candidate-<git-short-sha> (unless explicitly overridden)

To run the same strict gate focused on a single implementation, use:

bash scripts/bench_lookup_candidate_gate_generic.sh
bash scripts/bench_lookup_candidate_gate_u64.sh

These wrappers set:

  • HARRIER_BENCH_ONLY_IMPL=harrier,hashbrown (generic wrapper) or HARRIER_BENCH_ONLY_IMPL=harrier_u64,hashbrown (u64 wrapper)
  • HARRIER_BENCH_VERDICT_GATE_RUNS=4
  • HARRIER_BENCH_LOOKUP_VERDICT_LIMIT=3
  • HARRIER_BENCH_LOOKUP_VERDICT_MIN_RUNS=3 (uses one extra run to produce a full 3 diff sidecars in fresh impl-scoped windows)
  • default labels:
    • lookup-candidate-generic-<git-short-sha>
    • lookup-candidate-u64-<git-short-sha>

To evaluate both impl-focused strict wrappers as one candidate step, use:

bash scripts/bench_lookup_candidate_dual_gate.sh

This runs:

  • bench_lookup_candidate_gate_generic.sh with label suffix -generic
  • bench_lookup_candidate_gate_u64.sh with label suffix -u64

and writes <stamp>-lookup-candidate-dual.tsv with per-scope runner status, combined verdict, net score, and path-shift runs. By default it runs both scopes and then fails if any scope wrapper failed (HARRIER_BENCH_CANDIDATE_REQUIRE_SUCCESS=1). Set HARRIER_BENCH_CANDIDATE_FAIL_FAST=1 to exit immediately on the first failed scope.
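
The interaction between the two knobs can be sketched as control flow. This is a hedged illustration, not the script itself: run_scopes and its closure parameter are hypothetical, and treating a fail-fast exit as an overall failure regardless of require_success is an assumption.

```rust
/// Run each scope in order and report (per-scope results, overall success).
/// `run` stands in for invoking one scope wrapper and returns true on success.
fn run_scopes<F>(
    scopes: &[&str],
    mut run: F,
    require_success: bool,
    fail_fast: bool,
) -> (Vec<(String, bool)>, bool)
where
    F: FnMut(&str) -> bool,
{
    let mut results = Vec::new();
    for &scope in scopes {
        let ok = run(scope);
        results.push((scope.to_string(), ok));
        if !ok && fail_fast {
            // Assumption: fail-fast stops immediately and reports failure.
            return (results, false);
        }
    }
    let all_ok = results.iter().all(|(_, ok)| *ok);
    // Without require_success, failures are recorded but do not fail the step.
    (results, all_ok || !require_success)
}
```

In the default configuration both scopes always run, so the report contains a row per scope even when the first one fails.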

For confirmation windows on top of the dual candidate gate, use:

bash scripts/bench_lookup_candidate_dual_recheck_gate.sh

Defaults:

  • HARRIER_BENCH_LOOKUP_DUAL_CONFIRM_WINDOWS=2
  • HARRIER_BENCH_LOOKUP_DUAL_REQUIRE_PASS=1
  • HARRIER_BENCH_CANDIDATE_LABEL=lookup-dual-confirm-<git-short-sha> (appends -w1, -w2, ...)
  • HARRIER_BENCH_LOOKUP_DUAL_RECHECK_REPORT_DIR=benchmarks/results (writes <stamp>-lookup-candidate-dual-recheck.tsv, which records both runner status (wrapper exit) and final dual status)

To summarize dual candidate and dual recheck sidecars, use:

python3 scripts/bench_lookup_dual_outcomes.py --kind all --limit 20

Useful filters:

  • --kind dual|dual_recheck|all
  • --label-filter <substring>
  • --limit <N> (-1 keeps all rows)
  • --view raw|latest_label|label_stats (latest_label / label_stats include latest git commit metadata)

For stricter promotion discipline, run confirmation windows that repeatedly invoke the strict lookup candidate gate and require combined_verdict=keep in each window:

bash scripts/bench_lookup_candidate_recheck_gate.sh

This helper defaults to:

  • HARRIER_BENCH_CANDIDATE_CONFIRM_WINDOWS=2
  • HARRIER_BENCH_CANDIDATE_REQUIRE_KEEP=1
  • HARRIER_BENCH_CANDIDATE_LABEL=lookup-candidate-confirm-<git-short-sha> (it appends -w1, -w2, ... per window)
  • HARRIER_BENCH_CANDIDATE_RECHECK_REPORT_DIR=benchmarks/results (writes <stamp>-lookup-candidate-recheck.tsv with per-window outcomes + final pass/fail status)

Useful overrides:

  • set HARRIER_BENCH_CANDIDATE_CONFIRM_WINDOWS higher for deeper confirmation.
  • set HARRIER_BENCH_CANDIDATE_REQUIRE_KEEP=0 to run confirmation windows without enforcing keep (diagnostic mode).

For a standardized pinned insert checkpoint gate (fixed hasher, insert_new, reuse+pretouch, phase timing + checkpoints, process repeats), use:

bash scripts/bench_insert_checkpoint_gate.sh

All knobs in this helper are also overridable via environment variables.

For an automated multi-run insert verdict gate that repeatedly executes the checkpoint helper and then prints the latest # insert_phase_verdict gate classification, use:

bash scripts/bench_insert_verdict_gate.sh

Useful knobs:

  • HARRIER_BENCH_INSERT_VERDICT_GATE_RUNS (default 3)
  • HARRIER_BENCH_INSERT_VERDICT_LIMIT (default 5; rolling repeat windows)
  • HARRIER_BENCH_INSERT_VERDICT_SCOPE (default current_window; set to matching_history to compute verdict from comparable historical windows)
  • HARRIER_BENCH_INSERT_VERDICT_LIMIT_ROWS (default 5)
  • HARRIER_BENCH_INSERT_VERDICT_FAIL_ON_BLOCK (default 0; set to 1 to exit non-zero when verdict is blocked/inconclusive)
  • HARRIER_BENCH_INSERT_VERDICT_FAIL_ON_UNKNOWN (default 0; set to 1 to fail when no insert phase gate could be extracted)
  • HARRIER_BENCH_CANDIDATE_LABEL (default empty; optional label recorded in verdict sidecars/log output for experiment traceability)
  • HARRIER_BENCH_PROCESS_REPEAT_EVENT_MODE / ..._SPIKE_RATIO / ..._MAD_Z / ..._QUANTILE_CUTOFF for persistence event detection mode

The helper also writes a sidecar verdict report next to the latest fixed TSV as <stamp>-fixed.insert_verdict.tsv, including verdict_scope and selected repeat-window counts, git_commit, candidate_label, and (when available) previous gate + gate delta. If no repeat sidecars are available/selected, it writes an unknown gate sidecar and honors HARRIER_BENCH_INSERT_VERDICT_FAIL_ON_UNKNOWN.

For strict insert-candidate gating (auto-fail on blocked or unknown gate), use:

bash scripts/bench_insert_candidate_gate.sh

This wrapper defaults to:

  • HARRIER_BENCH_INSERT_VERDICT_SCOPE=current_window
  • HARRIER_BENCH_INSERT_VERDICT_GATE_RUNS=3
  • HARRIER_BENCH_INSERT_VERDICT_LIMIT=$HARRIER_BENCH_INSERT_VERDICT_GATE_RUNS
  • HARRIER_BENCH_INSERT_VERDICT_FAIL_ON_BLOCK=1
  • HARRIER_BENCH_INSERT_VERDICT_FAIL_ON_UNKNOWN=1
  • HARRIER_BENCH_CANDIDATE_LABEL=insert-candidate-<git-short-sha> (unless explicitly overridden)

For insert-side confirmation windows (re-running strict insert candidate gates and requiring a specific insert_phase_gate each time), use:

bash scripts/bench_insert_candidate_recheck_gate.sh

This helper defaults to:

  • HARRIER_BENCH_INSERT_CANDIDATE_CONFIRM_WINDOWS=2
  • HARRIER_BENCH_INSERT_CANDIDATE_REQUIRE_GATE=allow_segment_targeting
  • HARRIER_BENCH_CANDIDATE_LABEL=insert-candidate-confirm-<git-short-sha> (it appends -w1, -w2, ... per window)
  • HARRIER_BENCH_INSERT_CANDIDATE_RECHECK_REPORT_DIR=benchmarks/results (writes <stamp>-insert-candidate-recheck.tsv with per-window gates + final pass/fail status)

Useful overrides:

  • set HARRIER_BENCH_INSERT_CANDIDATE_CONFIRM_WINDOWS higher for deeper confirmation windows.
  • set HARRIER_BENCH_INSERT_CANDIDATE_REQUIRE_GATE= (empty) to disable gate equality enforcement and use it as a diagnostics runner.

To inspect recent candidate verdict outcomes across lookup/insert sidecars in a single TSV summary, use:

python3 scripts/bench_candidate_outcomes.py --kind all --limit 30

Useful filters:

  • --kind lookup|insert|all
  • --label-filter <substring>
  • --limit <N> (-1 keeps all rows)
  • --view raw|latest_label|label_stats
  • --lookup-scope combined|harrier_combined|harrier|harrier_u64 (lookup rows can be summarized from any verdict scope, default combined)
    • latest_label collapses to newest sidecar per kind+candidate_label with a sample_count column, latest git_commit, and selected lookup scope (plus insert alignment/outlier-shape columns for insert sidecars).
    • label_stats reports per-label outcome counts (keep/reject/...) plus the latest outcome/commit/scope/path for each label (plus insert alignment/outlier-shape fields when available).

To build a cross-script candidate status board (lookup/insert verdicts, recheck outcomes, and dual gate outcomes) use:

python3 scripts/bench_candidate_status_board.py --limit 50

Useful filters:

  • --label-filter <substring>
  • --label-filter-exact (requires --label-filter; use exact label matches instead of substring matching)
  • --label-base-filter <substring>
  • --label-base-filter-exact (requires --label-base-filter; matches normalized base labels, stripping -recheck and timestamp suffixes; when available, eval/autotriage metadata candidate_label_base values are used)
  • --requested-label-filter <substring> (filters on u64_autotriage_requested_candidate_label; useful for confirmation-streak triage where executed labels may include -streakN)
  • --requested-label-filter-exact (requires --requested-label-filter)
  • --last-eval-label-filter <substring> (filters on u64_autotriage_last_eval_label)
  • --last-eval-label-filter-exact (requires --last-eval-label-filter)
  • --first-eval-outcome-filter <substring> (filters on u64_autotriage_first_eval_outcome)
  • --first-eval-outcome-filter-exact (requires --first-eval-outcome-filter)
  • --streak-actionable-before-failure-filter <substring> (filters on u64_autotriage_streak_actionable_before_failure)
  • --streak-actionable-before-failure-filter-exact (requires --streak-actionable-before-failure-filter)
  • --first-eval-actionable-then-failed-filter <substring> (filters on derived u64_autotriage_first_eval_actionable_then_failed=yes|no|unknown)
  • --first-eval-actionable-then-failed-filter-exact (requires --first-eval-actionable-then-failed-filter)
  • --latest-failure-family-filter <substring> (filters on derived u64_autotriage_latest_failure_family)
  • --latest-failure-family-filter-exact (requires --latest-failure-family-filter; combined pipeline/recheck lookup+low-load families accept both canonical ...+low_load_hit_geo_regression and legacy ...+pipeline_low_load_hit_geo_regression / ...+recheck_low_load_hit_geo_regression aliases)
  • --only-first-eval-actionable (keeps rows where u64_autotriage_first_eval_outcome=actionable)
  • --only-first-eval-lookup-not-keep (keeps rows where u64_autotriage_first_eval_outcome contains lookup_not_keep)
  • --only-first-eval-low-load-hit-geo-regression (keeps rows where u64_autotriage_first_eval_outcome contains low_load_hit_geo_regression)
  • --only-first-eval-failed (keeps rows where u64_autotriage_first_eval_outcome starts with pipeline_failed or actionable_recheck_failed)
  • --only-first-eval-actionable-recheck-failed (keeps rows where u64_autotriage_first_eval_outcome starts with actionable_recheck_failed)
  • --only-first-eval-actionable-recheck-failed-lookup-not-keep (keeps rows where u64_autotriage_first_eval_outcome=actionable_recheck_failed_lookup_not_keep)
  • --only-first-eval-actionable-recheck-failed-low-load-hit-geo-regression (keeps rows where u64_autotriage_first_eval_outcome=actionable_recheck_failed_low_load_hit_geo_regression)
  • --only-first-eval-actionable-recheck-failed-lookup-not-keep-and-low-load-hit-geo-regression (keeps rows where u64_autotriage_first_eval_outcome=actionable_recheck_failed_lookup_not_keep_and_low_load_hit_geo_regression)
  • --only-first-eval-pipeline-failed-lookup-not-keep (keeps rows where u64_autotriage_first_eval_outcome=pipeline_failed_lookup_not_keep)
  • --only-first-eval-pipeline-failed (keeps rows where u64_autotriage_first_eval_outcome starts with pipeline_failed)
  • --only-first-eval-pipeline-failed-low-load-hit-geo-regression (keeps rows where u64_autotriage_first_eval_outcome=pipeline_failed_low_load_hit_geo_regression)
  • --only-first-eval-pipeline-failed-lookup-not-keep-and-low-load-hit-geo-regression (keeps rows where u64_autotriage_first_eval_outcome=pipeline_failed_lookup_not_keep_and_low_load_hit_geo_regression)
  • --only-latest-failure-family-pipeline-lookup-not-keep (keeps rows where latest autotriage failure family indicates pipeline lookup-not-keep, including combined pipeline lookup+low-load families)
  • --only-latest-failure-family-pipeline-low-load-hit-geo-regression (keeps rows where latest autotriage failure family indicates pipeline low-load hit-geo regression, including combined pipeline lookup+low-load families)
  • --only-latest-failure-family-pipeline-lookup-not-keep-and-low-load-hit-geo-regression (keeps rows where latest autotriage failure family indicates combined pipeline lookup-not-keep + low-load hit-geo regression)
  • --only-latest-failure-family-recheck-lookup-not-keep (keeps rows where latest autotriage failure family indicates recheck lookup-not-keep, including combined recheck lookup+low-load families)
  • --only-latest-failure-family-recheck-low-load-hit-geo-regression (keeps rows where latest autotriage failure family indicates recheck low-load hit-geo regression, including combined recheck lookup+low-load families)
  • --only-latest-failure-family-recheck-lookup-not-keep-and-low-load-hit-geo-regression (keeps rows where latest autotriage failure family indicates combined recheck lookup-not-keep + low-load hit-geo regression)
  • --only-first-eval-actionable-then-failed-known (keeps rows where the derived first-eval-actionable-then-failed state is yes or no)
  • --only-first-eval-pipeline-verdict-gate-runs-completed-known (keeps rows where first_eval_outcome starts with pipeline_failed and autotriage pipeline verdict-gate runs-completed metadata is numeric)
  • --only-first-eval-pipeline-verdict-gate-runs-completed-unknown (keeps rows where first_eval_outcome starts with pipeline_failed and autotriage pipeline verdict-gate runs-completed metadata is unknown)
  • --only-first-eval-pipeline-verdict-gate-runs-completed-eq <N> (keeps rows where first_eval_outcome starts with pipeline_failed and autotriage pipeline verdict-gate runs-completed equals <N>)
  • --only-first-eval-pipeline-verdict-gate-runs-completed-min <N> (keeps rows where first_eval_outcome starts with pipeline_failed and autotriage pipeline verdict-gate runs-completed is >= <N>)
  • --only-first-eval-pipeline-verdict-gate-runs-completed-max <N> (keeps rows where first_eval_outcome starts with pipeline_failed and autotriage pipeline verdict-gate runs-completed is <= <N>)
  • --only-first-eval-recheck-verdict-gate-runs-completed-known (keeps rows where first_eval_outcome starts with actionable_recheck_failed and autotriage recheck verdict-gate runs-completed metadata is numeric)
  • --only-first-eval-recheck-verdict-gate-runs-completed-unknown (keeps rows where first_eval_outcome starts with actionable_recheck_failed and autotriage recheck verdict-gate runs-completed metadata is unknown)
  • --only-first-eval-recheck-verdict-gate-runs-completed-eq <N> (keeps rows where first_eval_outcome starts with actionable_recheck_failed and autotriage recheck verdict-gate runs-completed equals <N>)
  • --only-first-eval-recheck-verdict-gate-runs-completed-min <N> (keeps rows where first_eval_outcome starts with actionable_recheck_failed and autotriage recheck verdict-gate runs-completed is >= <N>)
  • --only-first-eval-recheck-verdict-gate-runs-completed-max <N> (keeps rows where first_eval_outcome starts with actionable_recheck_failed and autotriage recheck verdict-gate runs-completed is <= <N>)
  • --only-first-eval-actionable-then-failed-unknown (keeps rows where the derived first-eval-actionable-then-failed state is unknown)
  • --only-first-eval-actionable-then-failed-no (keeps rows where the derived first-eval-actionable-then-failed state is no)
  • --only-first-eval-actionable-then-failed-yes (keeps rows where the derived first-eval-actionable-then-failed state is yes)
  • --only-first-eval-actionable-then-failed-pipeline-verdict-gate-early-stopped (keeps rows where derived first-eval-actionable-then-failed is yes and autotriage pipeline verdict-gate metadata reports early-stop)
  • --only-first-eval-actionable-then-failed-pipeline-verdict-gate-early-stopped-inconclusive (keeps rows where derived first-eval-actionable-then-failed is yes and autotriage pipeline verdict-gate early-stop verdict is inconclusive)
  • --only-first-eval-actionable-then-failed-pipeline-verdict-gate-early-stopped-reject (keeps rows where derived first-eval-actionable-then-failed is yes and autotriage pipeline verdict-gate early-stop verdict is reject)
  • --only-first-eval-actionable-then-failed-pipeline-verdict-gate-early-stopped-unknown (keeps rows where derived first-eval-actionable-then-failed is yes, the pipeline verdict-gate reports early-stop, and the early-stop verdict is missing/unknown)
  • --only-first-eval-actionable-then-failed-recheck-verdict-gate-early-stopped (keeps rows where derived first-eval-actionable-then-failed is yes and autotriage recheck verdict-gate metadata reports early-stop)
  • --only-first-eval-actionable-then-failed-recheck-verdict-gate-early-stopped-inconclusive (keeps rows where derived first-eval-actionable-then-failed is yes and autotriage recheck verdict-gate early-stop verdict is inconclusive)
  • --only-first-eval-actionable-then-failed-recheck-verdict-gate-early-stopped-reject (keeps rows where derived first-eval-actionable-then-failed is yes and autotriage recheck verdict-gate early-stop verdict is reject)
  • --only-first-eval-actionable-then-failed-recheck-verdict-gate-early-stopped-unknown (keeps rows where derived first-eval-actionable-then-failed is yes, the recheck verdict-gate reports early-stop, and the early-stop verdict is missing/unknown)
  • --only-first-eval-actionable-then-failed-pipeline-verdict-gate-runs-completed-known (keeps rows where derived first-eval-actionable-then-failed is yes and autotriage pipeline verdict-gate runs-completed is numeric)
  • --only-first-eval-actionable-then-failed-pipeline-verdict-gate-runs-completed-unknown (keeps rows where derived first-eval-actionable-then-failed is yes and autotriage pipeline verdict-gate runs-completed is unknown)
  • --only-first-eval-actionable-then-failed-pipeline-verdict-gate-runs-completed-eq <N> (keeps rows where derived first-eval-actionable-then-failed is yes and autotriage pipeline verdict-gate runs-completed equals <N>)
  • --only-first-eval-actionable-then-failed-recheck-verdict-gate-runs-completed-known (keeps rows where derived first-eval-actionable-then-failed is yes and autotriage recheck verdict-gate runs-completed is numeric)
  • --only-first-eval-actionable-then-failed-recheck-verdict-gate-runs-completed-unknown (keeps rows where derived first-eval-actionable-then-failed is yes and autotriage recheck verdict-gate runs-completed is unknown)
  • --only-first-eval-actionable-then-failed-recheck-verdict-gate-runs-completed-eq <N> (keeps rows where derived first-eval-actionable-then-failed is yes and autotriage recheck verdict-gate runs-completed equals <N>)
  • --only-first-eval-actionable-then-failed-pipeline-verdict-gate-runs-completed-min <N> (keeps rows where derived first-eval-actionable-then-failed is yes and autotriage pipeline verdict-gate runs-completed is >= <N>)
  • --only-first-eval-actionable-then-failed-pipeline-verdict-gate-runs-completed-max <N> (keeps rows where derived first-eval-actionable-then-failed is yes and autotriage pipeline verdict-gate runs-completed is <= <N>)
  • --only-first-eval-actionable-then-failed-recheck-verdict-gate-runs-completed-min <N> (keeps rows where derived first-eval-actionable-then-failed is yes and autotriage recheck verdict-gate runs-completed is >= <N>)
  • --only-first-eval-actionable-then-failed-recheck-verdict-gate-runs-completed-max <N> (keeps rows where derived first-eval-actionable-then-failed is yes and autotriage recheck verdict-gate runs-completed is <= <N>)
  • --only-pipeline-early-stop-unknown (keeps rows where autotriage pipeline verdict-gate metadata reports early_stopped=yes and the recorded early-stop verdict is unknown/missing)
  • --only-recheck-early-stop-unknown (keeps rows where autotriage recheck verdict-gate metadata reports early_stopped=yes and the recorded early-stop verdict is unknown/missing)
  • --only-pipeline-verdict-gate-runs-completed-known (keeps rows where autotriage pipeline verdict-gate runs-completed metadata is numeric)
  • --only-pipeline-verdict-gate-runs-completed-unknown (keeps rows where autotriage pipeline verdict-gate runs-completed metadata is unknown/missing)
  • --only-recheck-verdict-gate-runs-completed-known (keeps rows where autotriage recheck verdict-gate runs-completed metadata is numeric)
  • --only-recheck-verdict-gate-runs-completed-unknown (keeps rows where autotriage recheck verdict-gate runs-completed metadata is unknown/missing)
  • --only-pipeline-verdict-gate-runs-completed-eq <N> (keeps rows where autotriage pipeline verdict-gate runs-completed metadata equals <N>)
  • --only-pipeline-verdict-gate-runs-completed-min <N> (keeps rows where autotriage pipeline verdict-gate runs-completed metadata is >= <N>)
  • --only-pipeline-verdict-gate-runs-completed-max <N> (keeps rows where autotriage pipeline verdict-gate runs-completed metadata is <= <N>)
  • --only-recheck-verdict-gate-runs-completed-eq <N> (keeps rows where autotriage recheck verdict-gate runs-completed metadata equals <N>)
  • --only-recheck-verdict-gate-runs-completed-min <N> (keeps rows where autotriage recheck verdict-gate runs-completed metadata is >= <N>)
  • --only-recheck-verdict-gate-runs-completed-max <N> (keeps rows where autotriage recheck verdict-gate runs-completed metadata is <= <N>)
  • --limit <N> (-1 keeps all rows)
  • --promotion-filter all|lookup_actionable|insert_actionable|any_actionable (uses derived lookup_promotion_status / insert_promotion_status columns to quickly focus on candidate labels that are promotable vs blocked)
  • --promotion-filter strict_lookup_ready (requires lookup=keep plus a passed dual-gate row; surfaces stricter promotion-ready labels via lookup_strict_promotion_status)
  • --promotion-filter u64_pipeline_failed|u64_pipeline_low_load_hit_geo_regressed|u64_pipeline_low_load_skipped|u64_pipeline_low_load_skipped_lookup_not_keep|u64_pipeline_lookup_gate_early_stopped|u64_pipeline_lookup_gate_early_stopped_inconclusive|u64_pipeline_lookup_gate_early_stopped_reject|u64_pipeline_lookup_gate_early_stopped_unknown|u64_pipeline_lookup_gate_runs_completed_known|u64_pipeline_lookup_gate_runs_completed_unknown|u64_pipeline_lookup_gate_runs_completed_eq3|u64_pipeline_lookup_gate_runs_completed_eq4|u64_pipeline_lookup_gate_runs_completed_in_range|u64_pipeline_actionable|u64_pipeline_actionable_confirmed (filters labels by the latest *-u64-candidate-pipeline.tsv outcome metadata; u64_pipeline_low_load_skipped surfaces rows where pipeline low-load stage was skipped, and u64_pipeline_low_load_skipped_lookup_not_keep narrows this to explicit lookup-verdict-not-keep skip reasons recorded in u64_pipeline_low_load_skip_reason; u64_pipeline_lookup_gate_early_stopped surfaces rows where lookup verdict gate stopped before exhausting requested gate runs, with ..._inconclusive/..._reject/..._unknown as direct splits by recorded early-stop verdict; status-board rows also expose u64_pipeline_lookup_verdict_gate_runs_completed, u64_pipeline_lookup_verdict_gate_early_stopped, and u64_pipeline_lookup_verdict_gate_early_stop_verdict from the underlying lookup verdict sidecar metadata; u64_pipeline_lookup_gate_runs_completed_* promotion filters slice this same run-count metadata into known/unknown/eq3/eq4/in-range cohorts (the _in_range variant requires --only-pipeline-verdict-gate-runs-completed-min and/or --only-pipeline-verdict-gate-runs-completed-max); u64_pipeline_actionable requires an explicit non-regressed low-load hit-geo signal plus lookup_combined_verdict=keep; u64_pipeline_actionable_confirmed additionally requires latest u64_eval_outcome=actionable)
  • --promotion-filter u64_eval_failed|u64_eval_actionable|u64_eval_actionable_confirmed|u64_eval_actionable_recheck_failed|u64_eval_actionable_unconfirmed|u64_eval_pipeline_failed_lookup_not_keep|u64_eval_pipeline_failed_low_load_hit_geo_regression|u64_eval_pipeline_failed_lookup_not_keep_and_low_load_hit_geo_regression|u64_eval_actionable_recheck_failed_lookup_not_keep|u64_eval_actionable_recheck_failed_low_load_hit_geo_regression|u64_eval_actionable_recheck_failed_lookup_not_keep_and_low_load_hit_geo_regression|u64_eval_pipeline_lookup_not_keep|u64_eval_pipeline_low_load_hit_geo_regression|u64_eval_recheck_pipeline_failed|u64_eval_recheck_pipeline_lookup_not_keep|u64_eval_recheck_pipeline_low_load_hit_geo_regression|u64_autotriage_failed|u64_autotriage_actionable|u64_autotriage_actionable_confirmed|u64_autotriage_confirmation_complete|u64_autotriage_confirmation_incomplete|u64_autotriage_confirmation_multi_run|u64_autotriage_confirmation_floor_applied|u64_autotriage_confirmation_floor_not_applied|u64_autotriage_first_eval_actionable|u64_autotriage_first_eval_actionable_recheck_failed|u64_autotriage_first_eval_actionable_recheck_failed_lookup_not_keep|u64_autotriage_first_eval_actionable_recheck_failed_low_load_hit_geo_regression|u64_autotriage_first_eval_pipeline_failed_lookup_not_keep|u64_autotriage_first_eval_pipeline_failed_low_load_hit_geo_regression|u64_autotriage_first_eval_actionable_then_failed|u64_autotriage_first_eval_actionable_then_failed_yes|u64_autotriage_first_eval_actionable_then_failed_known|u64_autotriage_first_eval_actionable_then_failed_unknown|u64_autotriage_first_eval_actionable_then_failed_no|u64_autotriage_streak_actionable_before_failure|u64_autotriage_failed_lookup_not_keep|u64_autotriage_failed_low_load_hit_geo_regression|u64_autotriage_failed_lookup_not_keep_and_low_load_hit_geo_regression|u64_autotriage_failed_pipeline_lookup_not_keep|u64_autotriage_failed_pipeline_low_load_hit_geo_regression|u64_autotriage_failed_pipeline_lookup_not_keep_and_low_load_hit_geo_regression|u64_autotriage_failed_recheck_lookup_not_keep|u64_autotriage_failed_recheck_low_load_hit_geo_regression|u64_autotriage_failed_recheck_lookup_not_keep_and_low_load_hit_geo_regression (additional first-eval split promotion filters: u64_autotriage_first_eval_failed, u64_autotriage_first_eval_lookup_not_keep, u64_autotriage_first_eval_low_load_hit_geo_regression, u64_autotriage_first_eval_pipeline_failed, u64_autotriage_first_eval_pipeline_verdict_gate_early_stopped, u64_autotriage_first_eval_pipeline_verdict_gate_early_stopped_inconclusive, u64_autotriage_first_eval_pipeline_verdict_gate_early_stopped_reject, u64_autotriage_first_eval_pipeline_verdict_gate_early_stopped_unknown, u64_autotriage_first_eval_pipeline_verdict_gate_runs_completed_known, u64_autotriage_first_eval_pipeline_verdict_gate_runs_completed_unknown, u64_autotriage_first_eval_pipeline_verdict_gate_runs_completed_eq3, u64_autotriage_first_eval_pipeline_verdict_gate_runs_completed_eq4, u64_autotriage_first_eval_pipeline_verdict_gate_runs_completed_in_range, u64_autotriage_first_eval_recheck_pipeline_verdict_gate_early_stopped, u64_autotriage_first_eval_recheck_pipeline_verdict_gate_early_stopped_inconclusive, u64_autotriage_first_eval_recheck_pipeline_verdict_gate_early_stopped_reject, u64_autotriage_first_eval_recheck_pipeline_verdict_gate_early_stopped_unknown, u64_autotriage_first_eval_recheck_pipeline_verdict_gate_runs_completed_known, u64_autotriage_first_eval_recheck_pipeline_verdict_gate_runs_completed_unknown, u64_autotriage_first_eval_recheck_pipeline_verdict_gate_runs_completed_eq3, u64_autotriage_first_eval_recheck_pipeline_verdict_gate_runs_completed_eq4, u64_autotriage_first_eval_recheck_pipeline_verdict_gate_runs_completed_in_range, u64_autotriage_first_eval_actionable_recheck_failed_verdict_gate_early_stopped, u64_autotriage_first_eval_actionable_recheck_failed_verdict_gate_early_stopped_inconclusive, 
u64_autotriage_first_eval_actionable_recheck_failed_verdict_gate_early_stopped_reject, u64_autotriage_first_eval_actionable_recheck_failed_verdict_gate_early_stopped_unknown, u64_autotriage_first_eval_actionable_recheck_failed_verdict_gate_runs_completed_known, u64_autotriage_first_eval_actionable_recheck_failed_verdict_gate_runs_completed_unknown, u64_autotriage_first_eval_actionable_recheck_failed_verdict_gate_runs_completed_eq3, u64_autotriage_first_eval_actionable_recheck_failed_verdict_gate_runs_completed_eq4, u64_autotriage_first_eval_actionable_recheck_failed_verdict_gate_runs_completed_in_range, u64_autotriage_first_eval_actionable_then_failed_pipeline_verdict_gate_runs_completed_in_range, u64_autotriage_first_eval_actionable_then_failed_recheck_verdict_gate_runs_completed_in_range, u64_autotriage_first_eval_actionable_then_failed_pipeline_verdict_gate_early_stopped, u64_autotriage_first_eval_actionable_then_failed_pipeline_verdict_gate_early_stopped_inconclusive, u64_autotriage_first_eval_actionable_then_failed_pipeline_verdict_gate_early_stopped_reject, u64_autotriage_first_eval_actionable_then_failed_pipeline_verdict_gate_early_stopped_unknown, u64_autotriage_first_eval_actionable_then_failed_recheck_verdict_gate_early_stopped, u64_autotriage_first_eval_actionable_then_failed_recheck_verdict_gate_early_stopped_inconclusive, u64_autotriage_first_eval_actionable_then_failed_recheck_verdict_gate_early_stopped_reject, u64_autotriage_first_eval_actionable_then_failed_recheck_verdict_gate_early_stopped_unknown, u64_autotriage_first_eval_actionable_recheck_failed_lookup_not_keep_and_low_load_hit_geo_regression and u64_autotriage_first_eval_pipeline_failed_lookup_not_keep_and_low_load_hit_geo_regression) (additional autotriage early-stop promotion filters: u64_autotriage_pipeline_verdict_gate_early_stopped, u64_autotriage_pipeline_verdict_gate_early_stopped_inconclusive, u64_autotriage_pipeline_verdict_gate_early_stopped_reject, 
u64_autotriage_pipeline_verdict_gate_early_stopped_unknown, u64_autotriage_recheck_pipeline_verdict_gate_early_stopped, u64_autotriage_recheck_pipeline_verdict_gate_early_stopped_inconclusive, u64_autotriage_recheck_pipeline_verdict_gate_early_stopped_reject, u64_autotriage_recheck_pipeline_verdict_gate_early_stopped_unknown, u64_autotriage_pipeline_verdict_gate_runs_completed_known, u64_autotriage_pipeline_verdict_gate_runs_completed_unknown, u64_autotriage_pipeline_verdict_gate_runs_completed_eq3, u64_autotriage_pipeline_verdict_gate_runs_completed_eq4, u64_autotriage_pipeline_verdict_gate_runs_completed_in_range, u64_autotriage_recheck_pipeline_verdict_gate_runs_completed_known, u64_autotriage_recheck_pipeline_verdict_gate_runs_completed_unknown, u64_autotriage_recheck_pipeline_verdict_gate_runs_completed_eq3, u64_autotriage_recheck_pipeline_verdict_gate_runs_completed_eq4, and u64_autotriage_recheck_pipeline_verdict_gate_runs_completed_in_range) (filters labels by the latest *-u64-candidate-eval.tsv outcome metadata, including automatic actionable recheck outcomes; u64_eval_failed also includes actionable_unconfirmed and status_board_missing_label; the u64_eval_pipeline_* and u64_eval_recheck_pipeline_* filters classify eval rows by first-pass vs recheck pipeline reasons, using eval sidecar row-count columns (with final-message fallback). This includes u64_eval_pipeline_low_load_skipped and u64_eval_pipeline_low_load_skipped_lookup_not_keep, sourced from eval-side pipeline skip snapshot row counts, plus u64_eval_pipeline_verdict_gate_early_stopped* and u64_eval_recheck_pipeline_verdict_gate_early_stopped* filters sourced from eval-side verdict-gate metadata columns. Additional scoped variants u64_eval_pipeline_failed_verdict_gate_early_stopped* and u64_eval_actionable_recheck_failed_verdict_gate_early_stopped* narrow those early-stop filters to pipeline_failed* and actionable_recheck_failed* outcomes respectively. 
Run-count scoped variants (u64_eval_pipeline_failed_verdict_gate_runs_completed_{known,unknown,eq3,eq4,in_range} and u64_eval_actionable_recheck_failed_verdict_gate_runs_completed_{known,unknown,eq3,eq4,in_range}) provide matching filters over eval-side verdict-gate run-count metadata; the _in_range variants require at least one corresponding bound flag: --only-eval-pipeline-failed-verdict-gate-runs-completed-{min,max} or --only-eval-actionable-recheck-failed-verdict-gate-runs-completed-{min,max}; status-board rows now also include u64_autotriage_latest_confirmed_actionable_rows and u64_autotriage_latest_failure_family, and the u64_autotriage_failed_* filters (including pipeline-only and recheck-only variants) consult that family value so legacy rows are classified consistently; rows also include u64_autotriage_confirmation_runs, u64_autotriage_confirmation_runs_requested, u64_autotriage_min_confirmation_runs_on_success, u64_autotriage_completed_eval_runs, u64_autotriage_last_eval_label, u64_autotriage_requested_candidate_label, and u64_autotriage_latest_candidate_label metadata from autotriage sidecars, plus run-sequence diagnostics: u64_autotriage_first_eval_label, u64_autotriage_first_eval_outcome, u64_autotriage_first_eval_exit_status, u64_autotriage_first_eval_actionable_then_failed, u64_autotriage_streak_actionable_before_failure, and u64_autotriage_run_outcome_sequence; rows also include latest eval verdict-gate metadata from autotriage sidecars: u64_autotriage_pipeline_lookup_verdict_gate_runs_completed, u64_autotriage_pipeline_lookup_verdict_gate_early_stopped, u64_autotriage_pipeline_lookup_verdict_gate_early_stop_verdict, u64_autotriage_recheck_pipeline_lookup_verdict_gate_runs_completed, u64_autotriage_recheck_pipeline_lookup_verdict_gate_early_stopped, and u64_autotriage_recheck_pipeline_lookup_verdict_gate_early_stop_verdict. 
The u64_autotriage_first_eval_actionable, u64_autotriage_first_eval_failed, u64_autotriage_first_eval_lookup_not_keep, u64_autotriage_first_eval_low_load_hit_geo_regression, u64_autotriage_first_eval_actionable_recheck_failed, u64_autotriage_first_eval_actionable_recheck_failed_lookup_not_keep, u64_autotriage_first_eval_actionable_recheck_failed_low_load_hit_geo_regression, u64_autotriage_first_eval_actionable_recheck_failed_lookup_not_keep_and_low_load_hit_geo_regression, u64_autotriage_first_eval_pipeline_failed, u64_autotriage_first_eval_pipeline_failed_lookup_not_keep, u64_autotriage_first_eval_pipeline_failed_lookup_not_keep_and_low_load_hit_geo_regression, and u64_autotriage_first_eval_actionable_then_failed promotion filters are useful for isolating streak-instability candidates where run 1 looked promotable but later confirmation runs failed. The first-eval pipeline/recheck verdict-gate early-stop filters provide the same view with explicit inconclusive vs reject splits while keeping first-eval scope constraints (pipeline_failed* vs actionable_recheck_failed*) intact. The first-eval run-count promotion filters provide quick known/eq/range splits for first-pass (pipeline_failed*) and recheck (actionable_recheck_failed*) verdict-gate metadata directly on status-board rows. The _in_range variants require at least one matching min/max flag for the corresponding row filter family: --only-first-eval-pipeline-verdict-gate-runs-completed-{min,max}, --only-first-eval-recheck-verdict-gate-runs-completed-{min,max}, --only-first-eval-actionable-then-failed-pipeline-verdict-gate-runs-completed-{min,max}, or --only-first-eval-actionable-then-failed-recheck-verdict-gate-runs-completed-{min,max}. The _yes, _known, _unknown, and _no variants help separate first-eval instability rows from legacy rows lacking first-eval telemetry and from rows that are instrumented but stable. 
Additional u64_autotriage_first_eval_actionable_then_failed_{pipeline,recheck}_verdict_gate_runs_completed_{known,unknown,eq3,eq4,in_range} filters narrow that instability cohort to rows with numeric verdict-gate run-count metadata, explicit unknown metadata buckets, exact =3/=4 completion, or configured min/max ranges for first-pass vs recheck confirmation lanes. Additional u64_autotriage_first_eval_actionable_then_failed_{pipeline,recheck}_verdict_gate_early_stopped{,_inconclusive,_reject} filters narrow that same instability cohort to rows where pipeline/recheck verdict gates early-stopped, with optional explicit early-stop verdict splits. The u64_autotriage_*_verdict_gate_early_stopped* filters are useful for splitting those first-eval failures by recorded early-stop verdict class (inconclusive vs reject) without leaving status-board workflows. Rows also include a derived u64_autotriage_confirmation_state column (single_run_complete, single_run_incomplete, multi_run_complete, multi_run_incomplete) and u64_autotriage_confirmation_floor_applied (yes/no/unknown) (u64_autotriage_confirmation_incomplete catches rows where completed runs are fewer than requested confirmation runs))
  • --max-age-hours <N> (optional recency filter based on sidecar timestamp prefixes)
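
The filters above come in two recurring shapes: substring filters with an optional -exact companion, and run-count filters with known/unknown/eq/min/max variants. The following is a minimal Python sketch of how such predicates could behave. The function names are hypothetical and are not taken from the status-board script, and the choice to exclude unknown run counts from min/max ranges is an assumption made for illustration.

```python
from typing import Optional


def label_matches(value: str, needle: str, exact: bool = False) -> bool:
    """Substring match by default; full-string equality when exact is set."""
    return value == needle if exact else needle in value


def parse_runs_completed(raw: str) -> Optional[int]:
    """Return the run count, or None when the metadata is unknown/missing."""
    raw = raw.strip()
    return int(raw) if raw.isdigit() else None


def runs_completed_in_range(
    raw: str, lo: Optional[int] = None, hi: Optional[int] = None
) -> bool:
    """Keep a row only if its run count is numeric and within [lo, hi].

    Assumption: rows with unknown run counts never satisfy a range filter.
    """
    n = parse_runs_completed(raw)
    if n is None:
        return False
    if lo is not None and n < lo:
        return False
    if hi is not None and n > hi:
        return False
    return True
```

Under these assumptions, a substring filter on a base label also matches streak-suffixed labels such as <label>-streak2, while the exact variant does not.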

To rank low-load regression candidates from status-board rows, use:

python3 scripts/bench_low_load_candidate_pick.py --limit 20

Useful controls:

  • --promotion-filter all|cycle_low_load_picker_score_below|cycle_low_load_picker_score_unknown|cycle_low_load_picker_score_below_threshold|cycle_low_load_picker_score_unknown_or_below_threshold
  • --cycle-low-load-picker-score-min <N> (used with score-threshold filters)
  • --inject-labels <csv> (force-include specific labels, useful for ensuring the just-produced low-load label appears in ranking output)
  • --strict/--no-strict (effective-load parity requirement)
  • --exclude-label-substrings <csv>
  • --max-age-hours <N>

scripts/bench_lookup_candidate_cycle.sh now forwards these controls through:

  • HARRIER_BENCH_LOW_LOAD_PICKER_PROMOTION_FILTER (auto by default; auto resolves to cycle_low_load_picker_score_unknown_or_below_threshold when HARRIER_BENCH_CYCLE_FAIL_ON_LOW_LOAD_PICKER_SCORE_BELOW=1, else all)
  • HARRIER_BENCH_CYCLE_LOW_LOAD_PICKER_SCORE_MIN
  • HARRIER_BENCH_RUN_LOOKUP_GATE (1 by default; set to 0 to skip the lookup gate stage and run only downstream cycle stages such as low-load curve + picker)
  • HARRIER_BENCH_LOW_LOAD_PICKER_INJECT_CURRENT_LABEL (1 by default; when enabled, automatically injects the current low-load-post-<candidate> label into picker ranking output even if promotion filters would exclude it)
  • HARRIER_BENCH_LOW_LOAD_PICKER_INJECT_LABELS (optional comma-separated additional labels to inject into picker ranking output)
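
The auto resolution described for HARRIER_BENCH_LOW_LOAD_PICKER_PROMOTION_FILTER can be sketched as below. This is an illustrative Python model, not the cycle script's actual shell code. The function name is hypothetical, and treating an unset HARRIER_BENCH_CYCLE_FAIL_ON_LOW_LOAD_PICKER_SCORE_BELOW as 0 is an assumption.

```python
import os
from typing import Mapping, Optional


def resolve_picker_promotion_filter(env: Optional[Mapping[str, str]] = None) -> str:
    """Resolve the picker promotion filter, expanding the 'auto' setting."""
    env = os.environ if env is None else env
    requested = env.get("HARRIER_BENCH_LOW_LOAD_PICKER_PROMOTION_FILTER", "auto")
    if requested != "auto":
        # An explicit filter name is passed through unchanged.
        return requested
    # Assumption: the fail-on-score-below toggle counts as off when unset.
    fail_on_below = env.get(
        "HARRIER_BENCH_CYCLE_FAIL_ON_LOW_LOAD_PICKER_SCORE_BELOW", "0"
    )
    if fail_on_below == "1":
        return "cycle_low_load_picker_score_unknown_or_below_threshold"
    return "all"
```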

For a u64-focused adaptive low-load cycle with these stricter picker defaults pre-wired, use:

bash scripts/bench_lookup_candidate_cycle_u64_low_load.sh

This wrapper defaults to:

  • HARRIER_BENCH_LOOKUP_SCOPE=u64
  • HARRIER_BENCH_CYCLE_ONLY_IMPL=harrier_u64,hashbrown
  • HARRIER_BENCH_CYCLE_CONTINUE_ON_LOOKUP_FAILURE=1
  • HARRIER_BENCH_CYCLE_FAIL_ON_LOOKUP_FAILURE=0
  • HARRIER_BENCH_RUN_INSERT_MONITOR=0
  • HARRIER_BENCH_RUN_LOW_LOAD_CURVE=1
  • HARRIER_BENCH_RUN_LOW_LOAD_PICKER=1
  • HARRIER_BENCH_LOW_LOAD_PICKER_PROMOTION_FILTER=auto
  • HARRIER_BENCH_CYCLE_FAIL_ON_LOW_LOAD_PICKER_SCORE_BELOW=1
  • HARRIER_BENCH_CYCLE_LOW_LOAD_PICKER_SCORE_MIN=15
  • HARRIER_BENCH_CYCLE_FAIL_ON_LOW_LOAD_HIT_GEO_REGRESSION=1
  • HARRIER_BENCH_CYCLE_LOW_LOAD_HIT_GEO_MIN=1.0
  • HARRIER_BENCH_RUN_STATUS_BOARD_SUMMARY=0

All wrapper defaults remain overridable via environment variables.
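
As a rough model of the override behaviour, the sketch below merges a subset of the wrapper defaults listed above with a caller-supplied environment. The function name is hypothetical, and the sketch assumes a variable counts as overridden whenever it is set; the wrapper itself is a shell script and may handle empty values differently.

```python
from typing import Dict, Mapping

# Subset of the wrapper defaults listed above.
WRAPPER_DEFAULTS: Dict[str, str] = {
    "HARRIER_BENCH_LOOKUP_SCOPE": "u64",
    "HARRIER_BENCH_CYCLE_ONLY_IMPL": "harrier_u64,hashbrown",
    "HARRIER_BENCH_RUN_INSERT_MONITOR": "0",
    "HARRIER_BENCH_RUN_LOW_LOAD_CURVE": "1",
    "HARRIER_BENCH_RUN_LOW_LOAD_PICKER": "1",
    "HARRIER_BENCH_CYCLE_LOW_LOAD_PICKER_SCORE_MIN": "15",
}


def effective_settings(env: Mapping[str, str]) -> Dict[str, str]:
    """Return each setting, preferring the caller's environment over the default."""
    return {name: env.get(name, default) for name, default in WRAPPER_DEFAULTS.items()}
```

For example, exporting HARRIER_BENCH_CYCLE_LOW_LOAD_PICKER_SCORE_MIN=20 before invoking the wrapper would, in this model, replace only that one default.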

To run a u64 candidate loop that first executes the strict u64 lookup gate and then runs the u64 low-load cycle with lookup reruns disabled, use:

bash scripts/bench_lookup_u64_candidate_pipeline.sh

This pipeline defaults to:

  • HARRIER_BENCH_PIPELINE_RUN_LOOKUP_GATE=1
  • HARRIER_BENCH_PIPELINE_RUN_LOW_LOAD_CYCLE=1
  • HARRIER_BENCH_PIPELINE_REQUIRE_LOOKUP_PASS_FOR_LOW_LOAD=1
  • HARRIER_BENCH_PIPELINE_REQUIRE_LOOKUP_KEEP=1
  • HARRIER_BENCH_PIPELINE_SKIP_LOW_LOAD_WHEN_LOOKUP_NOT_KEEP=1 (when lookup gate passes but combined verdict is not keep, skip low-load cycle stage to fail fast in strict loops)
  • HARRIER_BENCH_PIPELINE_FAIL_ON_LOW_LOAD_HIT_GEO_REGRESSION=1
  • HARRIER_BENCH_PIPELINE_VALIDATE_PROMOTION_PARITY=1 (default; validates compact promotion-filter parity via scripts/check_autotriage_promotion_parity.py before pipeline execution; set to 0 to skip this preflight guard)
  • low-load stage invocation via scripts/bench_lookup_candidate_cycle_u64_low_load.sh with HARRIER_BENCH_RUN_LOOKUP_GATE=0
  • report output: <stamp>-<candidate>-u64-candidate-pipeline.tsv including lookup verdict metadata and cycle low-load picker summary fields (low_load_skip_reason, lookup_verdict_gate_runs_completed, lookup_verdict_gate_early_stopped, lookup_verdict_gate_early_stop_verdict, cycle_low_load_picker_rows, top score, score-threshold flag, hit-geo ratio/regressed flag, etc.).

Note: when HARRIER_BENCH_PIPELINE_RUN_LOOKUP_GATE=0, the pipeline still runs the low-load cycle stage (the "require lookup pass" check applies only when the lookup gate stage is enabled).
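
The interaction between these pipeline flags can be summarized with a small decision function. This is a hedged Python sketch of the rules as described above, not the pipeline's implementation. The names PipelineFlags and should_run_low_load are hypothetical, and HARRIER_BENCH_PIPELINE_REQUIRE_LOOKUP_KEEP is left out because the text only ties it to whether non-keep verdicts may continue, not to stage selection.

```python
from dataclasses import dataclass


@dataclass
class PipelineFlags:
    run_lookup_gate: bool = True                     # ..._RUN_LOOKUP_GATE
    run_low_load_cycle: bool = True                  # ..._RUN_LOW_LOAD_CYCLE
    require_lookup_pass_for_low_load: bool = True    # ..._REQUIRE_LOOKUP_PASS_FOR_LOW_LOAD
    skip_low_load_when_lookup_not_keep: bool = True  # ..._SKIP_LOW_LOAD_WHEN_LOOKUP_NOT_KEEP


def should_run_low_load(flags: PipelineFlags, lookup_passed: bool, lookup_verdict: str) -> bool:
    """Decide whether the low-load cycle stage runs after the lookup gate."""
    if not flags.run_low_load_cycle:
        return False
    if not flags.run_lookup_gate:
        # With the gate stage disabled, the "require lookup pass" check does not apply.
        return True
    if flags.require_lookup_pass_for_low_load and not lookup_passed:
        return False
    if lookup_passed and flags.skip_low_load_when_lookup_not_keep and lookup_verdict != "keep":
        # Gate passed, but the combined verdict is not keep: fail fast.
        return False
    return True
```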

Diagnostic mode examples:

# Preview commands only
HARRIER_BENCH_PIPELINE_DRY_RUN=1 bash scripts/bench_lookup_u64_candidate_pipeline.sh

# Run low-load cycle even if lookup gate fails
HARRIER_BENCH_PIPELINE_REQUIRE_LOOKUP_PASS_FOR_LOW_LOAD=0 \
bash scripts/bench_lookup_u64_candidate_pipeline.sh

# Allow non-keep lookup verdicts to continue for diagnostics
HARRIER_BENCH_PIPELINE_REQUIRE_LOOKUP_KEEP=0 \
bash scripts/bench_lookup_u64_candidate_pipeline.sh

# Keep running low-load cycle even when lookup verdict is not keep
HARRIER_BENCH_PIPELINE_SKIP_LOW_LOAD_WHEN_LOOKUP_NOT_KEEP=0 \
bash scripts/bench_lookup_u64_candidate_pipeline.sh

To run the u64 pipeline plus automatic status-board triage snapshots for one candidate label, use:

bash scripts/bench_lookup_u64_candidate_eval.sh

This eval wrapper:

  • runs bench_lookup_u64_candidate_pipeline.sh
  • writes label-filtered status-board snapshots:
    • full label row(s)
    • u64_pipeline_failed filter
    • u64_pipeline_low_load_hit_geo_regressed filter
    • u64_pipeline_actionable filter
    • u64_pipeline_low_load_skipped filter
    • u64_pipeline_low_load_skipped_lookup_not_keep filter
  • after writing its eval sidecar, also writes post-eval snapshots:
    • u64_eval_failed
    • u64_eval_actionable
    • u64_eval_actionable_recheck_failed
    • u64_pipeline_actionable_confirmed
    • u64_eval_pipeline_lookup_not_keep
    • u64_eval_pipeline_low_load_hit_geo_regression
    • u64_eval_recheck_pipeline_lookup_not_keep
    • u64_eval_recheck_pipeline_low_load_hit_geo_regression
  • writes <stamp>-<candidate>-u64-candidate-eval.tsv linking all produced sidecars, row counts, and an eval_outcome classification (actionable, blocked_low_load_hit_geo_regressed, blocked_pipeline_failed, no_actionable_signal, pipeline_failed, pipeline_failed_lookup_not_keep, pipeline_failed_lookup_not_keep_and_low_load_hit_geo_regression, pipeline_failed_low_load_hit_geo_regression, actionable_recheck_failed, actionable_recheck_failed_lookup_not_keep, actionable_recheck_failed_lookup_not_keep_and_low_load_hit_geo_regression, actionable_recheck_failed_low_load_hit_geo_regression, actionable_unconfirmed, status_board_missing_label, or dry_run). The eval TSV also records pipeline/recheck final metadata fields (pipeline_final_status, pipeline_final_message, recheck_pipeline_final_status, recheck_pipeline_final_message) plus pipeline low-load skip snapshots/row counts (status_board_u64_pipeline_low_load_skipped_*) to speed triage. The metadata also includes candidate_label_base (defaults to the candidate label, but can be overridden by callers for grouped streak runs).
  • defaults to exact label matching for status-board snapshots (HARRIER_BENCH_EVAL_LABEL_FILTER_EXACT=1), so <label>-recheck rows do not leak into base-label triage snapshots.
  • by default (HARRIER_BENCH_EVAL_RECHECK_ON_ACTIONABLE=1), if the first pass classifies a candidate as actionable, the eval wrapper automatically runs a second pipeline pass with label suffix HARRIER_BENCH_EVAL_RECHECK_LABEL_SUFFIX (default -recheck) and downgrades the final result to actionable_recheck_failed if that recheck fails.
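
The last two defaults above (exact label matching and the automatic actionable recheck) can be illustrated with a minimal Python sketch. It is not taken from the wrapper scripts: the helper names `label_matches`, `recheck_label`, and `final_outcome` and the sample labels are hypothetical. Only the default suffix `-recheck` and the outcome names `actionable` and `actionable_recheck_failed` come from the description above.

```python
def label_matches(row_label: str, candidate_label: str, exact: bool = True) -> bool:
    """Hypothetical label filter.

    exact=True models HARRIER_BENCH_EVAL_LABEL_FILTER_EXACT=1 (the default);
    exact=False models the substring matching selected by setting it to 0.
    """
    if exact:
        return row_label == candidate_label
    return candidate_label in row_label


def recheck_label(candidate_label: str, suffix: str = "-recheck") -> str:
    """Label used for the second pipeline pass (default suffix per the README)."""
    return candidate_label + suffix


def final_outcome(first_outcome: str, recheck_passed: bool) -> str:
    """Downgrade an actionable first pass when the recheck fails.

    Simplification: the more specific actionable_recheck_failed_* variants
    listed above are not modeled here.
    """
    if first_outcome == "actionable" and not recheck_passed:
        return "actionable_recheck_failed"
    return first_outcome


# Hypothetical status-board labels for one candidate plus its recheck pass.
rows = ["u64-my-candidate", recheck_label("u64-my-candidate"), "u64-other"]

exact_rows = [r for r in rows if label_matches(r, "u64-my-candidate", exact=True)]
substring_rows = [r for r in rows if label_matches(r, "u64-my-candidate", exact=False)]

print(exact_rows)      # only the base label
print(substring_rows)  # base label plus the leaked -recheck row
```

With substring matching the `-recheck` row is picked up alongside the base label, which is the leak that the exact-matching default avoids.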

Useful controls:

  • HARRIER_BENCH_EVAL_CONTINUE_ON_PIPELINE_FAILURE=1 (default)
  • HARRIER_BENCH_EVAL_DRY_RUN=1
  • HARRIER_BENCH_EVAL_RUN_STATUS_BOARD=0
  • HARRIER_BENCH_EVAL_RUN_POST_EVAL_STATUS_BOARD=0 (skip post-eval u64_eval_* snapshots)
  • HARRIER_BENCH_EVAL_REQUIRE_ACTIONABLE=1 (fail unless eval_outcome=actionable)
  • HARRIER_BENCH_EVAL_REQUIRE_CONFIRMED_ACTIONABLE=1 (default; when outcome is actionable, additionally requires at least one u64_pipeline_actionable_confirmed row in post-eval snapshots; in live mode this also requires HARRIER_BENCH_EVAL_RUN_STATUS_BOARD=1 and HARRIER_BENCH_EVAL_RUN_POST_EVAL_STATUS_BOARD=1)
  • HARRIER_BENCH_EVAL_REQUIRE_LABEL_ROW=1 (default; in live mode requires at least one exact-label status-board row, otherwise sets eval_outcome=status_board_missing_label; requires HARRIER_BENCH_EVAL_RUN_STATUS_BOARD=1)
  • HARRIER_BENCH_EVAL_LABEL_FILTER_EXACT=0 (revert to substring label matching for snapshots)
  • HARRIER_BENCH_EVAL_RECHECK_ON_ACTIONABLE=0 (disable automatic actionable recheck)
  • HARRIER_BENCH_EVAL_RECHECK_LABEL_SUFFIX=-recheck
  • HARRIER_BENCH_EVAL_VALIDATE_PROMOTION_PARITY=1 (default; validates compact promotion-filter parity via scripts/check_autotriage_promotion_parity.py before pipeline/eval execution; set to 0 to skip this preflight guard)
  • HARRIER_BENCH_EVAL_CANDIDATE_LABEL_BASE=<label-base> (optional metadata override used by wrappers to keep grouped eval streaks under one base label)
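
Some of these controls depend on each other, so a quick preflight check can catch conflicting settings before a long run. The sketch below is a hypothetical helper, not part of the repository. It assumes that "live mode" means HARRIER_BENCH_EVAL_DRY_RUN is unset or 0, and that HARRIER_BENCH_EVAL_RUN_STATUS_BOARD and HARRIER_BENCH_EVAL_RUN_POST_EVAL_STATUS_BOARD default to 1. The README lists only their =0 forms, so those defaults are an inference.

```python
def flag(env: dict, name: str, default: str) -> bool:
    """Read a 0/1 control from an environment-like mapping."""
    return env.get(name, default) == "1"


def control_conflicts(env: dict) -> list:
    """Return descriptions of conflicting eval controls (hypothetical helper)."""
    dry_run = flag(env, "HARRIER_BENCH_EVAL_DRY_RUN", "0")  # assumed default
    status_board = flag(env, "HARRIER_BENCH_EVAL_RUN_STATUS_BOARD", "1")  # assumed default
    post_eval = flag(env, "HARRIER_BENCH_EVAL_RUN_POST_EVAL_STATUS_BOARD", "1")  # assumed default
    require_confirmed = flag(env, "HARRIER_BENCH_EVAL_REQUIRE_CONFIRMED_ACTIONABLE", "1")  # documented default
    require_label_row = flag(env, "HARRIER_BENCH_EVAL_REQUIRE_LABEL_ROW", "1")  # documented default

    conflicts = []
    if dry_run:
        # The documented requirements are stated for live mode only.
        return conflicts
    if require_confirmed and not (status_board and post_eval):
        conflicts.append(
            "REQUIRE_CONFIRMED_ACTIONABLE=1 needs RUN_STATUS_BOARD=1 "
            "and RUN_POST_EVAL_STATUS_BOARD=1"
        )
    if require_label_row and not status_board:
        conflicts.append("REQUIRE_LABEL_ROW=1 needs RUN_STATUS_BOARD=1")
    return conflicts


# Example: disabling the status board while leaving both strict gates on.
for message in control_conflicts({"HARRIER_BENCH_EVAL_RUN_STATUS_BOARD": "0"}):
    print(message)
```

Passing `dict(os.environ)` instead of the literal mapping would check the current shell environment.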

To run strict eval + outcome/failure summary in one command, use:

bash scripts/bench_lookup_u64_candidate_autotriage.sh

You can also pass a label as the first positional argument (equivalent to setting HARRIER_BENCH_CANDIDATE_LABEL):

bash scripts/bench_lookup_u64_candidate_autotriage.sh u64-my-candidate

This wrapper defaults to strict gates (require_actionable=1, confirmed actionable, exact-label filtering, actionable recheck), runs the eval wrapper, then prints:

  • latest exact-label eval row
  • exact-label attributed failure-reason summary
  • optional verbose row-table toggle (HARRIER_BENCH_AUTOTRIAGE_SHOW_ROW_SECTIONS=1; set to 0 to keep summaries while skipping high-volume row sections in both base-history and global boards)
    • when row sections are disabled, compact candidate output still prints two groups of slices for both the first-eval pipeline/recheck family and the first-eval-actionable-then-failed pipeline/recheck family: verdict-gate runs-completed status-board promotion slices (*_known, *_unknown, *_eq3, *_eq4, and *_in_range with explicit [3,4] bounds), and verdict-gate early-stop slices (including the *_inconclusive/*_reject/*_unknown subsets)
  • optional extended summaries toggle (HARRIER_BENCH_AUTOTRIAGE_SHOW_EXTENDED_SUMMARIES=1; set to 0 to keep only compact summary subsets in base-history/global boards before row-table controls are applied)
    • when extended summaries are disabled, compact global output still prints first-eval pipeline/recheck verdict-gate runs-completed status-board promotion boards (*_known, *_unknown, *_eq3, *_eq4, *_in_range with explicit [3,4] bounds) plus matching first-eval-actionable-then-failed pipeline/recheck runs-completed promotion boards, in addition to compact early-stop boards (including *_inconclusive/*_reject/*_unknown subsets)
  • compact-output mode is enabled by default (HARRIER_BENCH_AUTOTRIAGE_COMPACT_OUTPUT=1; shorthand that forces both HARRIER_BENCH_AUTOTRIAGE_SHOW_EXTENDED_SUMMARIES=0 and HARRIER_BENCH_AUTOTRIAGE_SHOW_ROW_SECTIONS=0; set it to 0 to disable the compact default)
  • promotion-filter parity validation is enabled by default before eval runs (HARRIER_BENCH_AUTOTRIAGE_VALIDATE_PROMOTION_PARITY=1; runs python3 scripts/check_autotriage_promotion_parity.py and aborts wrapper execution on parity drift; set it to 0 to skip this guardrail). The checker verifies compact first-eval/eval run-count + early-stop family coverage across candidate/global boards, status-board promotion-filter implementation, README family markers, and wrapper parity-guard wiring.
  • run-count summaries default to known-metadata rows only (HARRIER_BENCH_AUTOTRIAGE_RUN_COUNT_INCLUDE_UNKNOWN=0; set to 1 to also include legacy rows whose verdict-gate run-count fields are missing, which are summarized as unknown)
  • optional eval failure-family run-count range overlays (HARRIER_BENCH_AUTOTRIAGE_EVAL_PIPELINE_FAILED_PIPELINE_RUNS_MIN, ..._PIPELINE_RUNS_MAX, HARRIER_BENCH_AUTOTRIAGE_EVAL_ACTIONABLE_RECHECK_FAILED_RECHECK_RUNS_MIN, ..._RECHECK_RUNS_MAX; when set, compact base-label/global summaries and candidate/global row sections include configured-range views for eval pipeline_failed* pipeline and actionable_recheck_failed* recheck verdict-gate run-count cohorts)
  • optional first-eval-actionable-then-failed run-count range overlays (HARRIER_BENCH_AUTOTRIAGE_FIRST_EVAL_ACTIONABLE_THEN_FAILED_PIPELINE_RUNS_MIN, ..._PIPELINE_RUNS_MAX, ..._RECHECK_RUNS_MIN, ..._RECHECK_RUNS_MAX; when set, compact base-label summaries and extended global summaries include extra configured-range boards for first-eval actionable-then-failed pipeline and recheck verdict-gate run-count distributions)
  • optional first-eval failure-family run-count range overlays (HARRIER_BENCH_AUTOTRIAGE_FIRST_EVAL_PIPELINE_FAILED_PIPELINE_RUNS_MIN, ..._PIPELINE_RUNS_MAX, HARRIER_BENCH_AUTOTRIAGE_FIRST_EVAL_ACTIONABLE_RECHECK_FAILED_RECHECK_RUNS_MIN, ..._RECHECK_RUNS_MAX; when set, compact base-label summaries plus candidate/global row sections include configured-range views for first-eval pipeline_failed* pipeline and actionable_recheck_failed* recheck verdict-gate run-count cohorts)
  • optional base-label history summary (HARRIER_BENCH_AUTOTRIAGE_SHOW_BASE_HISTORY=1)
    • prints eval-outcome counts and eval failure-family counts for the same normalized base label (the failure-family summary covers failed rows only)
    • additionally prints eval-side pipeline low-load-skipped failure-family counts for that base label (--only-pipeline-low-load-skipped)
    • additionally prints eval-side pipeline low-load-skipped lookup-not-keep failure-family counts for that base label (--only-pipeline-low-load-skipped-lookup-not-keep)
    • additionally prints eval-side pipeline verdict-gate early-stop verdict counts for that base label (--only-pipeline-verdict-gate-early-stopped --summary-key pipeline_lookup_verdict_gate_early_stop_verdict)
    • additionally prints eval-side pipeline verdict-gate runs-completed distributions for that base label (--summary-key pipeline_lookup_verdict_gate_runs_completed with --only-pipeline-verdict-gate-runs-completed-known by default; set HARRIER_BENCH_AUTOTRIAGE_RUN_COUNT_INCLUDE_UNKNOWN=1 to include unknown buckets)
    • additionally prints eval-side recheck verdict-gate early-stop verdict counts for that base label (--only-recheck-verdict-gate-early-stopped --summary-key recheck_pipeline_lookup_verdict_gate_early_stop_verdict)
    • additionally prints eval-side recheck verdict-gate runs-completed distributions for that base label (--summary-key recheck_pipeline_lookup_verdict_gate_runs_completed with --only-recheck-verdict-gate-runs-completed-known by default; set HARRIER_BENCH_AUTOTRIAGE_RUN_COUNT_INCLUDE_UNKNOWN=1 to include unknown buckets)
    • additionally prints eval-side pipeline_failed* verdict-gate early-stop verdict counts for that base label (--only-pipeline-failed-verdict-gate-early-stopped --summary-key pipeline_lookup_verdict_gate_early_stop_verdict)
    • additionally prints eval-side pipeline_failed* verdict-gate runs-completed distributions for that base label (--only-pipeline-failed-verdict-gate-runs-completed-known --summary-key pipeline_lookup_verdict_gate_runs_completed)
    • additionally prints eval-side pipeline_failed* verdict-gate runs-completed unknown-only summaries for that base label (--only-pipeline-failed-verdict-gate-runs-completed-unknown)
    • additionally prints eval-side pipeline_failed* verdict-gate runs-completed exact-3 and exact-4 summaries for that base label (--pipeline-verdict-gate-runs-completed-eq {3,4} scoped with --only-pipeline-failed-verdict-gate-runs-completed-known)
    • additionally prints eval-side pipeline_failed* verdict-gate runs-completed configured-range summaries for that base label when HARRIER_BENCH_AUTOTRIAGE_EVAL_PIPELINE_FAILED_PIPELINE_RUNS_{MIN,MAX} is set
    • additionally prints eval-side pipeline_failed* verdict-gate early-stop inconclusive/reject/unknown focused summaries for that base label (--only-pipeline-failed-verdict-gate-early-stopped-{inconclusive,reject,unknown})
    • additionally prints eval-side actionable_recheck_failed* verdict-gate early-stop verdict counts for that base label (--only-actionable-recheck-failed-verdict-gate-early-stopped --summary-key recheck_pipeline_lookup_verdict_gate_early_stop_verdict)
    • additionally prints eval-side actionable_recheck_failed* recheck verdict-gate runs-completed distributions for that base label (--only-actionable-recheck-failed-verdict-gate-runs-completed-known --summary-key recheck_pipeline_lookup_verdict_gate_runs_completed)
    • additionally prints eval-side actionable_recheck_failed* recheck verdict-gate runs-completed unknown-only summaries for that base label (--only-actionable-recheck-failed-verdict-gate-runs-completed-unknown)
    • additionally prints eval-side actionable_recheck_failed* recheck verdict-gate runs-completed exact-3 and exact-4 summaries for that base label (--recheck-verdict-gate-runs-completed-eq {3,4} scoped with --only-actionable-recheck-failed-verdict-gate-runs-completed-known)
    • additionally prints eval-side actionable_recheck_failed* recheck verdict-gate runs-completed configured-range summaries for that base label when HARRIER_BENCH_AUTOTRIAGE_EVAL_ACTIONABLE_RECHECK_FAILED_RECHECK_RUNS_{MIN,MAX} is set
    • additionally prints eval-side actionable_recheck_failed* verdict-gate early-stop inconclusive/reject/unknown focused summaries for that base label (--only-actionable-recheck-failed-verdict-gate-early-stopped-{inconclusive,reject,unknown})
    • additionally prints first-eval (autotriage sidecar) pipeline verdict-gate early-stop verdict counts for failed first-eval outcomes on that base label (--only-first-eval-pipeline-verdict-gate-early-stopped --summary-key pipeline_lookup_verdict_gate_early_stop_verdict)
    • additionally prints first-eval (autotriage sidecar) pipeline verdict-gate early-stop inconclusive/reject/unknown focused summaries for failed first-eval outcomes on that base label (--only-first-eval-pipeline-verdict-gate-early-stopped-{inconclusive,reject,unknown})
    • additionally prints first-eval (autotriage sidecar) pipeline_failed* verdict-gate runs-completed distributions for that base label (--only-first-eval-pipeline-verdict-gate-runs-completed-known --summary-key pipeline_lookup_verdict_gate_runs_completed; known-only by default unless HARRIER_BENCH_AUTOTRIAGE_RUN_COUNT_INCLUDE_UNKNOWN=1)
    • additionally prints first-eval (autotriage sidecar) pipeline_failed* verdict-gate runs-completed unknown-only summaries for that base label (--only-first-eval-pipeline-verdict-gate-runs-completed-unknown)
    • additionally prints first-eval (autotriage sidecar) pipeline_failed* verdict-gate runs-completed exact-3 and exact-4 summaries for that base label (--only-first-eval-pipeline-verdict-gate-runs-completed-eq {3,4})
    • additionally prints first-eval (autotriage sidecar) pipeline_failed* verdict-gate runs-completed configured-range summaries for that base label when HARRIER_BENCH_AUTOTRIAGE_FIRST_EVAL_PIPELINE_FAILED_PIPELINE_RUNS_{MIN,MAX} is set
    • additionally prints first-eval (autotriage sidecar) recheck verdict-gate early-stop verdict counts for actionable-recheck-failed first-eval outcomes on that base label (--only-first-eval-recheck-verdict-gate-early-stopped --summary-key recheck_pipeline_lookup_verdict_gate_early_stop_verdict)
    • additionally prints first-eval (autotriage sidecar) recheck verdict-gate early-stop inconclusive/reject/unknown focused summaries for actionable-recheck-failed first-eval outcomes on that base label (--only-first-eval-recheck-verdict-gate-early-stopped-{inconclusive,reject,unknown})
    • additionally prints first-eval (autotriage sidecar) actionable_recheck_failed* recheck verdict-gate runs-completed distributions for that base label (--only-first-eval-recheck-verdict-gate-runs-completed-known --summary-key recheck_pipeline_lookup_verdict_gate_runs_completed; known-only by default unless HARRIER_BENCH_AUTOTRIAGE_RUN_COUNT_INCLUDE_UNKNOWN=1)
    • additionally prints first-eval (autotriage sidecar) actionable_recheck_failed* recheck verdict-gate runs-completed unknown-only summaries for that base label (--only-first-eval-recheck-verdict-gate-runs-completed-unknown)
    • additionally prints first-eval (autotriage sidecar) actionable_recheck_failed* recheck verdict-gate runs-completed exact-3 and exact-4 summaries for that base label (--only-first-eval-recheck-verdict-gate-runs-completed-eq {3,4})
    • additionally prints first-eval (autotriage sidecar) actionable_recheck_failed* recheck verdict-gate runs-completed configured-range summaries for that base label when HARRIER_BENCH_AUTOTRIAGE_FIRST_EVAL_ACTIONABLE_RECHECK_FAILED_RECHECK_RUNS_{MIN,MAX} is set
    • additionally prints confirmed actionable eval rows for that base label (eval_outcome=actionable with confirmed actionable rows > 0)
    • additionally prints confirmed actionable autotriage rows for that base label (latest_eval_outcome=actionable with latest_confirmed_actionable_rows > 0)
    • additionally prints autotriage first-eval-actionable-then-failed distribution for that base label in compact mode (--summary-key first_eval_actionable_then_failed)
    • additionally prints autotriage first-eval-actionable-then-failed pipeline/recheck verdict-gate runs-completed distributions for that base label in compact mode (--summary-key {pipeline_lookup_verdict_gate_runs_completed,recheck_pipeline_lookup_verdict_gate_runs_completed}; defaults to known-only --only-first-eval-actionable-then-failed-{pipeline,recheck}-verdict-gate-runs-completed-known, and widens to --only-first-eval-actionable-then-failed-yes when HARRIER_BENCH_AUTOTRIAGE_RUN_COUNT_INCLUDE_UNKNOWN=1)
    • additionally prints autotriage first-eval-actionable-then-failed pipeline/recheck verdict-gate runs-completed unknown-only summaries for that base label in compact mode (--only-first-eval-actionable-then-failed-{pipeline,recheck}-verdict-gate-runs-completed-unknown)
    • additionally prints autotriage first-eval-actionable-then-failed pipeline/recheck verdict-gate runs-completed exact-3 and exact-4 summaries for that base label in compact mode (--only-first-eval-actionable-then-failed-{pipeline,recheck}-verdict-gate-runs-completed-eq {3,4})
    • additionally prints autotriage first-eval-actionable-then-failed pipeline/recheck verdict-gate runs-completed configured-range summaries for that base label in compact mode when HARRIER_BENCH_AUTOTRIAGE_FIRST_EVAL_ACTIONABLE_THEN_FAILED_{PIPELINE,RECHECK}_RUNS_{MIN,MAX} is set
    • additionally prints autotriage first-eval-actionable-then-failed pipeline/recheck verdict-gate early-stop verdict distributions (plus explicit inconclusive/reject/unknown splits) for that base label in compact mode (--only-first-eval-actionable-then-failed-{pipeline,recheck}-verdict-gate-early-stopped{,-inconclusive,-reject,-unknown})
    • additionally prints autotriage pipeline verdict-gate runs-completed distribution for that base label in compact mode (--summary-key pipeline_lookup_verdict_gate_runs_completed; default uses --only-pipeline-verdict-gate-runs-completed-known, and drops the known filter when HARRIER_BENCH_AUTOTRIAGE_RUN_COUNT_INCLUDE_UNKNOWN=1)
    • additionally prints autotriage recheck verdict-gate runs-completed distribution for that base label in compact mode (--summary-key recheck_pipeline_lookup_verdict_gate_runs_completed; default uses --only-recheck-verdict-gate-runs-completed-known, and drops the known filter when HARRIER_BENCH_AUTOTRIAGE_RUN_COUNT_INCLUDE_UNKNOWN=1)
    • additionally prints autotriage yes-only first-eval-actionable-then-failed run-outcome-sequence distribution for that base label in compact mode (--only-first-eval-actionable-then-failed-yes --summary-key run_outcome_sequence)
    • additionally prints autotriage confirmation-state counts for that base label (single_run_complete, single_run_incomplete, multi_run_complete, multi_run_incomplete)
    • additionally prints autotriage confirmation-floor counts for that base label (confirmation_floor_applied=yes|no|unknown)
    • additionally prints latest-failure-family pipeline-lookup-not-keep counts for that base label (latest_failure_family=pipeline_lookup_not_keep)
    • additionally prints latest-failure-family pipeline-low-load-hit-geo-regression counts for that base label (latest_failure_family=pipeline_low_load_hit_geo_regression)
    • additionally prints latest-failure-family pipeline-lookup-not-keep+low-load-hit-geo-regression counts for that base label (latest_failure_family=pipeline_lookup_not_keep+pipeline_low_load_hit_geo_regression)
    • additionally prints latest-failure-family recheck-lookup-not-keep counts for that base label (latest_failure_family=recheck_lookup_not_keep)
    • additionally prints latest-failure-family recheck-low-load-hit-geo-regression counts for that base label (latest_failure_family=recheck_low_load_hit_geo_regression)
    • additionally prints latest-failure-family recheck-lookup-not-keep+low-load-hit-geo-regression counts for that base label (latest_failure_family=recheck_lookup_not_keep+low_load_hit_geo_regression)
    • additionally prints autotriage streak-instability counts for that base label (streak_actionable_before_failure=yes|no|unknown)
    • additionally prints first-eval-outcome counts for that base label (first_eval_outcome)
    • additionally prints first-eval-failed counts for that base label (first_eval_outcome starts with pipeline_failed or actionable_recheck_failed)
    • additionally prints first-eval-lookup-not-keep counts for that base label (first_eval_outcome contains lookup_not_keep)
    • additionally prints first-eval-low-load-hit-geo-regression counts for that base label (first_eval_outcome contains low_load_hit_geo_regression)
    • additionally prints first-eval actionable counts for that base label (first_eval_outcome=actionable)
    • additionally prints first-eval actionable-recheck-failed counts for that base label (first_eval_outcome=actionable_recheck_failed*)
    • additionally prints first-eval actionable-recheck-failed-lookup-not-keep counts for that base label (first_eval_outcome=actionable_recheck_failed_lookup_not_keep)
    • additionally prints first-eval actionable-recheck-failed-low-load-hit-geo-regression counts for that base label (first_eval_outcome=actionable_recheck_failed_low_load_hit_geo_regression)
    • additionally prints first-eval actionable-recheck-failed lookup-not-keep+low-load-hit-geo-regression counts for that base label (first_eval_outcome=actionable_recheck_failed_lookup_not_keep_and_low_load_hit_geo_regression)
    • additionally prints first-eval pipeline-failed-lookup-not-keep counts for that base label (first_eval_outcome=pipeline_failed_lookup_not_keep)
    • additionally prints first-eval pipeline-failed counts for that base label (first_eval_outcome=pipeline_failed*)
    • additionally prints first-eval pipeline-failed-low-load-hit-geo-regression counts for that base label (first_eval_outcome=pipeline_failed_low_load_hit_geo_regression)
    • additionally prints first-eval pipeline-failed lookup-not-keep+low-load-hit-geo-regression counts for that base label (first_eval_outcome=pipeline_failed_lookup_not_keep_and_low_load_hit_geo_regression)
    • additionally prints first-eval-actionable-then-failed counts for that base label (derived from first-eval outcomes that prove a run reached actionable state and still failed overall: either first_eval_outcome=actionable with non-zero eval exit status, or first_eval_outcome=actionable_recheck_failed*)
    • additionally prints first-eval-actionable-then-failed known-only counts for that base label (excludes legacy unknown rows to focus on instrumented streak telemetry)
    • additionally prints first-eval-actionable-then-failed unknown-only counts for that base label (tracks telemetry-coverage lag where first-eval fields were not yet populated)
    • additionally prints first-eval-actionable-then-failed yes-only counts for that base label (instrumented rows where run 1 was actionable and later confirmation runs failed)
    • additionally prints first-eval-actionable-then-failed no-only counts for that base label (instrumented rows where run 1 was not actionable)
    • additionally prints eval pipeline_failed* pipeline verdict-gate runs-completed rows for that base label: scoped (known-only by default, widens to --outcome-filter pipeline_failed_lookup_not_keep,pipeline_failed_low_load_hit_geo_regression,pipeline_failed_lookup_not_keep_and_low_load_hit_geo_regression when HARRIER_BENCH_AUTOTRIAGE_RUN_COUNT_INCLUDE_UNKNOWN=1), explicit known-only, unknown-only, and exact-3/exact-4 splits
    • additionally prints configured-range eval pipeline_failed* pipeline verdict-gate runs-completed rows for that base label when HARRIER_BENCH_AUTOTRIAGE_EVAL_PIPELINE_FAILED_PIPELINE_RUNS_{MIN,MAX} is set
    • additionally prints eval pipeline_failed* pipeline verdict-gate early-stopped rows (plus inconclusive/reject splits) for that base label
    • additionally prints eval pipeline/recheck verdict-gate early-stop unknown-only rows for that base label (non-family-scoped unknown verdict coverage via --only-{pipeline,recheck}-early-stop-unknown)
    • additionally prints eval actionable_recheck_failed* recheck verdict-gate runs-completed rows for that base label: scoped (known-only by default, widens to --outcome-filter actionable_recheck_failed_lookup_not_keep,actionable_recheck_failed_low_load_hit_geo_regression,actionable_recheck_failed_lookup_not_keep_and_low_load_hit_geo_regression when HARRIER_BENCH_AUTOTRIAGE_RUN_COUNT_INCLUDE_UNKNOWN=1), explicit known-only, unknown-only, and exact-3/exact-4 splits
    • additionally prints configured-range eval actionable_recheck_failed* recheck verdict-gate runs-completed rows for that base label when HARRIER_BENCH_AUTOTRIAGE_EVAL_ACTIONABLE_RECHECK_FAILED_RECHECK_RUNS_{MIN,MAX} is set
    • additionally prints eval actionable_recheck_failed* recheck verdict-gate early-stopped rows (plus inconclusive/reject splits) for that base label
    • additionally prints autotriage pipeline/recheck verdict-gate early-stop unknown-only rows for that base label (non-family-scoped unknown verdict coverage via --only-{pipeline,recheck}-early-stop-unknown)
    • additionally prints streak-instability rows for that base label (streak_actionable_before_failure=yes)
    • additionally prints first-eval-actionable-then-failed rows for that base label (rows where first eval reached actionable state and still failed, including actionable_recheck_failed* first-eval outcomes)
    • additionally prints known-only first-eval-actionable-then-failed rows for that base label (all yes|no rows, excluding unknown)
    • additionally prints unknown-only first-eval-actionable-then-failed rows for that base label (first_eval_actionable_then_failed=unknown)
    • additionally prints yes-only first-eval-actionable-then-failed rows for that base label (first_eval_actionable_then_failed=yes)
    • additionally prints no-only first-eval-actionable-then-failed rows for that base label (first_eval_actionable_then_failed=no)
    • additionally prints first-eval-actionable-then-failed pipeline/recheck verdict-gate runs-completed rows for that base label: scoped (known-only by default, widens to yes-only scope when HARRIER_BENCH_AUTOTRIAGE_RUN_COUNT_INCLUDE_UNKNOWN=1), explicit known-only, unknown-only, and exact-3/exact-4 splits
    • additionally prints configured-range first-eval-actionable-then-failed pipeline/recheck verdict-gate runs-completed rows for that base label when HARRIER_BENCH_AUTOTRIAGE_FIRST_EVAL_ACTIONABLE_THEN_FAILED_{PIPELINE,RECHECK}_RUNS_{MIN,MAX} is set
    • additionally prints yes-only first-eval-actionable-then-failed rows for that base label whose pipeline/recheck verdict-gate metadata records an early stop (plus explicit inconclusive/reject row splits)
    • additionally prints first-eval actionable-recheck-failed rows for that base label (first_eval_outcome=actionable_recheck_failed*)
    • additionally prints first-eval-failed rows for that base label (first_eval_outcome starts with pipeline_failed or actionable_recheck_failed)
    • additionally prints first-eval pipeline_failed* pipeline verdict-gate runs-completed rows for that base label: scoped (known-only by default, widens to --first-eval-outcome-filter pipeline_failed when HARRIER_BENCH_AUTOTRIAGE_RUN_COUNT_INCLUDE_UNKNOWN=1), explicit known-only, unknown-only, and exact-3/exact-4 splits
    • additionally prints configured-range first-eval pipeline_failed* pipeline verdict-gate runs-completed rows for that base label when HARRIER_BENCH_AUTOTRIAGE_FIRST_EVAL_PIPELINE_FAILED_PIPELINE_RUNS_{MIN,MAX} is set
    • additionally prints first-eval actionable_recheck_failed* recheck verdict-gate runs-completed rows for that base label: scoped (known-only by default, widens to --first-eval-outcome-filter actionable_recheck_failed when HARRIER_BENCH_AUTOTRIAGE_RUN_COUNT_INCLUDE_UNKNOWN=1), explicit known-only, unknown-only, and exact-3/exact-4 splits
    • additionally prints configured-range first-eval actionable_recheck_failed* recheck verdict-gate runs-completed rows for that base label when HARRIER_BENCH_AUTOTRIAGE_FIRST_EVAL_ACTIONABLE_RECHECK_FAILED_RECHECK_RUNS_{MIN,MAX} is set
    • additionally prints first-eval-lookup-not-keep rows for that base label (first_eval_outcome contains lookup_not_keep)
    • additionally prints first-eval-low-load-hit-geo-regression rows for that base label (first_eval_outcome contains low_load_hit_geo_regression)
    • additionally prints first-eval actionable-recheck-failed-lookup-not-keep rows for that base label (first_eval_outcome=actionable_recheck_failed_lookup_not_keep)
    • additionally prints first-eval actionable-recheck-failed-low-load-hit-geo-regression rows for that base label (first_eval_outcome=actionable_recheck_failed_low_load_hit_geo_regression)
    • additionally prints first-eval actionable-recheck-failed lookup-not-keep+low-load-hit-geo-regression rows for that base label (first_eval_outcome=actionable_recheck_failed_lookup_not_keep_and_low_load_hit_geo_regression)
    • additionally prints first-eval actionable rows for that base label (first_eval_outcome=actionable)
    • additionally prints first-eval pipeline-failed-lookup-not-keep rows for that base label (first_eval_outcome=pipeline_failed_lookup_not_keep)
    • additionally prints first-eval pipeline-failed rows for that base label (first_eval_outcome=pipeline_failed*)
    • additionally prints first-eval pipeline-failed-low-load-hit-geo-regression rows for that base label (first_eval_outcome=pipeline_failed_low_load_hit_geo_regression)
    • additionally prints first-eval pipeline-failed lookup-not-keep+low-load-hit-geo-regression rows for that base label (first_eval_outcome=pipeline_failed_lookup_not_keep_and_low_load_hit_geo_regression)
    • additionally prints latest-failure-family pipeline-lookup-not-keep rows for that base label (latest_failure_family=pipeline_lookup_not_keep)
    • additionally prints latest-failure-family pipeline-low-load-hit-geo-regression rows for that base label (latest_failure_family=pipeline_low_load_hit_geo_regression)
    • additionally prints latest-failure-family pipeline-lookup-not-keep+low-load-hit-geo-regression rows for that base label (latest_failure_family=pipeline_lookup_not_keep+pipeline_low_load_hit_geo_regression)
    • additionally prints latest-failure-family recheck-lookup-not-keep rows for that base label (latest_failure_family=recheck_lookup_not_keep)
    • additionally prints latest-failure-family recheck-low-load-hit-geo-regression rows for that base label (latest_failure_family=recheck_low_load_hit_geo_regression)
    • additionally prints latest-failure-family recheck-lookup-not-keep+low-load-hit-geo-regression rows for that base label (latest_failure_family=recheck_lookup_not_keep+low_load_hit_geo_regression)
    • additionally prints status-board pipeline/recheck verdict-gate early-stop unknown rows for that base label (via bench_candidate_status_board.py --only-{pipeline,recheck}-early-stop-unknown)
    • additionally prints status-board pipeline/recheck verdict-gate runs-completed known/unknown rows for that base label (via bench_candidate_status_board.py --only-{pipeline,recheck}-verdict-gate-runs-completed-{known,unknown})
    • additionally prints status-board pipeline/recheck verdict-gate runs-completed exact-3/exact-4 rows for that base label (via bench_candidate_status_board.py --only-{pipeline,recheck}-verdict-gate-runs-completed-eq {3,4})
    • additionally prints status-board pipeline/recheck verdict-gate runs-completed in-range [3,4] rows for that base label (via bench_candidate_status_board.py --only-{pipeline,recheck}-verdict-gate-runs-completed-min 3 --only-{pipeline,recheck}-verdict-gate-runs-completed-max 4)
    • additionally prints status-board pipeline lookup-gate runs-completed promotion rows for known/unknown/eq3/eq4 and in-range [3,4] slices for that base label (via --promotion-filter u64_pipeline_lookup_gate_runs_completed_{known,unknown,eq3,eq4,in_range} plus in-range bounds when using _in_range)
    • additionally prints status-board autotriage pipeline/recheck verdict-gate runs-completed promotion rows for known/unknown/eq3/eq4 and in-range [3,4] slices for that base label (via --promotion-filter u64_autotriage_{pipeline,recheck}_verdict_gate_runs_completed_{known,unknown,eq3,eq4,in_range} plus in-range bounds when using _in_range)
    • additionally prints status-board autotriage first-eval pipeline/recheck verdict-gate runs-completed promotion rows for known/unknown/eq3/eq4 and in-range [3,4] slices for that base label (via --promotion-filter u64_autotriage_first_eval_pipeline_verdict_gate_runs_completed_{known,unknown,eq3,eq4,in_range} or u64_autotriage_first_eval_recheck_pipeline_verdict_gate_runs_completed_{known,unknown,eq3,eq4,in_range} plus first-eval in-range bounds when using _in_range)
    • additionally prints status-board autotriage first-eval-actionable-recheck-failed recheck verdict-gate runs-completed promotion rows for known/unknown/eq3/eq4 and in-range [3,4] slices for that base label (via --promotion-filter u64_autotriage_first_eval_actionable_recheck_failed_verdict_gate_runs_completed_{known,unknown,eq3,eq4,in_range} plus first-eval recheck in-range bounds when using _in_range)
    • additionally prints status-board eval actionable-recheck-failed recheck verdict-gate runs-completed promotion rows for known/unknown/eq3/eq4 and in-range [3,4] slices for that base label (via --promotion-filter u64_eval_actionable_recheck_failed_verdict_gate_runs_completed_{known,unknown,eq3,eq4,in_range} plus eval actionable-recheck-failed recheck in-range bounds when using _in_range)
    • additionally prints status-board autotriage first-eval-actionable-then-failed pipeline/recheck verdict-gate runs-completed promotion rows for known/unknown/eq3/eq4 and in-range [3,4] slices for that base label (via --promotion-filter u64_autotriage_first_eval_actionable_then_failed_{pipeline,recheck}_verdict_gate_runs_completed_{known,unknown,eq3,eq4,in_range} plus first-eval-actionable-then-failed in-range bounds when using _in_range)
    • additionally prints status-board autotriage first-eval-actionable-then-failed pipeline/recheck verdict-gate early-stop promotion rows for {early_stopped,inconclusive,reject,unknown} slices for that base label (via --promotion-filter u64_autotriage_first_eval_actionable_then_failed_{pipeline,recheck}_verdict_gate_early_stopped{,_inconclusive,_reject,_unknown})
    • additionally prints status-board autotriage first-eval pipeline/recheck verdict-gate early-stop promotion rows for {early_stopped,inconclusive,reject,unknown} slices for that base label (via --promotion-filter u64_autotriage_first_eval_{pipeline,recheck_pipeline}_verdict_gate_early_stopped{,_inconclusive,_reject,_unknown})
    • additionally prints status-board autotriage first-eval-actionable-recheck-failed recheck verdict-gate early-stop promotion rows for {early_stopped,inconclusive,reject,unknown} slices for that base label (via --promotion-filter u64_autotriage_first_eval_actionable_recheck_failed_verdict_gate_early_stopped{,_inconclusive,_reject,_unknown})
    • additionally prints status-board eval actionable-recheck-failed recheck verdict-gate early-stop promotion rows for {early_stopped,inconclusive,reject,unknown} slices for that base label (via --promotion-filter u64_eval_actionable_recheck_failed_verdict_gate_early_stopped{,_inconclusive,_reject,_unknown})
    • additionally prints status-board promotion-filter rows for that base label across pipeline/eval/autotriage early-stop unknown cohorts (via bench_candidate_status_board.py --promotion-filter u64_{pipeline_lookup_gate,eval_pipeline_verdict_gate,eval_recheck_pipeline_verdict_gate,autotriage_pipeline_verdict_gate,autotriage_recheck_pipeline_verdict_gate}_early_stopped_unknown)
    • summaries are emitted after sidecar write, so base-history views include the just-produced autotriage row
  • optional global failed-reason board over latest label-base rows (HARRIER_BENCH_AUTOTRIAGE_SHOW_GLOBAL_FAILED_REASON_BOARD=1)
    • prints eval-side failure reasons and autotriage failure families, plus autotriage confirmation-state, confirmation-floor, and streak-actionable-before-failure distributions
    • additionally prints latest-failure-family pipeline-lookup-not-keep distribution (latest per base)
    • additionally prints latest-failure-family pipeline-low-load-hit-geo-regression distribution (latest per base)
    • additionally prints latest-failure-family pipeline-lookup-not-keep+low-load-hit-geo-regression distribution (latest per base)
    • additionally prints latest-failure-family recheck-lookup-not-keep distribution (latest per base)
    • additionally prints latest-failure-family recheck-low-load-hit-geo-regression distribution (latest per base)
    • additionally prints latest-failure-family recheck-lookup-not-keep+low-load-hit-geo-regression distribution (latest per base)
    • additionally prints first-eval-outcome distribution
    • additionally prints first-eval-failed distribution (latest-label-base rows where first eval failed in pipeline stage or actionable recheck)
    • additionally prints first-eval pipeline verdict-gate early-stop verdict distribution (latest-label-base rows where first eval failed and pipeline verdict-gate metadata reports early-stop)
    • additionally prints first-eval pipeline verdict-gate early-stop inconclusive/reject/unknown focused distributions (latest-label-base rows where first eval failed and pipeline verdict-gate metadata reports those verdicts)
    • additionally prints first-eval recheck verdict-gate early-stop verdict distribution (latest-label-base rows where first eval is actionable_recheck_failed* and recheck verdict-gate metadata reports early-stop)
    • additionally prints first-eval recheck verdict-gate early-stop inconclusive/reject/unknown focused distributions (latest-label-base rows where first eval is actionable_recheck_failed* and recheck verdict-gate metadata reports those verdicts)
    • additionally prints first-eval pipeline_failed* verdict-gate runs-completed distributions (latest-label-base rows filtered with --only-first-eval-pipeline-verdict-gate-runs-completed-known; known metadata by default, or include unknown buckets with HARRIER_BENCH_AUTOTRIAGE_RUN_COUNT_INCLUDE_UNKNOWN=1)
    • additionally prints first-eval pipeline_failed* verdict-gate runs-completed unknown-only boards (latest-label-base rows filtered with --only-first-eval-pipeline-verdict-gate-runs-completed-unknown)
    • additionally prints first-eval pipeline_failed* verdict-gate runs-completed exact-3 and exact-4 boards (latest-label-base rows filtered with --only-first-eval-pipeline-verdict-gate-runs-completed-eq {3,4})
    • additionally prints first-eval actionable_recheck_failed* recheck verdict-gate runs-completed distributions (latest-label-base rows filtered with --only-first-eval-recheck-verdict-gate-runs-completed-known; known metadata by default, or include unknown buckets with HARRIER_BENCH_AUTOTRIAGE_RUN_COUNT_INCLUDE_UNKNOWN=1)
    • additionally prints first-eval actionable_recheck_failed* recheck verdict-gate runs-completed unknown-only boards (latest-label-base rows filtered with --only-first-eval-recheck-verdict-gate-runs-completed-unknown)
    • additionally prints first-eval actionable_recheck_failed* recheck verdict-gate runs-completed exact-3 and exact-4 boards (latest-label-base rows filtered with --only-first-eval-recheck-verdict-gate-runs-completed-eq {3,4})
    • additionally prints eval-side pipeline_failed* verdict-gate early-stop verdict distribution (latest-label-base rows where eval outcome starts with pipeline_failed and pipeline verdict-gate metadata reports early-stop)
    • additionally prints eval-side pipeline_failed* verdict-gate runs-completed distributions (latest-label-base rows where eval outcome starts with pipeline_failed and pipeline run-count metadata is known by default; set HARRIER_BENCH_AUTOTRIAGE_RUN_COUNT_INCLUDE_UNKNOWN=1 to include unknown buckets)
    • additionally prints eval-side pipeline_failed* verdict-gate runs-completed unknown-only boards (latest-label-base rows filtered with --only-pipeline-failed-verdict-gate-runs-completed-unknown)
    • additionally prints eval-side pipeline_failed* verdict-gate runs-completed exact-3 and exact-4 boards (latest-label-base rows filtered with --pipeline-verdict-gate-runs-completed-eq {3,4})
    • additionally prints eval-side pipeline_failed* verdict-gate runs-completed configured-range boards (latest-label-base rows filtered by HARRIER_BENCH_AUTOTRIAGE_EVAL_PIPELINE_FAILED_PIPELINE_RUNS_{MIN,MAX} when set)
    • additionally prints eval-side pipeline_failed* verdict-gate early-stop inconclusive/reject/unknown focused distributions (latest-label-base rows where eval outcome starts with pipeline_failed and pipeline verdict-gate metadata reports those verdict classes)
    • additionally prints generic eval pipeline/recheck verdict-gate early-stop unknown-only boards (latest-label-base rows filtered with --only-{pipeline,recheck}-early-stop-unknown)
    • additionally prints eval-side actionable_recheck_failed* verdict-gate early-stop verdict distribution (latest-label-base rows where eval outcome starts with actionable_recheck_failed and recheck verdict-gate metadata reports early-stop)
    • additionally prints eval-side actionable_recheck_failed* recheck verdict-gate runs-completed distributions (latest-label-base rows where eval outcome starts with actionable_recheck_failed and recheck run-count metadata is known by default; set HARRIER_BENCH_AUTOTRIAGE_RUN_COUNT_INCLUDE_UNKNOWN=1 to include unknown buckets)
    • additionally prints eval-side actionable_recheck_failed* recheck verdict-gate runs-completed unknown-only boards (latest-label-base rows filtered with --only-actionable-recheck-failed-verdict-gate-runs-completed-unknown)
    • additionally prints eval-side actionable_recheck_failed* recheck verdict-gate runs-completed exact-3 and exact-4 boards (latest-label-base rows filtered with --recheck-verdict-gate-runs-completed-eq {3,4})
    • additionally prints eval-side actionable_recheck_failed* recheck verdict-gate runs-completed configured-range boards (latest-label-base rows filtered by HARRIER_BENCH_AUTOTRIAGE_EVAL_ACTIONABLE_RECHECK_FAILED_RECHECK_RUNS_{MIN,MAX} when set)
    • additionally prints eval-side actionable_recheck_failed* verdict-gate early-stop inconclusive/reject/unknown focused distributions (latest-label-base rows where eval outcome starts with actionable_recheck_failed and recheck verdict-gate metadata reports those verdict classes)
    • additionally prints generic autotriage pipeline/recheck verdict-gate early-stop unknown-only boards (latest-label-base rows filtered with --only-{pipeline,recheck}-early-stop-unknown)
    • additionally prints status-board promotion-filter unknown boards for pipeline/eval/autotriage verdict-gate cohorts (latest per label base)
    • additionally prints status-board pipeline lookup-gate runs-completed promotion boards for known/unknown/eq3/eq4/in-range [3,4] slices (latest per label base, via --promotion-filter u64_pipeline_lookup_gate_runs_completed_{known,unknown,eq3,eq4,in_range} plus in-range bounds when using _in_range)
    • additionally prints status-board pipeline/recheck verdict-gate runs-completed known/unknown boards (latest per label base, via bench_candidate_status_board.py --only-{pipeline,recheck}-verdict-gate-runs-completed-{known,unknown})
    • additionally prints status-board pipeline/recheck verdict-gate runs-completed exact-3/exact-4 boards (latest per label base, via bench_candidate_status_board.py --only-{pipeline,recheck}-verdict-gate-runs-completed-eq {3,4})
    • additionally prints status-board pipeline/recheck verdict-gate runs-completed in-range [3,4] boards (latest per label base, via bench_candidate_status_board.py --only-{pipeline,recheck}-verdict-gate-runs-completed-min 3 --only-{pipeline,recheck}-verdict-gate-runs-completed-max 4)
    • additionally prints status-board autotriage pipeline/recheck verdict-gate runs-completed known/unknown promotion boards (latest per label base, via bench_candidate_status_board.py --promotion-filter u64_autotriage_{pipeline,recheck}_verdict_gate_runs_completed_{known,unknown})
    • additionally prints status-board autotriage pipeline/recheck verdict-gate runs-completed exact-3/exact-4 promotion boards (latest per label base, via bench_candidate_status_board.py --promotion-filter u64_autotriage_{pipeline,recheck}_verdict_gate_runs_completed_{eq3,eq4})
    • additionally prints status-board autotriage pipeline/recheck verdict-gate runs-completed in-range [3,4] promotion boards (latest per label base, via bench_candidate_status_board.py --promotion-filter u64_autotriage_{pipeline,recheck}_verdict_gate_runs_completed_in_range --only-{pipeline,recheck}-verdict-gate-runs-completed-min 3 --only-{pipeline,recheck}-verdict-gate-runs-completed-max 4)
    • additionally prints status-board autotriage first-eval pipeline/recheck verdict-gate runs-completed known/unknown/eq3/eq4 and in-range [3,4] promotion boards (latest per label base, via bench_candidate_status_board.py --promotion-filter u64_autotriage_first_eval_pipeline_verdict_gate_runs_completed_{known,unknown,eq3,eq4,in_range} or u64_autotriage_first_eval_recheck_pipeline_verdict_gate_runs_completed_{known,unknown,eq3,eq4,in_range} plus --only-first-eval-{pipeline,recheck}-verdict-gate-runs-completed-{min,max} bounds when using _in_range)
    • additionally prints status-board autotriage first-eval-actionable-recheck-failed recheck verdict-gate runs-completed known/unknown/eq3/eq4 and in-range [3,4] promotion boards (latest per label base, via bench_candidate_status_board.py --promotion-filter u64_autotriage_first_eval_actionable_recheck_failed_verdict_gate_runs_completed_{known,unknown,eq3,eq4,in_range} plus --only-first-eval-recheck-verdict-gate-runs-completed-{min,max} bounds when using _in_range)
    • additionally prints status-board eval actionable-recheck-failed recheck verdict-gate runs-completed known/unknown/eq3/eq4 and in-range [3,4] promotion boards (latest per label base, via bench_candidate_status_board.py --promotion-filter u64_eval_actionable_recheck_failed_verdict_gate_runs_completed_{known,unknown,eq3,eq4,in_range} plus --only-eval-actionable-recheck-failed-verdict-gate-runs-completed-{min,max} bounds when using _in_range)
    • additionally prints status-board autotriage first-eval-actionable-then-failed pipeline/recheck verdict-gate runs-completed known/unknown/eq3/eq4 and in-range [3,4] promotion boards (latest per label base, via bench_candidate_status_board.py --promotion-filter u64_autotriage_first_eval_actionable_then_failed_{pipeline,recheck}_verdict_gate_runs_completed_{known,unknown,eq3,eq4,in_range} plus --only-first-eval-actionable-then-failed-{pipeline,recheck}-verdict-gate-runs-completed-{min,max} bounds when using _in_range)
    • additionally prints status-board autotriage first-eval-actionable-then-failed pipeline/recheck verdict-gate early-stop promotion boards (latest per label base) for {early_stopped,inconclusive,reject,unknown} slices (via bench_candidate_status_board.py --promotion-filter u64_autotriage_first_eval_actionable_then_failed_{pipeline,recheck}_verdict_gate_early_stopped{,_inconclusive,_reject,_unknown})
    • when HARRIER_BENCH_AUTOTRIAGE_SHOW_GLOBAL_FAILED_REASON_BOARD=1 and HARRIER_BENCH_AUTOTRIAGE_SHOW_EXTENDED_SUMMARIES=0, compact global output still includes first-eval-actionable-then-failed early-stop status-board promotion boards for pipeline/recheck (plus their *_unknown slices), and first-eval pipeline/recheck early-stop promotion boards (plus their *_unknown slices)
    • additionally prints status-board autotriage first-eval pipeline/recheck verdict-gate early-stop promotion boards (latest per label base) for {early_stopped,inconclusive,reject,unknown} slices (via bench_candidate_status_board.py --promotion-filter u64_autotriage_first_eval_{pipeline,recheck_pipeline}_verdict_gate_early_stopped{,_inconclusive,_reject,_unknown})
    • additionally prints status-board autotriage first-eval-actionable-recheck-failed recheck verdict-gate early-stop promotion boards (latest per label base) for {early_stopped,inconclusive,reject,unknown} slices (via bench_candidate_status_board.py --promotion-filter u64_autotriage_first_eval_actionable_recheck_failed_verdict_gate_early_stopped{,_inconclusive,_reject,_unknown})
    • additionally prints status-board eval actionable-recheck-failed recheck verdict-gate early-stop promotion boards (latest per label base) for {early_stopped,inconclusive,reject,unknown} slices (via bench_candidate_status_board.py --promotion-filter u64_eval_actionable_recheck_failed_verdict_gate_early_stopped{,_inconclusive,_reject,_unknown})
    • additionally prints first-eval-lookup-not-keep distribution (latest-label-base rows where first eval indicates lookup keep failure in pipeline or actionable recheck)
    • additionally prints first-eval-low-load-hit-geo-regression distribution (latest-label-base rows where first eval indicates low-load hit-geo regression in pipeline or actionable recheck)
    • additionally prints first-eval actionable distribution (latest-label-base rows where first eval reached actionable outcome)
    • additionally prints first-eval actionable-recheck-failed distribution (latest-label-base rows where first eval already failed during actionable recheck)
    • additionally prints first-eval actionable-recheck-failed-lookup-not-keep distribution (latest-label-base rows where first eval failed in actionable recheck due to lookup keep gate)
    • additionally prints first-eval actionable-recheck-failed-low-load-hit-geo-regression distribution (latest-label-base rows where first eval failed in actionable recheck due to low-load hit-geo gate)
    • additionally prints first-eval actionable-recheck-failed lookup-not-keep+low-load-hit-geo-regression distribution (latest-label-base rows where first eval failed in actionable recheck due to combined lookup-keep and low-load hit-geo gates)
    • additionally prints first-eval pipeline-failed-lookup-not-keep distribution (latest-label-base rows where first eval failed lookup keep immediately)
    • additionally prints first-eval pipeline-failed distribution (latest-label-base rows where first eval failed in pipeline stage)
    • additionally prints first-eval pipeline-failed-low-load-hit-geo-regression distribution (latest-label-base rows where first eval failed on low-load hit-geo gate)
    • additionally prints first-eval pipeline-failed lookup-not-keep+low-load-hit-geo-regression distribution (latest-label-base rows where first eval failed both lookup keep and low-load hit-geo gates)
    • additionally prints first-eval-actionable-then-failed distribution
    • additionally prints first-eval-actionable-then-failed known-only distribution (latest-label-base rows with unknown excluded)
    • additionally prints first-eval-actionable-then-failed unknown-only distribution (latest-label-base rows where telemetry remains unknown)
    • additionally prints first-eval-actionable-then-failed yes-only distribution (latest-label-base rows where first eval was actionable but later runs failed)
    • additionally prints first-eval-actionable-then-failed yes-only pipeline/recheck verdict-gate runs-completed distributions (known-only by default; widens to include unknown run-count metadata when HARRIER_BENCH_AUTOTRIAGE_RUN_COUNT_INCLUDE_UNKNOWN=1)
    • additionally prints first-eval-actionable-then-failed yes-only pipeline/recheck verdict-gate runs-completed known-only distributions
    • additionally prints first-eval-actionable-then-failed yes-only pipeline/recheck verdict-gate runs-completed unknown-only distributions
    • additionally prints first-eval-actionable-then-failed yes-only pipeline/recheck verdict-gate runs-completed exact-3 and exact-4 distributions
    • additionally prints first-eval-actionable-then-failed yes-only pipeline/recheck verdict-gate runs-completed configured-range distributions when HARRIER_BENCH_AUTOTRIAGE_FIRST_EVAL_ACTIONABLE_THEN_FAILED_{PIPELINE,RECHECK}_RUNS_{MIN,MAX} is set
    • additionally prints first-eval-actionable-then-failed yes-only pipeline/recheck verdict-gate early-stop distributions (plus explicit inconclusive and reject boards)
    • additionally prints first-eval-actionable-then-failed no-only distribution (latest-label-base rows where first eval was instrumented but not actionable)
    • additionally prints latest-label-base eval pipeline_failed* pipeline verdict-gate runs-completed rows: scoped (known-only by default, widened with HARRIER_BENCH_AUTOTRIAGE_RUN_COUNT_INCLUDE_UNKNOWN=1), explicit known-only, unknown-only, and exact-3/exact-4 splits
    • additionally prints latest-label-base eval pipeline_failed* pipeline verdict-gate runs-completed configured-range rows when HARRIER_BENCH_AUTOTRIAGE_EVAL_PIPELINE_FAILED_PIPELINE_RUNS_{MIN,MAX} is set
    • additionally prints latest-label-base eval pipeline_failed* pipeline verdict-gate early-stopped rows (plus inconclusive/reject splits)
    • additionally prints latest-label-base eval pipeline/recheck verdict-gate early-stop unknown-only rows (filtered with --only-{pipeline,recheck}-early-stop-unknown)
    • additionally prints latest-label-base eval actionable_recheck_failed* recheck verdict-gate runs-completed rows: scoped (known-only by default, widened with HARRIER_BENCH_AUTOTRIAGE_RUN_COUNT_INCLUDE_UNKNOWN=1), explicit known-only, unknown-only, and exact-3/exact-4 splits
    • additionally prints latest-label-base eval actionable_recheck_failed* recheck verdict-gate runs-completed configured-range rows when HARRIER_BENCH_AUTOTRIAGE_EVAL_ACTIONABLE_RECHECK_FAILED_RECHECK_RUNS_{MIN,MAX} is set
    • additionally prints latest-label-base eval actionable_recheck_failed* recheck verdict-gate early-stopped rows (plus inconclusive/reject splits)
    • additionally prints latest-label-base streak-instability rows (streak_actionable_before_failure=yes, latest per normalized base label)
    • additionally prints latest-label-base first-eval-actionable-then-failed rows (first_eval_outcome=actionable and non-zero eval exit status)
    • additionally prints latest-label-base known-only first-eval-actionable-then-failed rows (yes|no, excluding unknown)
    • additionally prints latest-label-base unknown-only first-eval-actionable-then-failed rows (first_eval_actionable_then_failed=unknown)
    • additionally prints latest-label-base yes-only first-eval-actionable-then-failed rows (first_eval_actionable_then_failed=yes)
    • additionally prints latest-label-base no-only first-eval-actionable-then-failed rows (first_eval_actionable_then_failed=no)
    • additionally prints latest-label-base first-eval-actionable-then-failed pipeline/recheck verdict-gate runs-completed rows: scoped (known-only by default, widens to yes-only scope when HARRIER_BENCH_AUTOTRIAGE_RUN_COUNT_INCLUDE_UNKNOWN=1), explicit known-only, unknown-only, and exact-3/exact-4 splits
    • additionally prints latest-label-base configured-range first-eval-actionable-then-failed pipeline/recheck verdict-gate runs-completed rows when HARRIER_BENCH_AUTOTRIAGE_FIRST_EVAL_ACTIONABLE_THEN_FAILED_{PIPELINE,RECHECK}_RUNS_{MIN,MAX} is set
    • additionally prints latest-label-base yes-only first-eval-actionable-then-failed rows where recheck verdict-gate metadata early-stopped (plus explicit inconclusive/reject row splits)
    • additionally prints latest-label-base first-eval actionable-recheck-failed rows (first_eval_outcome=actionable_recheck_failed*)
    • additionally prints latest-label-base first-eval-failed rows (first_eval_outcome starts with pipeline_failed or actionable_recheck_failed)
    • additionally prints latest-label-base first-eval pipeline verdict-gate early-stopped rows (first_eval_outcome starts with pipeline_failed or actionable_recheck_failed, plus pipeline_lookup_verdict_gate_early_stopped=yes)
    • additionally prints latest-label-base first-eval recheck verdict-gate early-stopped rows (first_eval_outcome=actionable_recheck_failed* plus recheck_pipeline_lookup_verdict_gate_early_stopped=yes)
    • additionally prints latest-label-base autotriage pipeline/recheck verdict-gate early-stop unknown-only rows (filtered with --only-{pipeline,recheck}-early-stop-unknown)
    • additionally prints latest-label-base status-board promotion-filter unknown rows for pipeline/eval/autotriage verdict-gate cohorts
    • additionally prints latest-label-base status-board pipeline lookup-gate runs-completed promotion rows for known/unknown/eq3/eq4/in-range [3,4] slices (via --promotion-filter u64_pipeline_lookup_gate_runs_completed_{known,unknown,eq3,eq4,in_range} plus in-range bounds when using _in_range)
    • additionally prints latest-label-base status-board pipeline/recheck verdict-gate runs-completed known/unknown rows (via bench_candidate_status_board.py --only-{pipeline,recheck}-verdict-gate-runs-completed-{known,unknown})
    • additionally prints latest-label-base status-board pipeline/recheck verdict-gate runs-completed exact-3/exact-4 rows (via bench_candidate_status_board.py --only-{pipeline,recheck}-verdict-gate-runs-completed-eq {3,4})
    • additionally prints latest-label-base status-board pipeline/recheck verdict-gate runs-completed in-range [3,4] rows (via bench_candidate_status_board.py --only-{pipeline,recheck}-verdict-gate-runs-completed-min 3 --only-{pipeline,recheck}-verdict-gate-runs-completed-max 4)
    • additionally prints latest-label-base status-board autotriage pipeline/recheck verdict-gate runs-completed known/unknown promotion rows (via bench_candidate_status_board.py --promotion-filter u64_autotriage_{pipeline,recheck}_verdict_gate_runs_completed_{known,unknown})
    • additionally prints latest-label-base status-board autotriage pipeline/recheck verdict-gate runs-completed exact-3/exact-4 promotion rows (via bench_candidate_status_board.py --promotion-filter u64_autotriage_{pipeline,recheck}_verdict_gate_runs_completed_{eq3,eq4})
    • additionally prints latest-label-base status-board autotriage pipeline/recheck verdict-gate runs-completed in-range [3,4] promotion rows (via bench_candidate_status_board.py --promotion-filter u64_autotriage_{pipeline,recheck}_verdict_gate_runs_completed_in_range --only-{pipeline,recheck}-verdict-gate-runs-completed-min 3 --only-{pipeline,recheck}-verdict-gate-runs-completed-max 4)
    • additionally prints latest-label-base status-board autotriage first-eval pipeline/recheck verdict-gate runs-completed known/unknown/eq3/eq4 and in-range [3,4] promotion rows (via bench_candidate_status_board.py --promotion-filter u64_autotriage_first_eval_pipeline_verdict_gate_runs_completed_{known,unknown,eq3,eq4,in_range} or u64_autotriage_first_eval_recheck_pipeline_verdict_gate_runs_completed_{known,unknown,eq3,eq4,in_range} plus --only-first-eval-{pipeline,recheck}-verdict-gate-runs-completed-{min,max} bounds when using _in_range)
    • additionally prints latest-label-base status-board autotriage first-eval-actionable-then-failed pipeline/recheck verdict-gate runs-completed known/unknown/eq3/eq4 and in-range [3,4] promotion rows (via bench_candidate_status_board.py --promotion-filter u64_autotriage_first_eval_actionable_then_failed_{pipeline,recheck}_verdict_gate_runs_completed_{known,unknown,eq3,eq4,in_range} plus --only-first-eval-actionable-then-failed-{pipeline,recheck}-verdict-gate-runs-completed-{min,max} bounds when using _in_range)
    • additionally prints latest-label-base status-board autotriage first-eval-actionable-then-failed pipeline/recheck verdict-gate early-stop promotion rows for {early_stopped,inconclusive,reject,unknown} slices (via bench_candidate_status_board.py --promotion-filter u64_autotriage_first_eval_actionable_then_failed_{pipeline,recheck}_verdict_gate_early_stopped{,_inconclusive,_reject,_unknown})
    • additionally prints latest-label-base status-board autotriage first-eval pipeline/recheck verdict-gate early-stop promotion rows for {early_stopped,inconclusive,reject,unknown} slices (via bench_candidate_status_board.py --promotion-filter u64_autotriage_first_eval_{pipeline,recheck_pipeline}_verdict_gate_early_stopped{,_inconclusive,_reject,_unknown})
    • additionally prints latest-label-base first-eval-lookup-not-keep rows (first_eval_outcome contains lookup_not_keep)
    • additionally prints latest-label-base first-eval-low-load-hit-geo-regression rows (first_eval_outcome contains low_load_hit_geo_regression)
    • additionally prints latest-label-base first-eval actionable-recheck-failed-lookup-not-keep rows (first_eval_outcome=actionable_recheck_failed_lookup_not_keep)
    • additionally prints latest-label-base first-eval actionable-recheck-failed-low-load-hit-geo-regression rows (first_eval_outcome=actionable_recheck_failed_low_load_hit_geo_regression)
    • additionally prints latest-label-base first-eval actionable-recheck-failed lookup-not-keep+low-load-hit-geo-regression rows (first_eval_outcome=actionable_recheck_failed_lookup_not_keep_and_low_load_hit_geo_regression)
    • additionally prints latest-label-base first-eval actionable rows (first_eval_outcome=actionable)
    • additionally prints latest-label-base first-eval pipeline-failed-lookup-not-keep rows (first_eval_outcome=pipeline_failed_lookup_not_keep)
    • additionally prints latest-label-base first-eval pipeline-failed rows (first_eval_outcome=pipeline_failed*)
    • additionally prints latest-label-base first-eval pipeline-failed-low-load-hit-geo-regression rows (first_eval_outcome=pipeline_failed_low_load_hit_geo_regression)
    • additionally prints latest-label-base first-eval pipeline-failed lookup-not-keep+low-load-hit-geo-regression rows (first_eval_outcome=pipeline_failed_lookup_not_keep_and_low_load_hit_geo_regression)
    • additionally prints latest-label-base latest-failure-family pipeline-lookup-not-keep rows (latest_failure_family=pipeline_lookup_not_keep)
    • additionally prints latest-label-base latest-failure-family pipeline-low-load-hit-geo-regression rows (latest_failure_family=pipeline_low_load_hit_geo_regression)
    • additionally prints latest-label-base latest-failure-family pipeline-lookup-not-keep+low-load-hit-geo-regression rows (latest_failure_family=pipeline_lookup_not_keep+low_load_hit_geo_regression, or its legacy alias pipeline_lookup_not_keep+pipeline_low_load_hit_geo_regression)
    • additionally prints latest-label-base latest-failure-family recheck-lookup-not-keep rows (latest_failure_family=recheck_lookup_not_keep)
    • additionally prints latest-label-base latest-failure-family recheck-low-load-hit-geo-regression rows (latest_failure_family=recheck_low_load_hit_geo_regression)
    • additionally prints latest-label-base latest-failure-family recheck-lookup-not-keep+low-load-hit-geo-regression rows (latest_failure_family=recheck_lookup_not_keep+low_load_hit_geo_regression)
  • by default appends a timestamp suffix to the provided label to avoid history collisions (HARRIER_BENCH_AUTOTRIAGE_APPEND_STAMP=1)
  • supports optional multi-run confirmation via HARRIER_BENCH_AUTOTRIAGE_CONFIRMATION_RUNS (default 1), with an actionable-success floor controlled by HARRIER_BENCH_AUTOTRIAGE_MIN_CONFIRMATION_RUNS_ON_SUCCESS (default 2):
    • effective run target is max(confirmation_runs, min_confirmation_runs_on_success)
    • run 1 uses the stamped candidate label
    • runs 2+ append -streakN suffixes
    • wrapper stops at the first failed eval run and reports that run's label/outcome
    • sidecar candidate_label_base remains anchored to the requested base label, while the executed streak label is recorded via last_eval_label
    • sidecar metadata includes confirmation_runs (effective target), confirmation_runs_requested, min_confirmation_runs_on_success, and confirmation_floor_applied (yes when the minimum-success floor raised effective confirmation runs above requested runs)
  • by default refuses to run when the git working tree is dirty (excluding untracked files) so candidate labels always map to committed code; override with HARRIER_BENCH_AUTOTRIAGE_ALLOW_DIRTY_TREE=1 when intentionally benchmarking uncommitted edits
  • optionally writes an autotriage sidecar report (HARRIER_BENCH_AUTOTRIAGE_WRITE_REPORT=1) to HARRIER_BENCH_AUTOTRIAGE_SUMMARY_DIR (default: benchmarks/results), containing eval exit status and latest exact-label outcome fields, including latest_confirmed_actionable_rows, latest_failure_reason, and latest_failure_family, plus run-sequence diagnostics:
    • first_eval_label, first_eval_outcome, first_eval_exit_status
    • streak_actionable_before_failure (flags unstable streaks where an earlier run was actionable before a later failure)
    • run_outcome_sequence (label:outcome:exit_status fragments joined by ;)
    • metadata fields for last_eval_label, confirmation_runs, and completed_eval_runs
    • two label fields in every sidecar row:
      • candidate_label (the latest executed eval label, e.g. one ending in -streak2)
      • requested_candidate_label (the originally requested autotriage label)
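
The confirmation-run rules and run-sequence fields above can be modelled in a few lines of Python. This is an illustrative sketch, not the wrapper script's code: the function names are hypothetical, the "no" value for confirmation_floor_applied is assumed (the text only specifies "yes"), N in -streakN is assumed to be the 1-based run index, and "u64-my-candidate-STAMP" stands in for whatever stamped label the wrapper actually produces.

```python
def effective_run_target(confirmation_runs=1, min_confirmation_runs_on_success=2):
    """Return (effective run target, confirmation_floor_applied).

    Defaults mirror the documented defaults (1 requested run, floor of 2).
    The "no" value is an assumption; only "yes" is documented.
    """
    effective = max(confirmation_runs, min_confirmation_runs_on_success)
    floor_applied = "yes" if effective > confirmation_runs else "no"
    return effective, floor_applied


def eval_labels(stamped_label, run_target):
    """Run 1 uses the stamped label; runs 2+ append -streakN.

    Assumes N is the 1-based run index (so run 2 ends in -streak2).
    """
    return [
        stamped_label if n == 1 else f"{stamped_label}-streak{n}"
        for n in range(1, run_target + 1)
    ]


def format_run_outcome_sequence(runs):
    """Join (label, outcome, exit_status) triples as label:outcome:exit_status;..."""
    return ";".join(f"{label}:{outcome}:{status}" for label, outcome, status in runs)


def parse_run_outcome_sequence(text):
    """Split a run_outcome_sequence string back into triples.

    rsplit keeps any ':' inside the label intact, since outcome and
    exit status are always the last two fields of a fragment.
    """
    rows = []
    for fragment in filter(None, text.split(";")):
        label, outcome, status = fragment.rsplit(":", 2)
        rows.append((label, outcome, int(status)))
    return rows


# With the documented defaults the floor raises the target from 1 to 2.
target, floor_applied = effective_run_target()
labels = eval_labels("u64-my-candidate-STAMP", target)
```

With the defaults, the floor applies and two eval labels are produced, the second ending in -streak2. A sequence in which the first run is actionable and a later run fails is the case that streak_actionable_before_failure flags.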

Override any eval env var as needed, for example:

HARRIER_BENCH_CANDIDATE_LABEL=u64-my-candidate \
HARRIER_BENCH_EVAL_DRY_RUN=1 \
bash scripts/bench_lookup_u64_candidate_autotriage.sh

Disable timestamp suffixing for fixed-label reruns:

HARRIER_BENCH_CANDIDATE_LABEL=u64-my-candidate \
HARRIER_BENCH_AUTOTRIAGE_APPEND_STAMP=0 \
bash scripts/bench_lookup_u64_candidate_autotriage.sh

To summarize candidate recheck reports (confirmation-window artifacts), use:

python3 scripts/bench_candidate_recheck_outcomes.py --kind all --limit 20

Useful filters:

  • --kind lookup|insert|all
  • --label-filter <substring>
  • --limit <N> (-1 keeps all rows)

To summarize u64 eval outcomes (including pipeline/recheck reason row counts), use:

python3 scripts/bench_u64_eval_outcomes.py --limit 20

Useful filters:

  • --label-filter <substring>
  • --label-filter-exact (requires --label-filter; exact label match)
  • --label-base-filter <substring>
  • --label-base-filter-exact (requires --label-base-filter)
  • --outcome-filter <csv>
  • --failure-family-filter <substring>
  • --failure-family-filter-exact (requires --failure-family-filter; combined pipeline/recheck lookup+low-load families accept either canonical ...+low_load_hit_geo_regression names or legacy ...+pipeline_low_load_hit_geo_regression / ...+recheck_low_load_hit_geo_regression aliases)
  • --pipeline-early-stop-verdict-filter <substring>
  • --pipeline-early-stop-verdict-filter-exact (requires --pipeline-early-stop-verdict-filter)
  • --recheck-early-stop-verdict-filter <substring>
  • --recheck-early-stop-verdict-filter-exact (requires --recheck-early-stop-verdict-filter)
  • --pipeline-verdict-gate-runs-completed-eq <N> (rows whose pipeline run count is a known number equal to N)
  • --pipeline-verdict-gate-runs-completed-min <N> (rows whose pipeline run count is a known number >= N)
  • --pipeline-verdict-gate-runs-completed-max <N> (rows whose pipeline run count is a known number <= N)
  • --recheck-verdict-gate-runs-completed-eq <N> (rows whose recheck run count is a known number equal to N)
  • --recheck-verdict-gate-runs-completed-min <N> (rows whose recheck run count is a known number >= N)
  • --recheck-verdict-gate-runs-completed-max <N> (rows whose recheck run count is a known number <= N)
  • --only-failed
  • --only-pipeline-low-load-skipped (rows where eval-side status-board snapshots report low-load stage skipped)
  • --only-pipeline-low-load-skipped-lookup-not-keep (subset where the skip was specifically due to lookup verdict not keep)
  • --only-pipeline-verdict-gate-early-stopped (rows whose first-pass pipeline verdict metadata reports early-stop)
  • --only-recheck-verdict-gate-early-stopped (rows whose recheck pipeline verdict metadata reports early-stop)
  • --only-pipeline-verdict-gate-runs-completed-known (rows whose first-pass pipeline verdict-gate run-count metadata is numeric/present)
  • --only-recheck-verdict-gate-runs-completed-known (rows whose recheck verdict-gate run-count metadata is numeric/present)
  • --only-pipeline-early-stop-reject
  • --only-pipeline-early-stop-inconclusive
  • --only-pipeline-early-stop-unknown
  • --only-recheck-early-stop-reject
  • --only-recheck-early-stop-inconclusive
  • --only-recheck-early-stop-unknown
  • --only-pipeline-failed-verdict-gate-early-stopped (rows where eval_outcome starts with pipeline_failed and first-pass verdict-gate metadata reports early-stop)
  • --only-pipeline-failed-verdict-gate-early-stopped-inconclusive
  • --only-pipeline-failed-verdict-gate-early-stopped-reject
  • --only-pipeline-failed-verdict-gate-early-stopped-unknown
  • --only-pipeline-failed-verdict-gate-runs-completed-known (rows where eval_outcome starts with pipeline_failed and first-pass verdict-gate run-count metadata is numeric/present)
  • --only-pipeline-failed-verdict-gate-runs-completed-unknown (rows where eval_outcome starts with pipeline_failed and first-pass verdict-gate run-count metadata is missing/unknown)
  • --only-actionable-recheck-failed-verdict-gate-early-stopped (rows where eval_outcome starts with actionable_recheck_failed and recheck verdict-gate metadata reports early-stop)
  • --only-actionable-recheck-failed-verdict-gate-early-stopped-inconclusive
  • --only-actionable-recheck-failed-verdict-gate-early-stopped-reject
  • --only-actionable-recheck-failed-verdict-gate-early-stopped-unknown
  • --only-actionable-recheck-failed-verdict-gate-runs-completed-known (rows where eval_outcome starts with actionable_recheck_failed and recheck verdict-gate run-count metadata is numeric/present)
  • --only-actionable-recheck-failed-verdict-gate-runs-completed-unknown (rows where eval_outcome starts with actionable_recheck_failed and recheck verdict-gate run-count metadata is missing/unknown)
  • --only-actionable-confirmed (rows with eval_outcome=actionable and confirmed_actionable_rows > 0)
  • --only-attributed-failures (rows with non-empty derived failure_reason)
  • --latest-label (collapse to newest eval report per candidate label)
  • --latest-label-base (collapse to newest eval report per derived base label)
  • --summary (aggregate counts by eval_outcome)
  • --summary-key eval_outcome|pipeline_final_message|recheck_pipeline_final_message|failure_reason|candidate_label_base|latest_failure_family|pipeline_lookup_verdict_gate_runs_completed|pipeline_lookup_verdict_gate_early_stopped|pipeline_lookup_verdict_gate_early_stop_verdict|recheck_pipeline_lookup_verdict_gate_runs_completed|recheck_pipeline_lookup_verdict_gate_early_stopped|recheck_pipeline_lookup_verdict_gate_early_stop_verdict
  • --limit <N> (-1 keeps all rows)
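
The -eq/-min/-max run-count filters above only match rows whose run count is a known number. A minimal sketch of that rule follows, assuming missing values show up as an empty string or a non-numeric token such as unknown. The helper names are hypothetical and this is not the script's internal code.

```python
def parse_known_count(raw):
    """Return the run count as an int, or None when missing or non-numeric."""
    text = (raw or "").strip()
    return int(text) if text.isdecimal() else None


def matches_run_count(raw, eq=None, min_value=None, max_value=None):
    """Apply optional eq/min/max bounds; unknown counts never match."""
    count = parse_known_count(raw)
    if count is None:
        return False
    if eq is not None and count != eq:
        return False
    if min_value is not None and count < min_value:
        return False
    if max_value is not None and count > max_value:
        return False
    return True


# Hypothetical column values as they might appear in a report row.
for value in ["1", "3", "unknown", ""]:
    print(repr(value), matches_run_count(value, min_value=2))
```

Called without any bound, matches_run_count behaves like a "known" check: it accepts every numeric value and rejects everything else.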

To summarize u64 autotriage sidecars, use:

python3 scripts/bench_u64_autotriage_outcomes.py --limit 20

Useful filters:

  • --label-filter <substring>
  • --label-filter-exact (requires --label-filter)
  • --label-base-filter <substring>
  • --label-base-filter-exact (requires --label-base-filter)
  • --requested-label-filter <substring>
  • --requested-label-filter-exact (requires --requested-label-filter)
  • --last-eval-label-filter <substring>
  • --last-eval-label-filter-exact (requires --last-eval-label-filter)
  • --only-failed
  • --only-confirmation-multi-run (rows with confirmation_runs > 1)
  • --only-confirmation-incomplete (rows where completed_eval_runs < confirmation_runs)
  • --only-confirmation-complete (rows where completed_eval_runs >= confirmation_runs)
  • --only-confirmation-floor-applied (rows where effective confirmation_runs > confirmation_runs_requested)
  • --only-confirmation-floor-not-applied (rows where effective confirmation_runs == confirmation_runs_requested)
  • --only-streak-actionable-before-failure (rows where streak_actionable_before_failure=yes)
  • --only-first-eval-actionable (rows where first_eval_outcome=actionable)
  • --only-first-eval-lookup-not-keep (rows where first_eval_outcome contains lookup_not_keep)
  • --only-first-eval-low-load-hit-geo-regression (rows where first_eval_outcome contains low_load_hit_geo_regression)
  • --only-first-eval-failed (rows where first_eval_outcome starts with pipeline_failed or actionable_recheck_failed)
  • --only-first-eval-actionable-recheck-failed (rows where first_eval_outcome starts with actionable_recheck_failed)
  • --only-first-eval-actionable-recheck-failed-lookup-not-keep (rows where first_eval_outcome=actionable_recheck_failed_lookup_not_keep)
  • --only-first-eval-actionable-recheck-failed-low-load-hit-geo-regression (rows where first_eval_outcome=actionable_recheck_failed_low_load_hit_geo_regression)
  • --only-first-eval-actionable-recheck-failed-lookup-not-keep-and-low-load-hit-geo-regression (rows where first_eval_outcome=actionable_recheck_failed_lookup_not_keep_and_low_load_hit_geo_regression)
  • --only-first-eval-pipeline-failed-lookup-not-keep (rows where first_eval_outcome=pipeline_failed_lookup_not_keep)
  • --only-first-eval-pipeline-failed (rows where first_eval_outcome starts with pipeline_failed)
  • --only-first-eval-pipeline-failed-low-load-hit-geo-regression (rows where first_eval_outcome=pipeline_failed_low_load_hit_geo_regression)
  • --only-first-eval-pipeline-failed-lookup-not-keep-and-low-load-hit-geo-regression (rows where first_eval_outcome=pipeline_failed_lookup_not_keep_and_low_load_hit_geo_regression)
  • --only-first-eval-pipeline-verdict-gate-early-stopped (rows where first_eval_outcome starts with pipeline_failed and pipeline verdict-gate metadata reports early-stop)
  • --only-first-eval-pipeline-verdict-gate-early-stopped-inconclusive (rows where first_eval_outcome starts with pipeline_failed and pipeline verdict-gate early-stop verdict is inconclusive)
  • --only-first-eval-pipeline-verdict-gate-early-stopped-reject (rows where first_eval_outcome starts with pipeline_failed and pipeline verdict-gate early-stop verdict is reject)
  • --only-first-eval-pipeline-verdict-gate-early-stopped-unknown (rows where first_eval_outcome starts with pipeline_failed, pipeline verdict-gate metadata reports early-stop, and the early-stop verdict is missing/unknown)
  • --only-first-eval-recheck-verdict-gate-early-stopped (rows where first_eval_outcome starts with actionable_recheck_failed and recheck verdict-gate metadata reports early-stop)
  • --only-first-eval-recheck-verdict-gate-early-stopped-inconclusive (rows where first_eval_outcome starts with actionable_recheck_failed and recheck verdict-gate early-stop verdict is inconclusive)
  • --only-first-eval-recheck-verdict-gate-early-stopped-reject (rows where first_eval_outcome starts with actionable_recheck_failed and recheck verdict-gate early-stop verdict is reject)
  • --only-first-eval-recheck-verdict-gate-early-stopped-unknown (rows where first_eval_outcome starts with actionable_recheck_failed, recheck verdict-gate metadata reports early-stop, and the verdict is missing/unknown)
  • --only-first-eval-pipeline-verdict-gate-runs-completed-known (rows where first_eval_outcome starts with pipeline_failed and pipeline verdict-gate runs-completed metadata is numeric)
  • --only-first-eval-pipeline-verdict-gate-runs-completed-unknown (rows where first_eval_outcome starts with pipeline_failed and pipeline verdict-gate runs-completed metadata is unknown)
  • --only-first-eval-recheck-verdict-gate-runs-completed-known (rows where first_eval_outcome starts with actionable_recheck_failed and recheck verdict-gate runs-completed metadata is numeric)
  • --only-first-eval-recheck-verdict-gate-runs-completed-unknown (rows where first_eval_outcome starts with actionable_recheck_failed and recheck verdict-gate runs-completed metadata is unknown)
  • --only-first-eval-pipeline-verdict-gate-runs-completed-eq <N> (rows where first_eval_outcome starts with pipeline_failed and pipeline verdict-gate runs-completed equals <N>)
  • --only-first-eval-pipeline-verdict-gate-runs-completed-min <N> (rows where first_eval_outcome starts with pipeline_failed and pipeline verdict-gate runs-completed is >= <N>)
  • --only-first-eval-pipeline-verdict-gate-runs-completed-max <N> (rows where first_eval_outcome starts with pipeline_failed and pipeline verdict-gate runs-completed is <= <N>)
  • --only-first-eval-recheck-verdict-gate-runs-completed-eq <N> (rows where first_eval_outcome starts with actionable_recheck_failed and recheck verdict-gate runs-completed equals <N>)
  • --only-first-eval-recheck-verdict-gate-runs-completed-min <N> (rows where first_eval_outcome starts with actionable_recheck_failed and recheck verdict-gate runs-completed is >= <N>)
  • --only-first-eval-recheck-verdict-gate-runs-completed-max <N> (rows where first_eval_outcome starts with actionable_recheck_failed and recheck verdict-gate runs-completed is <= <N>)
  • --only-first-eval-actionable-then-failed (rows where first_eval_actionable_then_failed=yes)
  • --only-first-eval-actionable-then-failed-yes (rows where first_eval_actionable_then_failed=yes; explicit alias)
  • --only-first-eval-actionable-then-failed-known (rows where first_eval_actionable_then_failed is yes or no)
  • --only-first-eval-actionable-then-failed-unknown (rows where first_eval_actionable_then_failed=unknown)
  • --only-first-eval-actionable-then-failed-no (rows where first_eval_actionable_then_failed=no)
  • --only-first-eval-actionable-then-failed-pipeline-verdict-gate-early-stopped (rows where first_eval_actionable_then_failed=yes and pipeline verdict-gate metadata reports early-stop)
  • --only-first-eval-actionable-then-failed-pipeline-verdict-gate-early-stopped-inconclusive (rows where first_eval_actionable_then_failed=yes and pipeline verdict-gate early-stop verdict is inconclusive)
  • --only-first-eval-actionable-then-failed-pipeline-verdict-gate-early-stopped-reject (rows where first_eval_actionable_then_failed=yes and pipeline verdict-gate early-stop verdict is reject)
  • --only-first-eval-actionable-then-failed-pipeline-verdict-gate-early-stopped-unknown (rows where first_eval_actionable_then_failed=yes, pipeline verdict-gate reports early-stop, and the verdict is missing/unknown)
  • --only-first-eval-actionable-then-failed-recheck-verdict-gate-early-stopped (rows where first_eval_actionable_then_failed=yes and recheck verdict-gate metadata reports early-stop)
  • --only-first-eval-actionable-then-failed-recheck-verdict-gate-early-stopped-inconclusive (rows where first_eval_actionable_then_failed=yes and recheck verdict-gate early-stop verdict is inconclusive)
  • --only-first-eval-actionable-then-failed-recheck-verdict-gate-early-stopped-reject (rows where first_eval_actionable_then_failed=yes and recheck verdict-gate early-stop verdict is reject)
  • --only-first-eval-actionable-then-failed-recheck-verdict-gate-early-stopped-unknown (rows where first_eval_actionable_then_failed=yes, recheck verdict-gate reports early-stop, and the verdict is missing/unknown)
  • --only-first-eval-actionable-then-failed-pipeline-verdict-gate-runs-completed-known (rows where first_eval_actionable_then_failed=yes and pipeline verdict-gate runs-completed metadata is numeric)
  • --only-first-eval-actionable-then-failed-pipeline-verdict-gate-runs-completed-unknown (rows where first_eval_actionable_then_failed=yes and pipeline verdict-gate runs-completed metadata is unknown)
  • --only-first-eval-actionable-then-failed-pipeline-verdict-gate-runs-completed-eq <N> (rows where first_eval_actionable_then_failed=yes and pipeline verdict-gate runs-completed equals <N>)
  • --only-first-eval-actionable-then-failed-pipeline-verdict-gate-runs-completed-min <N> (rows where first_eval_actionable_then_failed=yes and pipeline verdict-gate runs-completed is >= <N>)
  • --only-first-eval-actionable-then-failed-pipeline-verdict-gate-runs-completed-max <N> (rows where first_eval_actionable_then_failed=yes and pipeline verdict-gate runs-completed is <= <N>)
  • --only-first-eval-actionable-then-failed-recheck-verdict-gate-runs-completed-known (rows where first_eval_actionable_then_failed=yes and recheck verdict-gate runs-completed metadata is numeric)
  • --only-first-eval-actionable-then-failed-recheck-verdict-gate-runs-completed-unknown (rows where first_eval_actionable_then_failed=yes and recheck verdict-gate runs-completed metadata is unknown)
  • --only-first-eval-actionable-then-failed-recheck-verdict-gate-runs-completed-eq <N> (rows where first_eval_actionable_then_failed=yes and recheck verdict-gate runs-completed equals <N>)
  • --only-first-eval-actionable-then-failed-recheck-verdict-gate-runs-completed-min <N> (rows where first_eval_actionable_then_failed=yes and recheck verdict-gate runs-completed is >= <N>)
  • --only-first-eval-actionable-then-failed-recheck-verdict-gate-runs-completed-max <N> (rows where first_eval_actionable_then_failed=yes and recheck verdict-gate runs-completed is <= <N>)
  • --only-streak-no-actionable-before-failure (rows where streak_actionable_before_failure=no)
  • --confirmation-state-filter <substring>
  • --confirmation-state-filter-exact (requires --confirmation-state-filter)
  • --confirmation-floor-filter <substring>
  • --confirmation-floor-filter-exact (requires --confirmation-floor-filter)
  • --first-eval-outcome-filter <substring>
  • --first-eval-outcome-filter-exact (requires --first-eval-outcome-filter)
  • --streak-actionable-before-failure-filter <substring>
  • --streak-actionable-before-failure-filter-exact (requires --streak-actionable-before-failure-filter)
  • --first-eval-actionable-then-failed-filter <substring>
  • --first-eval-actionable-then-failed-filter-exact (requires --first-eval-actionable-then-failed-filter)
  • --only-actionable-confirmed (rows with latest_eval_outcome=actionable and latest_confirmed_actionable_rows > 0)
  • --failure-family-filter <substring>
  • --failure-family-filter-exact (requires --failure-family-filter)
  • --pipeline-early-stop-verdict-filter <substring>
  • --pipeline-early-stop-verdict-filter-exact (requires --pipeline-early-stop-verdict-filter)
  • --recheck-early-stop-verdict-filter <substring>
  • --recheck-early-stop-verdict-filter-exact (requires --recheck-early-stop-verdict-filter)
  • --pipeline-verdict-gate-runs-completed-eq <N>
  • --pipeline-verdict-gate-runs-completed-min <N>
  • --pipeline-verdict-gate-runs-completed-max <N>
  • --recheck-verdict-gate-runs-completed-eq <N>
  • --recheck-verdict-gate-runs-completed-min <N>
  • --recheck-verdict-gate-runs-completed-max <N>
  • --only-pipeline-verdict-gate-early-stopped
  • --only-recheck-verdict-gate-early-stopped
  • --only-pipeline-verdict-gate-runs-completed-known
  • --only-recheck-verdict-gate-runs-completed-known
  • --only-pipeline-early-stop-reject
  • --only-pipeline-early-stop-inconclusive
  • --only-pipeline-early-stop-unknown
  • --only-recheck-early-stop-reject
  • --only-recheck-early-stop-inconclusive
  • --only-recheck-early-stop-unknown
  • --only-latest-failure-family-pipeline-lookup-not-keep (rows with pipeline lookup-not-keep latest failure families, including combined pipeline lookup+low-load entries)
  • --only-latest-failure-family-pipeline-low-load-hit-geo-regression (rows with pipeline low-load hit-geo latest failure families, including combined pipeline lookup+low-load entries)
  • --only-latest-failure-family-pipeline-lookup-not-keep-and-low-load-hit-geo-regression (rows with combined pipeline lookup-not-keep + low-load-hit-geo latest failure families)
  • --only-latest-failure-family-recheck-lookup-not-keep (rows with recheck lookup-not-keep latest failure families, including combined recheck lookup+low-load entries)
  • --only-latest-failure-family-recheck-low-load-hit-geo-regression (rows with recheck low-load hit-geo latest failure families, including combined recheck lookup+low-load entries)
  • --only-latest-failure-family-recheck-lookup-not-keep-and-low-load-hit-geo-regression (rows with combined recheck lookup-not-keep + low-load-hit-geo latest failure families)
  • --latest-label
  • --latest-label-base
  • --summary
  • --summary-key latest_eval_outcome|latest_failure_reason|latest_failure_family|candidate_label_base|eval_exit_status|confirmation_runs|confirmation_runs_requested|min_confirmation_runs_on_success|confirmation_floor_applied|completed_eval_runs|requested_candidate_label|confirmation_state|last_eval_label|first_eval_outcome|streak_actionable_before_failure|run_outcome_sequence|first_eval_actionable_then_failed|pipeline_lookup_verdict_gate_runs_completed|pipeline_lookup_verdict_gate_early_stopped|pipeline_lookup_verdict_gate_early_stop_verdict|recheck_pipeline_lookup_verdict_gate_runs_completed|recheck_pipeline_lookup_verdict_gate_early_stopped|recheck_pipeline_lookup_verdict_gate_early_stop_verdict
  • --limit <N> (-1 keeps all rows)
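
The confirmation fields referenced by the filters above relate to each other roughly as sketched below. This is derived from the field descriptions in this README, not from the scripts themselves: it assumes the effective target is the larger of the requested runs and the minimum-success floor, and both the function name and the complete key are hypothetical.

```python
def confirmation_fields(requested, min_on_success, completed):
    """Derive confirmation metadata from the requested runs and the floor."""
    # Assumption: the floor simply raises the target when it exceeds the request.
    effective = max(requested, min_on_success)
    return {
        "confirmation_runs": effective,
        "confirmation_runs_requested": requested,
        "min_confirmation_runs_on_success": min_on_success,
        "confirmation_floor_applied": "yes" if effective > requested else "no",
        "completed_eval_runs": completed,
        # Hypothetical helper key mirroring --only-confirmation-complete.
        "complete": completed >= effective,
    }


# One run requested, floor of two, one run finished so far.
print(confirmation_fields(requested=1, min_on_success=2, completed=1))
```

Under these assumptions, a row with confirmation_floor_applied=yes always has confirmation_runs greater than confirmation_runs_requested, which matches what --only-confirmation-floor-applied selects.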

Per-row output also includes a derived first_eval_actionable_then_failed column (yes|no|unknown), so downstream tools can filter or aggregate on this instability signal without recomputing it. Autotriage sidecars also propagate the latest eval verdict-gate metadata columns (pipeline_lookup_verdict_gate_* and recheck_pipeline_lookup_verdict_gate_*), so reject and inconclusive early-stop exits can be told apart directly from autotriage reports.

To run a focused subset from the helper script (single op / impl / case), pass the same selectors supported by the bench binary:

HARRIER_BENCH_ONLY_OP=insert_new \
HARRIER_BENCH_ONLY_IMPL=harrier \
HARRIER_BENCH_ONLY_N=786432 \
HARRIER_BENCH_ONLY_LOAD=0.75 \
HARRIER_BENCH_ISOLATE_CASES=1 \
HARRIER_BENCH_ISOLATE_OPS=1 \
HARRIER_BENCH_ISOLATE_IMPLS=1 \
bash scripts/bench_iter.sh

HARRIER_BENCH_ONLY_OP also accepts a comma-separated list (for example: find_hit,find_hit_prehashed,find_miss,find_miss_prehashed) when you want to capture related operations in a single timestamped result file. With HARRIER_BENCH_ISOLATE_OPS=1, comma-separated op lists are split and each op is executed in its own isolated process.

HARRIER_BENCH_ONLY_IMPL also accepts comma-separated implementations (for example: harrier,hashbrown). With HARRIER_BENCH_ISOLATE_IMPLS=1, each implementation is run in its own isolated process.
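
One way to picture the isolation flags: each comma-separated list is split, and every resulting combination gets its own process. The sketch below only enumerates the environment each process might receive. It assumes the helper script takes the cross product of ops and implementations when both isolation flags are set, which this README does not spell out, and plan_isolated_runs is a hypothetical name.

```python
from itertools import product


def plan_isolated_runs(only_op, only_impl, isolate_ops=True, isolate_impls=True):
    """List the per-process env overrides implied by the isolation flags."""
    ops = [op.strip() for op in only_op.split(",") if op.strip()]
    impls = [impl.strip() for impl in only_impl.split(",") if impl.strip()]
    # Without isolation the whole comma-separated list stays in one process.
    op_groups = ops if isolate_ops else [",".join(ops)]
    impl_groups = impls if isolate_impls else [",".join(impls)]
    return [
        {"HARRIER_BENCH_ONLY_OP": op, "HARRIER_BENCH_ONLY_IMPL": impl}
        for op, impl in product(op_groups, impl_groups)
    ]


for env in plan_isolated_runs("find_hit,find_miss", "harrier,hashbrown"):
    print(env)
```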

You can also run a single implementation directly from the bench binary:

HARRIER_BENCH_ONLY_IMPL=harrier_u64 cargo run --release --bin bench