Harrier is a high-performance Rust hashmap project focused on:
- SIMD-accelerated control-byte probing
- two-choice cuckoo placement
- fast bounded insertion search (BFS displacement)
- robust rare-path fallback for pathological collision patterns
Current implementations:
- HarrierMap<K, V>: generic map with indirect SIMD control-byte probing.
- HarrierU64Map<V>: specialized integer-key map under active optimization.
This repository is currently in active performance iteration.
- v0.1 goal: minimal but usable map/set API with strong correctness.
- v0.2 goal: extensive tuning and adaptive policies for state-of-the-art benchmark results.
HarrierMap uses:
- indirect SIMD control-byte probing (16-byte groups)
- 2-choice cuckoo placement
- BFS relocation on insert pressure
- rare-path overflow stash fallback
- Optional reseed hook is available via ReseedableBuildHasher for collision recovery paths.
- Lookup fast path includes deletion-aware early miss short-circuiting.
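
To make the control-byte idea concrete, here is a minimal scalar sketch of probing one 16-byte group. It is not Harrier's implementation: the EMPTY/DELETED encodings, the 7-bit tag derivation, and all function names are assumptions borrowed from common SwissTable-style designs, and a real SIMD path would compare all 16 bytes in a single instruction. The early-miss rule shown is one plausible reading of "deletion-aware" and assumes inserts prefer a group that still has a free slot.

```rust
// Hypothetical control-byte encodings (assumptions, not Harrier's actual values).
const EMPTY: u8 = 0xFF;
const DELETED: u8 = 0x80;

/// Derive a 7-bit tag from the top bits of a 64-bit hash (assumed scheme).
/// The result is always in 0..=0x7F, so it never equals EMPTY or DELETED.
fn tag_of(hash: u64) -> u8 {
    (hash >> 57) as u8
}

/// Scalar stand-in for a SIMD compare: bit i is set when control byte i equals `tag`.
fn match_tag(group: &[u8; 16], tag: u8) -> u16 {
    let mut mask = 0u16;
    for (i, &ctrl) in group.iter().enumerate() {
        if ctrl == tag {
            mask |= 1 << i;
        }
    }
    mask
}

/// Assumed early-miss rule: a never-used (EMPTY) slot in the group lets a miss
/// stop early, while DELETED tombstones give no such guarantee.
fn can_stop_on_miss(group: &[u8; 16]) -> bool {
    group.iter().any(|&c| c == EMPTY)
}

fn main() {
    let mut group = [EMPTY; 16];
    let tag = tag_of(0xFEDC_BA98_7654_3210);
    group[3] = tag;
    group[9] = tag;
    group[5] = DELETED;
    println!("tag = {:#04x}, match mask = {:#018b}", tag, match_tag(&group, tag));
    println!("early miss allowed: {}", can_stop_on_miss(&group));
}
```

Each set bit in the mask is only a candidate: the key stored in that slot still has to be compared, which is what the key_comparisons counter in the lookup diagnostics tracks.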
use harrier::HarrierMap;
let mut map = HarrierMap::new();
map.insert("k", 42);
assert_eq!(map.get(&"k"), Some(&42));
*map.get_or_insert("count", 0) += 1;

For deterministic u64 hashing (e.g. reproducible tests/benchmarks), use the
seeded constructor on HarrierU64Map:
# use harrier::HarrierU64Map;
let mut map = HarrierU64Map::with_capacity_and_seed(1024, 0x1234_5678_9abc_def0);
map.insert(1, 10);
assert_eq!(map.get(1), Some(&10));

For build-mostly workloads with guaranteed-unique keys, advanced users can use the unsafe fast path:
# use harrier::HarrierMap;
# let mut map = HarrierMap::with_capacity(128);
for i in 1..=100u64 {
// SAFETY: keys are unique in this loop.
unsafe { map.insert_unique_unchecked(i, i * 2); }
}

If capacity is already reserved and you want to skip per-insert growth checks,
use insert_unique_unchecked_no_grow:
# use harrier::HarrierMap;
# let mut map = HarrierMap::with_capacity(256);
for i in 1..=200u64 {
// SAFETY: keys are unique and capacity is pre-reserved.
unsafe { map.insert_unique_unchecked_no_grow(i, i * 2); }
}

cargo test
cargo run --release --bin bench

Optionally tune iterations:
HARRIER_BENCH_ITERS=1000000 cargo run --release --bin bench

You can also tune sample count for median-based reporting:
HARRIER_BENCH_ITERS=200000 HARRIER_BENCH_RUNS=3 cargo run --release --bin bench

To take the median across repeated benchmark processes (useful for taming process-level jitter), use the helper script knob:
HARRIER_BENCH_PROCESS_REPEATS=3 bash scripts/bench_iter.sh

When process repeats are enabled, the script prints a per-row repeat-stability table (CV/min/max/median across repeats) before writing the median-aggregated TSV. You can tune the table length with:
HARRIER_BENCH_PROCESS_REPEATS=3 HARRIER_BENCH_PROCESS_REPEAT_STABILITY_LIMIT=20 bash scripts/bench_iter.sh

If lookup-path diagnostics are enabled (HARRIER_BENCH_LOOKUP_STATS=1),
process-repeat runs also print a Lookup path stability table from stderr logs
showing median/CV for key path-mix rates:
HARRIER_BENCH_PROCESS_REPEATS=3 \
HARRIER_BENCH_LOOKUP_STATS=1 \
HARRIER_BENCH_PROCESS_REPEAT_LOOKUP_STABILITY_LIMIT=20 \
bash scripts/bench_iter.sh

If insert phase timing is enabled, process-repeat runs also print an
Insert phase stability table based on stderr phase logs:
HARRIER_BENCH_PROCESS_REPEATS=3 \
HARRIER_BENCH_INSERT_PHASE_TIMING=1 \
HARRIER_BENCH_PROCESS_REPEAT_PHASE_STABILITY_LIMIT=20 \
bash scripts/bench_iter.sh

If insert checkpoint timing is also enabled, process-repeat runs print an
Insert checkpoint stability table summarizing tail_head_ratio and
per-segment CVs across repeats:
HARRIER_BENCH_PROCESS_REPEATS=3 \
HARRIER_BENCH_INSERT_PHASE_TIMING=1 \
HARRIER_BENCH_INSERT_PHASE_CHECKPOINTS=3 \
HARRIER_BENCH_PROCESS_REPEAT_CHECKPOINT_STABILITY_LIMIT=20 \
bash scripts/bench_iter.sh

It also prints a repeat-synchrony table to help identify whether outlier
repeats are shared across implementations (environmental) or impl-local.
Default outlier mode is robust MAD (HARRIER_BENCH_PROCESS_REPEAT_EVENT_MODE=mad).
Tune synchrony output with:
HARRIER_BENCH_PROCESS_REPEATS=3 \
HARRIER_BENCH_PROCESS_REPEAT_SYNC_LIMIT=20 \
HARRIER_BENCH_PROCESS_REPEAT_EVENT_MODE=mad \
HARRIER_BENCH_PROCESS_REPEAT_MAD_Z=3.5 \
HARRIER_BENCH_PROCESS_REPEAT_SPIKE_RATIO=1.15 \
bash scripts/bench_iter.sh

Supported event modes:
- mad (default): two-sided robust outlier detection around median.
- ratio: high-side threshold using PROCESS_REPEAT_SPIKE_RATIO.
- quantile: two-sided tail detection using PROCESS_REPEAT_QUANTILE_CUTOFF.
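
As an illustration of the mad and ratio modes, the sketch below flags outlier repeats in a series of per-repeat timings. The exact formulas used by scripts/bench_iter.sh are not documented here, so this assumes the widely used modified z-score (0.6745 * |x - median| / MAD) for mad mode and a plain `x > ratio * median` test for ratio mode; function names are hypothetical.

```rust
/// Median of a non-empty slice (copies and sorts; averages the middle pair for even lengths).
fn median(values: &[f64]) -> f64 {
    assert!(!values.is_empty(), "median of empty slice");
    let mut v = values.to_vec();
    v.sort_by(|a, b| a.partial_cmp(b).expect("NaN in samples"));
    let mid = v.len() / 2;
    if v.len() % 2 == 0 {
        (v[mid - 1] + v[mid]) / 2.0
    } else {
        v[mid]
    }
}

/// Two-sided MAD events: indices whose modified z-score exceeds `z_cut` (assumed formula).
fn mad_events(samples: &[f64], z_cut: f64) -> Vec<usize> {
    let med = median(samples);
    let deviations: Vec<f64> = samples.iter().map(|x| (x - med).abs()).collect();
    let mad = median(&deviations);
    if mad == 0.0 {
        return Vec::new(); // degenerate spread: this sketch reports no events
    }
    samples
        .iter()
        .enumerate()
        .filter(|(_, x)| 0.6745 * (**x - med).abs() / mad > z_cut)
        .map(|(i, _)| i)
        .collect()
}

/// High-side ratio events: indices whose sample exceeds `ratio * median` (assumed formula).
fn ratio_events(samples: &[f64], ratio: f64) -> Vec<usize> {
    let med = median(samples);
    samples
        .iter()
        .enumerate()
        .filter(|(_, x)| **x > ratio * med)
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    // ns/op for five process repeats; the last repeat spikes.
    let repeats = [100.0, 101.0, 99.0, 100.0, 130.0];
    println!("mad events:   {:?}", mad_events(&repeats, 3.5));
    println!("ratio events: {:?}", ratio_events(&repeats, 1.15));
}
```

Because mad mode is two-sided it would also flag an unusually fast repeat, whereas ratio mode only reacts to slow spikes.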
When multiple comparable process-repeat runs exist, the script also prints a cross-run persistence summary of shared event indices (which repeat positions keep co-spiking across runs):
HARRIER_BENCH_PROCESS_REPEATS=5 \
HARRIER_BENCH_PROCESS_REPEAT_PERSIST_LIMIT=20 \
bash scripts/bench_iter.sh

If phase timing is enabled, it also prints a cross-run phase-alignment
persistence table that joins repeat JSON + stderr phase logs and reports how
often shared total-cost events align with setup (clear+pretouch) vs measured
insert events. When checkpoint timing is enabled, the same table also reports
tail_head_ratio shared-rate/overlap columns and:
- alignment_class (setup_dominant, measured_uniform, measured_tail_skew, tail_skew_only, mixed, none)
- measured_outlier_shape (setup_dominant, whole_loop_scaling, segment_skew, tail_skew_without_measured_alignment, measured_inconclusive, none)

to quickly distinguish segment-skew outliers from whole-loop scaling.
When insert rows are present, the same report now appends
# insert_phase_verdict comment lines with a coarse insert algorithm gate:
blocked_noisy_shared_signal, blocked_environmental_scaling,
allow_segment_targeting, or inconclusive_alignment.
HARRIER_BENCH_PROCESS_REPEATS=5 \
HARRIER_BENCH_INSERT_PHASE_TIMING=1 \
HARRIER_BENCH_PROCESS_REPEAT_PHASE_PERSIST_LIMIT=20 \
bash scripts/bench_iter.sh

To keep each raw repeat TSV (instead of deleting temporary repeat files), add:
HARRIER_BENCH_PROCESS_REPEATS=3 HARRIER_BENCH_KEEP_REPEAT_FILES=1 bash scripts/bench_iter.sh

By default, process-repeat runs also write a consolidated JSON sidecar
(<stamp>-<mode>.repeats.json) containing per-row repeat series and summary
stats. Disable it with:
HARRIER_BENCH_PROCESS_REPEATS=3 HARRIER_BENCH_REPEAT_JSON=0 bash scripts/bench_iter.sh

Optional warmup loop control:
HARRIER_BENCH_WARMUP_ITERS=50000 cargo run --release --bin bench

Optional timed-insert warmup runs (helps reduce first-allocation noise in
insert_new / insert_update benchmarks):
HARRIER_BENCH_INSERT_WARMUP_RUNS=1 cargo run --release --bin bench

Optional per-case internal repeats for insert benchmarks (reduced to their median before each reported sample):
HARRIER_BENCH_INSERT_CASE_REPEATS=3 cargo run --release --bin bench

Optional map reuse mode for insert benchmarks (reuses one pre-allocated map per measured run and clears between case repeats):
HARRIER_BENCH_INSERT_REUSE_MAP=1 cargo run --release --bin bench

Optional insert pre-touch mode (runs an untimed fill+clear before each measured
insert_new sample to reduce allocator/page-fault spikes):
HARRIER_BENCH_INSERT_PRETOUCH=1 cargo run --release --bin bench

Optional insert-path diagnostics (prints BFS displacement counters to stderr for
harrier/harrier_u64 insert_new benchmarks):
HARRIER_BENCH_INSERT_STATS=1 cargo run --release --bin bench

Optional lookup-path diagnostics (prints primary/secondary probe and key-compare
counters to stderr for find_hit/find_miss on harrier and harrier_u64):
HARRIER_BENCH_LOOKUP_STATS=1 cargo run --release --bin bench

Lookup diagnostic fields:
- primary_group_probes / secondary_group_probes: main and alternate-group probe counts
- tag_matches: total control-tag candidate matches across probed groups
- key_comparisons: total key equality checks performed for matched tags
- secondary_hits: successful lookups resolved in alternate groups
- overflow_lookups: fallback overflow scans performed after main-table miss
- overflow_matches: successful lookups resolved from overflow fallback
- normalized rates: secondary_probe_rate, tag_matches_per_lookup, key_comparisons_per_lookup, key_comparisons_per_primary, secondary_hit_rate, overflow_lookup_rate, overflow_match_rate.
When running through scripts/bench_iter.sh, lookup diagnostics also power:
- Lookup path stability (repeat-window median/CV over lookup path rates)
- Lookup stats drift vs previous run (metric deltas vs previous comparable run)
- Lookup drift signal persistence (rolling counts of stable speedups/regressions across recent comparable lookup-drift sidecars)

They are produced using scripts/bench_lookup_path_stability.py and scripts/bench_lookup_diff.py (includes ns_delta_pct and a path_mix_stable flag when lookup-path rates are unchanged, plus ns_trend and stable_speedup / stable_regression / signal_class classifications. It also emits relative-vs-hashbrown drift fields (old_rel_vs_hashbrown, new_rel_vs_hashbrown, rel_delta_pct, rel_trend, stable_relative_speedup, stable_relative_regression, rel_signal_class) plus a short summary line counting absolute and relative stable speedups/regressions). Path-mix deltas now include overflow fallback rates (overflow_lookup_rate_delta, overflow_match_rate_delta) so overflow-driven lookup regressions are explicitly classified as path shifts.

When previous comparable lookup diagnostics are available, bench_iter.sh also persists this drift table as a sidecar:
- <stamp>-<mode>.lookup_diff.tsv

Tune drift stability sensitivity with:
- HARRIER_BENCH_LOOKUP_PATH_EPS (default 1e-9).

Tune ns/op trend deadzone with:
- HARRIER_BENCH_LOOKUP_NS_EPS_PCT (default 0.0).

Tune persistence table length with:
- HARRIER_BENCH_LOOKUP_SIGNAL_PERSIST_LIMIT (default 12).

The persistence table (scripts/bench_lookup_signal_persistence.py) now reports both absolute and relative-vs-hashbrown stable signal counts (net_stable_score and net_relative_score), and sorts by relative regression persistence first. It also appends # op_relative_summary comment lines that aggregate relative drift persistence by operation (useful for tracking find_hit keep/revert pressure across windows).

When at least three lookup drift sidecars are available, bench_iter.sh also prints a Lookup relative verdict table (via scripts/bench_lookup_relative_verdict.py) that summarizes per-impl and combined Harrier relative signal counts for one operation and emits a keep / reject / inconclusive verdict. Tune that verdict with:
- HARRIER_BENCH_LOOKUP_VERDICT_OP (default find_hit)
- HARRIER_BENCH_LOOKUP_VERDICT_MIN_RUNS (default 3)
- HARRIER_BENCH_LOOKUP_VERDICT_MIN_NET (default 1)
- HARRIER_BENCH_LOOKUP_VERDICT_MAX_REGRESSIONS (default 1)
- HARRIER_BENCH_LOOKUP_VERDICT_MAX_PATH_SHIFT (default 1).
Optional contains-operation benchmarks (contains_hit / contains_miss) can be
enabled for lookup-path analysis without value-load costs:
HARRIER_BENCH_INCLUDE_CONTAINS=1 cargo run --release --bin bench

Hit-style operations (find_hit, contains_hit, and find_hit_prehashed)
cycle through keys 1..=n, so they remain true-hit workloads even when n is
not a power of two.
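
The key cycling described above can be sketched with a modulo-based mapping. A mask-based mapping such as `i & (n - 1)` only covers the whole key range when n is a power of two, which is why a modulo is shown here. The helper name is hypothetical and the benchmark's real code may differ.

```rust
/// Map a loop counter onto the inserted key range 1..=n (hypothetical helper).
fn hit_key(i: u64, n: u64) -> u64 {
    assert!(n > 0, "n must be non-zero");
    1 + (i % n)
}

fn main() {
    let n = 6; // deliberately not a power of two
    let keys: Vec<u64> = (0..8).map(|i| hit_key(i, n)).collect();
    // Every generated key lies in 1..=6, so every lookup is a true hit.
    println!("{:?}", keys); // prints [1, 2, 3, 4, 5, 6, 1, 2]
}
```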
Optional prehashed lookup benchmarks (find_hit_prehashed /
find_miss_prehashed) can be enabled to run lookups with caller-supplied
precomputed hashes (for both Harrier and hashbrown):
HARRIER_BENCH_INCLUDE_PREHASHED=1 cargo run --release --bin bench

When run via scripts/bench_iter.sh, prehashed mode also prints a
find_* vs find_*_prehashed delta table (via
scripts/bench_lookup_decompose.py) so you can quickly inspect how each
implementation changes under prehashed lookup mode.
Insert diagnostic fields:
- direct_primary_inserts: inserts placed directly in primary group (g0)
- direct_secondary_inserts: inserts placed directly in alternate group (g1)
- direct_primary_rate / direct_secondary_rate: normalized direct-placement ratios
- one_step_searches: insertions that attempted the one-step displacement fast path
- one_step_slots_scanned: candidate slots inspected while searching one-step moves
- one_step_duplicate_child_groups: duplicate one-step child alternates skipped within a root-group scan
- one_step_hits: insertions resolved directly via one-step displacement
- one_step_slots_per_search and one_step_hit_rate: normalized one-step fast-path efficiency ratios
- bfs_searches: number of insertions that entered the BFS displacement path
- bfs_groups_scanned: total BFS parent groups expanded
- bfs_duplicate_child_groups: duplicate child alternates skipped per parent expansion
- bfs_displacements: number of moved entries along discovered displacement paths
- bfs_rate: fraction of inserts that entered BFS search
- groups_per_search and displacements_per_insert: normalized ratios helpful for correlating runtime outliers with algorithmic behavior.
- overflow_stashes / overflow_rate: rare fallback count and fraction.
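
For readers post-processing the stderr counters, the normalized ratios can be recomputed along these lines. This is a sketch under assumptions: the struct, the `inserts` denominator, and the exact definitions of groups_per_search and displacements_per_insert are inferred from the field names above rather than taken from the benchmark source.

```rust
/// Raw counters, named after the diagnostic fields above (struct itself is hypothetical).
struct InsertStats {
    inserts: u64, // total measured inserts (assumed denominator)
    bfs_searches: u64,
    bfs_groups_scanned: u64,
    bfs_displacements: u64,
    overflow_stashes: u64,
}

/// Safe division that reports 0.0 instead of NaN when nothing was counted.
fn ratio(num: u64, den: u64) -> f64 {
    if den == 0 {
        0.0
    } else {
        num as f64 / den as f64
    }
}

impl InsertStats {
    /// Fraction of inserts that entered BFS search.
    fn bfs_rate(&self) -> f64 {
        ratio(self.bfs_searches, self.inserts)
    }
    /// Average parent groups expanded per BFS search (assumed definition).
    fn groups_per_search(&self) -> f64 {
        ratio(self.bfs_groups_scanned, self.bfs_searches)
    }
    /// Average moved entries per insert (assumed definition).
    fn displacements_per_insert(&self) -> f64 {
        ratio(self.bfs_displacements, self.inserts)
    }
    /// Fraction of inserts that fell back to the overflow stash.
    fn overflow_rate(&self) -> f64 {
        ratio(self.overflow_stashes, self.inserts)
    }
}

fn main() {
    // Made-up counter values purely for illustration.
    let s = InsertStats {
        inserts: 1000,
        bfs_searches: 50,
        bfs_groups_scanned: 200,
        bfs_displacements: 75,
        overflow_stashes: 2,
    };
    println!(
        "bfs_rate={:.3} groups_per_search={:.1} displacements_per_insert={:.3} overflow_rate={:.3}",
        s.bfs_rate(),
        s.groups_per_search(),
        s.displacements_per_insert(),
        s.overflow_rate()
    );
}
```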
Optional insert phase timing diagnostics (prints clear / pretouch-fill /
timed-insert phase costs to stderr for all insert_new implementations):
HARRIER_BENCH_INSERT_PHASE_TIMING=1 cargo run --release --bin bench

Optional finer measured-insert checkpoint timing (split measured insert loop
into checkpoints+1 timed segments; emits insert_phase_ckpt diagnostics):
HARRIER_BENCH_INSERT_PHASE_TIMING=1 \
HARRIER_BENCH_INSERT_PHASE_CHECKPOINTS=3 \
cargo run --release --bin bench

insert_phase_ckpt output includes per-segment segN_ns_per_insert plus
head_ns_per_insert, tail_ns_per_insert, and tail_head_ratio to help
localize whether late-loop segments dominate unstable runs.
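
A rough sketch of how checkpointed timing can be structured is shown below. It uses std's HashMap as a stand-in for the benchmarked map and assumes that head and tail mean the first and last segment, with tail_head_ratio = tail / head; the real benchmark may define these differently.

```rust
use std::collections::HashMap;
use std::time::Instant;

/// Time `n` inserts split into `checkpoints + 1` segments and return ns/insert per segment.
fn timed_segments(n: u64, checkpoints: usize) -> Vec<f64> {
    let segments = checkpoints as u64 + 1;
    let mut map = HashMap::new(); // stand-in for the map under test
    let mut out = Vec::with_capacity(segments as usize);
    for s in 0..segments {
        // Segment boundaries spread any remainder across segments.
        let start = n * s / segments;
        let end = n * (s + 1) / segments;
        let t0 = Instant::now();
        for k in start..end {
            map.insert(k, k);
        }
        let ns = t0.elapsed().as_nanos() as f64;
        out.push(if end > start { ns / (end - start) as f64 } else { 0.0 });
    }
    out
}

/// Assumed definition: last-segment cost divided by first-segment cost.
fn tail_head_ratio(seg_ns: &[f64]) -> Option<f64> {
    let head = *seg_ns.first()?;
    let tail = *seg_ns.last()?;
    if head > 0.0 {
        Some(tail / head)
    } else {
        None
    }
}

fn main() {
    let segs = timed_segments(200_000, 3);
    for (i, ns) in segs.iter().enumerate() {
        println!("seg{}_ns_per_insert={:.2}", i, ns);
    }
    println!("tail_head_ratio={:?}", tail_head_ratio(&segs));
}
```

A ratio well above 1.0 in only some repeats would point at late-loop effects (for example growth or page faults) rather than uniform slowdown.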
When running through scripts/bench_iter.sh, stderr is persisted to
benchmarks/results/<stamp>-<mode>.stderr.log.
Optional deterministic generic-hasher mode (reduces run-to-run seed variance
for HarrierMap and hashbrown comparisons):
HARRIER_BENCH_FIXED_HASHER=1 cargo run --release --bin bench

Use the helper script to capture timestamped benchmark files and diff against the previous run:
bash scripts/bench_iter.sh

bench_iter.sh takes an exclusive lock in benchmarks/results/.bench_iter.lock
when flock is available, so concurrent helper invocations are serialized
instead of contaminating each other’s measurements.
It also sets PYTHONDONTWRITEBYTECODE=1 for helper Python scripts to avoid
generating transient __pycache__ files during benchmark runs.
This writes results to benchmarks/results/*.tsv and prints per-case percent
delta vs the previous run in the same hasher mode (-default.tsv vs
-fixed.tsv). It also prints a rolling median summary and instability table
over the most recent 5 runs with matching metadata (falls back to same-mode
runs when no matching metadata is found). Each run writes a sidecar *.meta.json
capturing benchmark knob settings (plus a benchmark schema version) for traceability. When metadata is present,
the script prefers comparing against the most recent prior run with
matching benchmark knobs. Matching covers insert-specific knobs such as
HARRIER_BENCH_INSERT_CASE_REPEATS, reuse-map, pre-touch settings, insert
stats mode, insert phase timing mode, and process-repeats mode (including repeat
stability/synchrony knobs, event mode/threshold knobs, repeat persistence knob,
repeat JSON export mode, repeat phase-stability/persistence limit knobs,
repeat lookup-stability limit knob, repeat checkpoint-stability limit knob, repeat insert checkpoint timing knob,
lookup-stats + lookup-path-eps + lookup-ns-eps-pct + lookup-signal-persist-limit + include-contains + include-prehashed knobs, and repeat-file retention mode), as well as
targeted filters like HARRIER_BENCH_ONLY_OP / HARRIER_BENCH_ONLY_IMPL.
If you explicitly want legacy fallback behavior (compare to previous
timestamp when no metadata match exists), set
HARRIER_BENCH_ALLOW_FALLBACK_PREV=1.
For less cross-case interference while benchmarking, you can isolate each
(n, load) case into a separate process:
HARRIER_BENCH_ISOLATE_CASES=1 bash scripts/bench_iter.sh

For maximum isolation (each operation in a separate process per (n, load)),
enable op isolation too:
HARRIER_BENCH_ISOLATE_CASES=1 HARRIER_BENCH_ISOLATE_OPS=1 bash scripts/bench_iter.sh

To isolate each implementation (harrier, harrier_u64, hashbrown) into its
own process as well, add:
HARRIER_BENCH_ISOLATE_CASES=1 HARRIER_BENCH_ISOLATE_OPS=1 HARRIER_BENCH_ISOLATE_IMPLS=1 bash scripts/bench_iter.sh

To reduce thermal/time-order bias in isolated mode, shuffle (n, load) case
order:
HARRIER_BENCH_ISOLATE_CASES=1 HARRIER_BENCH_SHUFFLE_CASES=1 bash scripts/bench_iter.sh

For reproducible shuffled order across runs, set a shuffle seed:
HARRIER_BENCH_ISOLATE_CASES=1 HARRIER_BENCH_SHUFFLE_CASES=1 HARRIER_BENCH_SHUFFLE_SEED=123 bash scripts/bench_iter.sh

To reduce scheduler noise further, optionally pin benchmark processes to one CPU and/or adjust priority:
HARRIER_BENCH_TASKSET_CPU=2 HARRIER_BENCH_NICE=-5 bash scripts/bench_iter.sh

For targeted insert stability studies, CPU pinning is especially useful (in our
isolated insert_new runs it reduced CV from double-digit percentages to
sub-1% in repeated samples).
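
For reference, the CV/min/max/median figures quoted for repeat stability can be computed as in the sketch below. It assumes CV means sample standard deviation (n - 1 denominator) divided by the mean, expressed as a percentage; the helper scripts may use a different convention, and the type and function names are hypothetical.

```rust
/// Summary statistics for one benchmark row across repeats (hypothetical type).
#[derive(Debug)]
struct RepeatSummary {
    min: f64,
    max: f64,
    median: f64,
    cv_pct: f64,
}

fn summarize(samples: &[f64]) -> Option<RepeatSummary> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).expect("NaN in samples"));
    let n = sorted.len();
    let median = if n % 2 == 0 {
        (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0
    } else {
        sorted[n / 2]
    };
    let mean = sorted.iter().sum::<f64>() / n as f64;
    // Sample standard deviation (n - 1 denominator) is an assumption of this sketch.
    let cv_pct = if n < 2 || mean == 0.0 {
        0.0
    } else {
        let var = sorted.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1) as f64;
        var.sqrt() / mean * 100.0
    };
    Some(RepeatSummary { min: sorted[0], max: sorted[n - 1], median, cv_pct })
}

fn main() {
    // Illustrative ns/op values for three repeats, not measured data.
    let ns_per_op = [9.0, 10.0, 11.0];
    let s = summarize(&ns_per_op).expect("non-empty samples");
    println!("{:?}", s);
}
```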
For a standardized pinned lookup diagnostics gate (fixed hasher, process repeats, lookup stats + contains + prehashed ops on the large-case tuple), use:
bash scripts/bench_lookup_gate.sh

All knobs in this helper remain overridable via environment variables.
By default it also sets HARRIER_BENCH_LOOKUP_NS_EPS_PCT=0.2 so drift
classification treats sub-0.2% ns/op movement as flat, and
HARRIER_BENCH_LOOKUP_PATH_EPS=0.002 so tiny path-rate jitter is classified
as path-mix stable. It also sets default relative verdict knobs
(HARRIER_BENCH_LOOKUP_VERDICT_OP=find_hit,
HARRIER_BENCH_LOOKUP_VERDICT_MIN_RUNS=3,
HARRIER_BENCH_LOOKUP_VERDICT_MIN_NET=1,
HARRIER_BENCH_LOOKUP_VERDICT_MAX_REGRESSIONS=1,
HARRIER_BENCH_LOOKUP_VERDICT_MAX_PATH_SHIFT=1).
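
The two epsilon knobs act as deadzones. The sketch below shows one way such a classification could work; the enum variants, function names, and the inclusive `<=` comparison are assumptions for illustration and are not taken from scripts/bench_lookup_diff.py.

```rust
/// Coarse ns/op trend labels (hypothetical names).
#[derive(Debug, PartialEq)]
enum Trend {
    Faster,
    Slower,
    Flat,
}

/// Percent change in ns/op from the previous run to the new run.
fn ns_delta_pct(old_ns: f64, new_ns: f64) -> f64 {
    (new_ns - old_ns) / old_ns * 100.0
}

/// Movement within `eps_pct` percent is treated as flat (ns/op deadzone).
fn ns_trend(old_ns: f64, new_ns: f64, eps_pct: f64) -> Trend {
    let delta = ns_delta_pct(old_ns, new_ns);
    if delta.abs() <= eps_pct {
        Trend::Flat
    } else if delta < 0.0 {
        Trend::Faster
    } else {
        Trend::Slower
    }
}

/// Path mix counts as stable when every rate moved by at most `eps`.
fn path_mix_stable(old_rates: &[f64], new_rates: &[f64], eps: f64) -> bool {
    old_rates.len() == new_rates.len()
        && old_rates
            .iter()
            .zip(new_rates)
            .all(|(a, b)| (a - b).abs() <= eps)
}

fn main() {
    // With a 0.2% deadzone, a 0.1% slowdown is reported as flat.
    println!("{:?}", ns_trend(100.0, 100.1, 0.2));
    println!("{:?}", ns_trend(100.0, 99.0, 0.2));
    // Tiny path-rate jitter passes with eps = 0.002 but not with eps = 1e-9.
    let old = [0.100, 0.0200];
    let new = [0.101, 0.0205];
    println!("{}", path_mix_stable(&old, &new, 0.002));
    println!("{}", path_mix_stable(&old, &new, 1e-9));
}
```

Combining the two checks is what lets a run be called a stable speedup or regression: the ns/op trend is non-flat while the path mix is unchanged within tolerance.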
For an automated multi-run lookup decision gate that repeatedly executes the
lookup helper and then prints a final relative verdict (keep/reject/
inconclusive) for one operation, use:
bash scripts/bench_lookup_verdict_gate.sh

Useful knobs:
- HARRIER_BENCH_VERDICT_GATE_RUNS (default 3)
- HARRIER_BENCH_LOOKUP_VERDICT_LIMIT (default 5; rolling diff files used)
- HARRIER_BENCH_LOOKUP_VERDICT_SCOPE (default current_window; set to matching_history to compute verdict from the rolling comparable history instead of only the diff sidecars generated by the current gate invocation)
- HARRIER_BENCH_LOOKUP_VERDICT_OP (default find_hit)
- HARRIER_BENCH_LOOKUP_VERDICT_MIN_RUNS (default 3)
- HARRIER_BENCH_LOOKUP_VERDICT_MIN_NET (default 1)
- HARRIER_BENCH_LOOKUP_VERDICT_MAX_REGRESSIONS (default 1)
- HARRIER_BENCH_LOOKUP_VERDICT_MAX_PATH_SHIFT (default 1)
- HARRIER_BENCH_VERDICT_FAIL_ON_REJECT (default 0; set to 1 to exit non-zero on a combined reject verdict)
- HARRIER_BENCH_VERDICT_FAIL_ON_INSUFFICIENT_DATA (default 0; set to 1 to fail when combined verdict is insufficient_data / unknown)
- HARRIER_BENCH_VERDICT_FAIL_ON_INCONCLUSIVE (default 0; set to 1 to fail when combined verdict is inconclusive)
- HARRIER_BENCH_VERDICT_FAIL_ON_VERDICT_DEGRADE (default 0; set to 1 to fail when the combined verdict rank worsens versus the previous matching verdict sidecar)
- HARRIER_BENCH_VERDICT_FAIL_ON_NET_DROP (default 0; set to 1 to fail when the combined net relative score drops versus the previous matching verdict sidecar)
- HARRIER_BENCH_VERDICT_NET_DROP_MIN (default 1; minimum drop threshold used by ...FAIL_ON_NET_DROP)
- HARRIER_BENCH_VERDICT_FAIL_ON_PATH_SHIFT (default 0; set to 1 to fail when combined relative path-shift runs exceed the configured max)
- HARRIER_BENCH_VERDICT_PATH_SHIFT_MAX (default 0; maximum allowed combined relative path-shift runs when fail-on-path-shift is enabled)
- HARRIER_BENCH_VERDICT_EARLY_STOP_ON_NON_KEEP (default 1; after each gate run once HARRIER_BENCH_LOOKUP_VERDICT_MIN_RUNS is reached, compute a provisional combined verdict on currently selected lookup-diff sidecars and stop early when that verdict is already terminal under active fail flags (reject/inconclusive/insufficient-data/unknown))
- HARRIER_BENCH_CANDIDATE_LABEL (default empty; optional label recorded in verdict sidecars/log output for experiment traceability)

The helper also writes a sidecar verdict report next to the latest fixed TSV as <stamp>-fixed.lookup_verdict.tsv. If a previous matching verdict sidecar exists, it also prints and records the combined verdict delta (previous -> current). The verdict sidecar also records the active verdict_scope and the count of selected lookup-diff sidecars used to compute the verdict, plus git_commit, candidate_label, verdict_gate_runs_requested, verdict_gate_runs_completed, verdict_gate_early_stopped, verdict_gate_early_stop_verdict, and combined_path_shift_runs for run traceability. If no lookup-diff sidecars are available/selected, it writes an insufficient_data verdict sidecar and honors HARRIER_BENCH_VERDICT_FAIL_ON_INSUFFICIENT_DATA.
For strict candidate experimentation (auto-fail on reject, inconclusive, insufficient data, verdict degradation, or net-score drops across scoped windows), use:
bash scripts/bench_lookup_candidate_gate.sh

This wrapper defaults to:
- HARRIER_BENCH_LOOKUP_VERDICT_SCOPE=current_window
- HARRIER_BENCH_VERDICT_GATE_RUNS=3
- HARRIER_BENCH_LOOKUP_VERDICT_LIMIT=$HARRIER_BENCH_VERDICT_GATE_RUNS
- HARRIER_BENCH_LOOKUP_VERDICT_MIN_RUNS=$HARRIER_BENCH_VERDICT_GATE_RUNS
- HARRIER_BENCH_ONLY_OP=find_hit (focused single-op candidate loop)
- HARRIER_BENCH_INCLUDE_CONTAINS=0
- HARRIER_BENCH_INCLUDE_PREHASHED=0
- HARRIER_BENCH_VERDICT_FAIL_ON_REJECT=1
- HARRIER_BENCH_VERDICT_FAIL_ON_INSUFFICIENT_DATA=1
- HARRIER_BENCH_VERDICT_FAIL_ON_INCONCLUSIVE=1
- HARRIER_BENCH_VERDICT_FAIL_ON_VERDICT_DEGRADE=1
- HARRIER_BENCH_VERDICT_FAIL_ON_NET_DROP=1
- HARRIER_BENCH_VERDICT_NET_DROP_MIN=1
- HARRIER_BENCH_VERDICT_FAIL_ON_PATH_SHIFT=1
- HARRIER_BENCH_VERDICT_PATH_SHIFT_MAX=0
- HARRIER_BENCH_CANDIDATE_LABEL=lookup-candidate-<git-short-sha> (unless explicitly overridden)
To run the same strict gate focused on a single implementation, use:
bash scripts/bench_lookup_candidate_gate_generic.sh
bash scripts/bench_lookup_candidate_gate_u64.sh

These wrappers set:
- HARRIER_BENCH_ONLY_IMPL=harrier,hashbrown (generic wrapper) or HARRIER_BENCH_ONLY_IMPL=harrier_u64,hashbrown (u64 wrapper)
- HARRIER_BENCH_VERDICT_GATE_RUNS=4
- HARRIER_BENCH_LOOKUP_VERDICT_LIMIT=3
- HARRIER_BENCH_LOOKUP_VERDICT_MIN_RUNS=3 (uses one extra run to produce a full 3 diff sidecars in fresh impl-scoped windows)
- default labels: lookup-candidate-generic-<git-short-sha>, lookup-candidate-u64-<git-short-sha>
To evaluate both impl-focused strict wrappers as one candidate step, use:
bash scripts/bench_lookup_candidate_dual_gate.sh

This runs:
- bench_lookup_candidate_gate_generic.sh with label suffix -generic
- bench_lookup_candidate_gate_u64.sh with label suffix -u64

and writes <stamp>-lookup-candidate-dual.tsv with per-scope runner status,
combined verdict, net score, and path-shift runs. By default it runs both scopes and then fails if any scope
wrapper failed (HARRIER_BENCH_CANDIDATE_REQUIRE_SUCCESS=1).
Set HARRIER_BENCH_CANDIDATE_FAIL_FAST=1 to exit immediately on the first
failed scope.
For confirmation windows on top of the dual candidate gate, use:
bash scripts/bench_lookup_candidate_dual_recheck_gate.sh

Defaults:
- HARRIER_BENCH_LOOKUP_DUAL_CONFIRM_WINDOWS=2
- HARRIER_BENCH_LOOKUP_DUAL_REQUIRE_PASS=1
- HARRIER_BENCH_CANDIDATE_LABEL=lookup-dual-confirm-<git-short-sha> (appends -w1, -w2, ...)
- HARRIER_BENCH_LOOKUP_DUAL_RECHECK_REPORT_DIR=benchmarks/results (writes <stamp>-lookup-candidate-dual-recheck.tsv) and records both runner status (wrapper exit) and final dual status
To summarize dual candidate and dual recheck sidecars, use:
python3 scripts/bench_lookup_dual_outcomes.py --kind all --limit 20

Useful filters:
- --kind dual|dual_recheck|all
- --label-filter <substring>
- --limit <N> (-1 keeps all rows)
- --view raw|latest_label|label_stats (latest_label / label_stats include latest git commit metadata)
For stricter promotion discipline, run confirmation windows that repeatedly
invoke the strict lookup candidate gate and require combined_verdict=keep in
each window:
bash scripts/bench_lookup_candidate_recheck_gate.sh

This helper defaults to:
- HARRIER_BENCH_CANDIDATE_CONFIRM_WINDOWS=2
- HARRIER_BENCH_CANDIDATE_REQUIRE_KEEP=1
- HARRIER_BENCH_CANDIDATE_LABEL=lookup-candidate-confirm-<git-short-sha> (it appends -w1, -w2, ... per window)
- HARRIER_BENCH_CANDIDATE_RECHECK_REPORT_DIR=benchmarks/results (writes <stamp>-lookup-candidate-recheck.tsv with per-window outcomes + final pass/fail status)
Useful overrides:
- set HARRIER_BENCH_CANDIDATE_CONFIRM_WINDOWS higher for deeper confirmation.
- set HARRIER_BENCH_CANDIDATE_REQUIRE_KEEP=0 to run confirmation windows without enforcing keep (diagnostic mode).
For a standardized pinned insert checkpoint gate (fixed hasher, insert_new,
reuse+pretouch, phase timing + checkpoints, process repeats), use:
bash scripts/bench_insert_checkpoint_gate.sh

All knobs in this helper are also overridable via environment variables.
For an automated multi-run insert verdict gate that repeatedly executes the
checkpoint helper and then prints the latest # insert_phase_verdict gate
classification, use:
bash scripts/bench_insert_verdict_gate.sh

Useful knobs:
- HARRIER_BENCH_INSERT_VERDICT_GATE_RUNS (default 3)
- HARRIER_BENCH_INSERT_VERDICT_LIMIT (default 5; rolling repeat windows)
- HARRIER_BENCH_INSERT_VERDICT_SCOPE (default current_window; set to matching_history to compute verdict from comparable historical windows)
- HARRIER_BENCH_INSERT_VERDICT_LIMIT_ROWS (default 5)
- HARRIER_BENCH_INSERT_VERDICT_FAIL_ON_BLOCK (default 0; set to 1 to exit non-zero when verdict is blocked/inconclusive)
- HARRIER_BENCH_INSERT_VERDICT_FAIL_ON_UNKNOWN (default 0; set to 1 to fail when no insert phase gate could be extracted)
- HARRIER_BENCH_CANDIDATE_LABEL (default empty; optional label recorded in verdict sidecars/log output for experiment traceability)
- HARRIER_BENCH_PROCESS_REPEAT_EVENT_MODE / ..._SPIKE_RATIO / ..._MAD_Z / ..._QUANTILE_CUTOFF for persistence event detection mode.

The helper also writes a sidecar verdict report next to the latest fixed TSV as <stamp>-fixed.insert_verdict.tsv, including verdict_scope and selected repeat-window counts, git_commit, candidate_label, and (when available) previous gate + gate delta. If no repeat sidecars are available/selected, it writes an unknown gate sidecar and honors HARRIER_BENCH_INSERT_VERDICT_FAIL_ON_UNKNOWN.
For strict insert-candidate gating (auto-fail on blocked or unknown gate), use:
bash scripts/bench_insert_candidate_gate.sh

This wrapper defaults to:
- HARRIER_BENCH_INSERT_VERDICT_SCOPE=current_window
- HARRIER_BENCH_INSERT_VERDICT_GATE_RUNS=3
- HARRIER_BENCH_INSERT_VERDICT_LIMIT=$HARRIER_BENCH_INSERT_VERDICT_GATE_RUNS
- HARRIER_BENCH_INSERT_VERDICT_FAIL_ON_BLOCK=1
- HARRIER_BENCH_INSERT_VERDICT_FAIL_ON_UNKNOWN=1
- HARRIER_BENCH_CANDIDATE_LABEL=insert-candidate-<git-short-sha> (unless explicitly overridden)
For insert-side confirmation windows (re-running strict insert candidate gates
and requiring a specific insert_phase_gate each time), use:
bash scripts/bench_insert_candidate_recheck_gate.sh

This helper defaults to:
- HARRIER_BENCH_INSERT_CANDIDATE_CONFIRM_WINDOWS=2
- HARRIER_BENCH_INSERT_CANDIDATE_REQUIRE_GATE=allow_segment_targeting
- HARRIER_BENCH_CANDIDATE_LABEL=insert-candidate-confirm-<git-short-sha> (it appends -w1, -w2, ... per window)
- HARRIER_BENCH_INSERT_CANDIDATE_RECHECK_REPORT_DIR=benchmarks/results (writes <stamp>-insert-candidate-recheck.tsv with per-window gates + final pass/fail status)
Useful overrides:
- set HARRIER_BENCH_INSERT_CANDIDATE_CONFIRM_WINDOWS higher for deeper confirmation windows.
- set HARRIER_BENCH_INSERT_CANDIDATE_REQUIRE_GATE= (empty) to disable gate equality enforcement and use it as a diagnostics runner.
To inspect recent candidate verdict outcomes across lookup/insert sidecars in a single TSV summary, use:
python3 scripts/bench_candidate_outcomes.py --kind all --limit 30

Useful filters:
- --kind lookup|insert|all
- --label-filter <substring>
- --limit <N> (-1 keeps all rows)
- --view raw|latest_label|label_stats
- --lookup-scope combined|harrier_combined|harrier|harrier_u64 (lookup rows can be summarized from any verdict scope, default combined)
- latest_label collapses to newest sidecar per kind+candidate_label with a sample_count column, latest git_commit, and selected lookup scope (plus insert alignment/outlier-shape columns for insert sidecars).
- label_stats reports per-label outcome counts (keep/reject/...) plus the latest outcome/commit/scope/path for each label (plus insert alignment/outlier-shape fields when available).
To build a cross-script candidate status board (lookup/insert verdicts, recheck outcomes, and dual gate outcomes) use:
python3 scripts/bench_candidate_status_board.py --limit 50Useful filters:
--label-filter <substring>--label-filter-exact(requires--label-filter; use exact label matches instead of substring matching)--label-base-filter <substring>--label-base-filter-exact(requires--label-base-filter; matches normalized base labels, stripping-recheckand timestamp suffixes; when available, eval/autotriage metadatacandidate_label_basevalues are used)--requested-label-filter <substring>(filters onu64_autotriage_requested_candidate_label; useful for confirmation-streak triage where executed labels may include-streakN)--requested-label-filter-exact(requires--requested-label-filter)--last-eval-label-filter <substring>(filters onu64_autotriage_last_eval_label)--last-eval-label-filter-exact(requires--last-eval-label-filter)--first-eval-outcome-filter <substring>(filters onu64_autotriage_first_eval_outcome)--first-eval-outcome-filter-exact(requires--first-eval-outcome-filter)--streak-actionable-before-failure-filter <substring>(filters onu64_autotriage_streak_actionable_before_failure)--streak-actionable-before-failure-filter-exact(requires--streak-actionable-before-failure-filter)--first-eval-actionable-then-failed-filter <substring>(filters on derivedu64_autotriage_first_eval_actionable_then_failed=yes|no|unknown)--first-eval-actionable-then-failed-filter-exact(requires--first-eval-actionable-then-failed-filter)--latest-failure-family-filter <substring>(filters on derivedu64_autotriage_latest_failure_family)--latest-failure-family-filter-exact(requires--latest-failure-family-filter; combined pipeline/recheck lookup+low-load families accept both canonical...+low_load_hit_geo_regressionand legacy...+pipeline_low_load_hit_geo_regression/...+recheck_low_load_hit_geo_regressionaliases)--only-first-eval-actionable(keeps rows whereu64_autotriage_first_eval_outcome=actionable)--only-first-eval-lookup-not-keep(keeps rows whereu64_autotriage_first_eval_outcomecontainslookup_not_keep)--only-first-eval-low-load-hit-geo-regression(keeps rows 
whereu64_autotriage_first_eval_outcomecontainslow_load_hit_geo_regression)--only-first-eval-failed(keeps rows whereu64_autotriage_first_eval_outcomestarts withpipeline_failedoractionable_recheck_failed)--only-first-eval-actionable-recheck-failed(keeps rows whereu64_autotriage_first_eval_outcomestarts withactionable_recheck_failed)--only-first-eval-actionable-recheck-failed-lookup-not-keep(keeps rows whereu64_autotriage_first_eval_outcome=actionable_recheck_failed_lookup_not_keep)--only-first-eval-actionable-recheck-failed-low-load-hit-geo-regression(keeps rows whereu64_autotriage_first_eval_outcome=actionable_recheck_failed_low_load_hit_geo_regression)--only-first-eval-actionable-recheck-failed-lookup-not-keep-and-low-load-hit-geo-regression(keeps rows whereu64_autotriage_first_eval_outcome=actionable_recheck_failed_lookup_not_keep_and_low_load_hit_geo_regression)--only-first-eval-pipeline-failed-lookup-not-keep(keeps rows whereu64_autotriage_first_eval_outcome=pipeline_failed_lookup_not_keep)--only-first-eval-pipeline-failed(keeps rows whereu64_autotriage_first_eval_outcomestarts withpipeline_failed)--only-first-eval-pipeline-failed-low-load-hit-geo-regression(keeps rows whereu64_autotriage_first_eval_outcome=pipeline_failed_low_load_hit_geo_regression)--only-first-eval-pipeline-failed-lookup-not-keep-and-low-load-hit-geo-regression(keeps rows whereu64_autotriage_first_eval_outcome=pipeline_failed_lookup_not_keep_and_low_load_hit_geo_regression)--only-latest-failure-family-pipeline-lookup-not-keep(keeps rows where latest autotriage failure family indicates pipeline lookup-not-keep, including combined pipeline lookup+low-load families)--only-latest-failure-family-pipeline-low-load-hit-geo-regression(keeps rows where latest autotriage failure family indicates pipeline low-load hit-geo regression, including combined pipeline lookup+low-load families)--only-latest-failure-family-pipeline-lookup-not-keep-and-low-load-hit-geo-regression(keeps rows where latest 
autotriage failure family indicates combined pipeline lookup-not-keep + low-load hit-geo regression)
- --only-latest-failure-family-recheck-lookup-not-keep (keeps rows where latest autotriage failure family indicates recheck lookup-not-keep, including combined recheck lookup+low-load families)
- --only-latest-failure-family-recheck-low-load-hit-geo-regression (keeps rows where latest autotriage failure family indicates recheck low-load hit-geo regression, including combined recheck lookup+low-load families)
- --only-latest-failure-family-recheck-lookup-not-keep-and-low-load-hit-geo-regression (keeps rows where latest autotriage failure family indicates combined recheck lookup-not-keep + low-load hit-geo regression)
- --only-first-eval-actionable-then-failed-known (keeps rows where the derived first-eval-actionable-then-failed state is yes or no)
- --only-first-eval-pipeline-verdict-gate-runs-completed-known (keeps rows where first_eval_outcome starts with pipeline_failed and autotriage pipeline verdict-gate runs-completed metadata is numeric)
- --only-first-eval-pipeline-verdict-gate-runs-completed-unknown (keeps rows where first_eval_outcome starts with pipeline_failed and autotriage pipeline verdict-gate runs-completed metadata is unknown)
- --only-first-eval-pipeline-verdict-gate-runs-completed-eq <N> (keeps rows where first_eval_outcome starts with pipeline_failed and autotriage pipeline verdict-gate runs-completed equals <N>)
- --only-first-eval-pipeline-verdict-gate-runs-completed-min <N> (keeps rows where first_eval_outcome starts with pipeline_failed and autotriage pipeline verdict-gate runs-completed is >= <N>)
- --only-first-eval-pipeline-verdict-gate-runs-completed-max <N> (keeps rows where first_eval_outcome starts with pipeline_failed and autotriage pipeline verdict-gate runs-completed is <= <N>)
- --only-first-eval-recheck-verdict-gate-runs-completed-known (keeps rows where first_eval_outcome starts with actionable_recheck_failed and autotriage recheck verdict-gate runs-completed metadata is
numeric)
- --only-first-eval-recheck-verdict-gate-runs-completed-unknown (keeps rows where first_eval_outcome starts with actionable_recheck_failed and autotriage recheck verdict-gate runs-completed metadata is unknown)
- --only-first-eval-recheck-verdict-gate-runs-completed-eq <N> (keeps rows where first_eval_outcome starts with actionable_recheck_failed and autotriage recheck verdict-gate runs-completed equals <N>)
- --only-first-eval-recheck-verdict-gate-runs-completed-min <N> (keeps rows where first_eval_outcome starts with actionable_recheck_failed and autotriage recheck verdict-gate runs-completed is >= <N>)
- --only-first-eval-recheck-verdict-gate-runs-completed-max <N> (keeps rows where first_eval_outcome starts with actionable_recheck_failed and autotriage recheck verdict-gate runs-completed is <= <N>)
- --only-first-eval-actionable-then-failed-unknown (keeps rows where the derived first-eval-actionable-then-failed state is unknown)
- --only-first-eval-actionable-then-failed-no (keeps rows where the derived first-eval-actionable-then-failed state is no)
- --only-first-eval-actionable-then-failed-yes (keeps rows where the derived first-eval-actionable-then-failed state is yes)
- --only-first-eval-actionable-then-failed-pipeline-verdict-gate-early-stopped (keeps rows where derived first-eval-actionable-then-failed is yes and autotriage pipeline verdict-gate metadata reports early-stop)
- --only-first-eval-actionable-then-failed-pipeline-verdict-gate-early-stopped-inconclusive (keeps rows where derived first-eval-actionable-then-failed is yes and autotriage pipeline verdict-gate early-stop verdict is inconclusive)
- --only-first-eval-actionable-then-failed-pipeline-verdict-gate-early-stopped-reject (keeps rows where derived first-eval-actionable-then-failed is yes and autotriage pipeline verdict-gate early-stop verdict is reject)
- --only-first-eval-actionable-then-failed-pipeline-verdict-gate-early-stopped-unknown (keeps rows where derived first-eval-actionable-then-failed is yes, the pipeline verdict-gate reports early-stop, and
the early-stop verdict is missing/unknown)
- --only-first-eval-actionable-then-failed-recheck-verdict-gate-early-stopped (keeps rows where derived first-eval-actionable-then-failed is yes and autotriage recheck verdict-gate metadata reports early-stop)
- --only-first-eval-actionable-then-failed-recheck-verdict-gate-early-stopped-inconclusive (keeps rows where derived first-eval-actionable-then-failed is yes and autotriage recheck verdict-gate early-stop verdict is inconclusive)
- --only-first-eval-actionable-then-failed-recheck-verdict-gate-early-stopped-reject (keeps rows where derived first-eval-actionable-then-failed is yes and autotriage recheck verdict-gate early-stop verdict is reject)
- --only-first-eval-actionable-then-failed-recheck-verdict-gate-early-stopped-unknown (keeps rows where derived first-eval-actionable-then-failed is yes, the recheck verdict-gate reports early-stop, and the early-stop verdict is missing/unknown)
- --only-first-eval-actionable-then-failed-pipeline-verdict-gate-runs-completed-known (keeps rows where derived first-eval-actionable-then-failed is yes and autotriage pipeline verdict-gate runs-completed is numeric)
- --only-first-eval-actionable-then-failed-pipeline-verdict-gate-runs-completed-unknown (keeps rows where derived first-eval-actionable-then-failed is yes and autotriage pipeline verdict-gate runs-completed is unknown)
- --only-first-eval-actionable-then-failed-pipeline-verdict-gate-runs-completed-eq <N> (keeps rows where derived first-eval-actionable-then-failed is yes and autotriage pipeline verdict-gate runs-completed equals <N>)
- --only-first-eval-actionable-then-failed-recheck-verdict-gate-runs-completed-known (keeps rows where derived first-eval-actionable-then-failed is yes and autotriage recheck verdict-gate runs-completed is numeric)
- --only-first-eval-actionable-then-failed-recheck-verdict-gate-runs-completed-unknown (keeps rows where derived first-eval-actionable-then-failed is yes and autotriage recheck verdict-gate runs-completed is
unknown)
- --only-first-eval-actionable-then-failed-recheck-verdict-gate-runs-completed-eq <N> (keeps rows where derived first-eval-actionable-then-failed is yes and autotriage recheck verdict-gate runs-completed equals <N>)
- --only-first-eval-actionable-then-failed-pipeline-verdict-gate-runs-completed-min <N> (keeps rows where derived first-eval-actionable-then-failed is yes and autotriage pipeline verdict-gate runs-completed is >= <N>)
- --only-first-eval-actionable-then-failed-pipeline-verdict-gate-runs-completed-max <N> (keeps rows where derived first-eval-actionable-then-failed is yes and autotriage pipeline verdict-gate runs-completed is <= <N>)
- --only-first-eval-actionable-then-failed-recheck-verdict-gate-runs-completed-min <N> (keeps rows where derived first-eval-actionable-then-failed is yes and autotriage recheck verdict-gate runs-completed is >= <N>)
- --only-first-eval-actionable-then-failed-recheck-verdict-gate-runs-completed-max <N> (keeps rows where derived first-eval-actionable-then-failed is yes and autotriage recheck verdict-gate runs-completed is <= <N>)
- --only-pipeline-early-stop-unknown (keeps rows where autotriage pipeline verdict-gate metadata reports early_stopped=yes and the recorded early-stop verdict is unknown/missing)
- --only-recheck-early-stop-unknown (keeps rows where autotriage recheck verdict-gate metadata reports early_stopped=yes and the recorded early-stop verdict is unknown/missing)
- --only-pipeline-verdict-gate-runs-completed-known (keeps rows where autotriage pipeline verdict-gate runs-completed metadata is numeric)
- --only-pipeline-verdict-gate-runs-completed-unknown (keeps rows where autotriage pipeline verdict-gate runs-completed metadata is unknown/missing)
- --only-recheck-verdict-gate-runs-completed-known (keeps rows where autotriage recheck verdict-gate runs-completed metadata is numeric)
- --only-recheck-verdict-gate-runs-completed-unknown (keeps rows where autotriage recheck verdict-gate runs-completed metadata is
unknown/missing)
- --only-pipeline-verdict-gate-runs-completed-eq <N> (keeps rows where autotriage pipeline verdict-gate runs-completed metadata equals <N>)
- --only-pipeline-verdict-gate-runs-completed-min <N> (keeps rows where autotriage pipeline verdict-gate runs-completed metadata is >= <N>)
- --only-pipeline-verdict-gate-runs-completed-max <N> (keeps rows where autotriage pipeline verdict-gate runs-completed metadata is <= <N>)
- --only-recheck-verdict-gate-runs-completed-eq <N> (keeps rows where autotriage recheck verdict-gate runs-completed metadata equals <N>)
- --only-recheck-verdict-gate-runs-completed-min <N> (keeps rows where autotriage recheck verdict-gate runs-completed metadata is >= <N>)
- --only-recheck-verdict-gate-runs-completed-max <N> (keeps rows where autotriage recheck verdict-gate runs-completed metadata is <= <N>)
- --limit <N> (-1 keeps all rows)
- --promotion-filter all|lookup_actionable|insert_actionable|any_actionable (uses derived lookup_promotion_status / insert_promotion_status columns to quickly focus on candidate labels that are promotable vs blocked)
- --promotion-filter strict_lookup_ready (requires lookup=keep plus a passed dual-gate row; surfaces stricter promotion-ready labels via lookup_strict_promotion_status)
- --promotion-filter u64_pipeline_failed|u64_pipeline_low_load_hit_geo_regressed|u64_pipeline_low_load_skipped|u64_pipeline_low_load_skipped_lookup_not_keep|u64_pipeline_lookup_gate_early_stopped|u64_pipeline_lookup_gate_early_stopped_inconclusive|u64_pipeline_lookup_gate_early_stopped_reject|u64_pipeline_lookup_gate_early_stopped_unknown|u64_pipeline_lookup_gate_runs_completed_known|u64_pipeline_lookup_gate_runs_completed_unknown|u64_pipeline_lookup_gate_runs_completed_eq3|u64_pipeline_lookup_gate_runs_completed_eq4|u64_pipeline_lookup_gate_runs_completed_in_range|u64_pipeline_actionable|u64_pipeline_actionable_confirmed (filters labels by the latest *-u64-candidate-pipeline.tsv outcome metadata; u64_pipeline_low_load_skipped surfaces rows where pipeline low-load stage was
skipped, and u64_pipeline_low_load_skipped_lookup_not_keep narrows this to explicit lookup-verdict-not-keep skip reasons recorded in u64_pipeline_low_load_skip_reason; u64_pipeline_lookup_gate_early_stopped surfaces rows where lookup verdict gate stopped before exhausting requested gate runs, with ..._inconclusive / ..._reject / ..._unknown as direct splits by recorded early-stop verdict; status-board rows also expose u64_pipeline_lookup_verdict_gate_runs_completed, u64_pipeline_lookup_verdict_gate_early_stopped, and u64_pipeline_lookup_verdict_gate_early_stop_verdict from the underlying lookup verdict sidecar metadata; u64_pipeline_lookup_gate_runs_completed_* promotion filters slice this same run-count metadata into known/unknown/eq3/eq4/in-range cohorts (the _in_range variant requires --only-pipeline-verdict-gate-runs-completed-min and/or --only-pipeline-verdict-gate-runs-completed-max); u64_pipeline_actionable requires an explicit non-regressed low-load hit-geo signal plus lookup_combined_verdict=keep; u64_pipeline_actionable_confirmed additionally requires latest u64_eval_outcome=actionable)
- --promotion-filter
u64_eval_failed|u64_eval_actionable|u64_eval_actionable_confirmed|u64_eval_actionable_recheck_failed|u64_eval_actionable_unconfirmed|u64_eval_pipeline_failed_lookup_not_keep|u64_eval_pipeline_failed_low_load_hit_geo_regression|u64_eval_pipeline_failed_lookup_not_keep_and_low_load_hit_geo_regression|u64_eval_actionable_recheck_failed_lookup_not_keep|u64_eval_actionable_recheck_failed_low_load_hit_geo_regression|u64_eval_actionable_recheck_failed_lookup_not_keep_and_low_load_hit_geo_regression|u64_eval_pipeline_lookup_not_keep|u64_eval_pipeline_low_load_hit_geo_regression|u64_eval_recheck_pipeline_failed|u64_eval_recheck_pipeline_lookup_not_keep|u64_eval_recheck_pipeline_low_load_hit_geo_regression|u64_autotriage_failed|u64_autotriage_actionable|u64_autotriage_actionable_confirmed|u64_autotriage_confirmation_complete|u64_autotriage_confirmation_incomplete|u64_autotriage_confirmation_multi_run|u64_autotriage_confirmation_floor_applied|u64_autotriage_confirmation_floor_not_applied|u64_autotriage_first_eval_actionable|u64_autotriage_first_eval_actionable_recheck_failed|u64_autotriage_first_eval_actionable_recheck_failed_lookup_not_keep|u64_autotriage_first_eval_actionable_recheck_failed_low_load_hit_geo_regression|u64_autotriage_first_eval_pipeline_failed_lookup_not_keep|u64_autotriage_first_eval_pipeline_failed_low_load_hit_geo_regression|u64_autotriage_first_eval_actionable_then_failed|u64_autotriage_first_eval_actionable_then_failed_yes|u64_autotriage_first_eval_actionable_then_failed_known|u64_autotriage_first_eval_actionable_then_failed_unknown|u64_autotriage_first_eval_actionable_then_failed_no|u64_autotriage_streak_actionable_before_failure|u64_autotriage_failed_lookup_not_keep|u64_autotriage_failed_low_load_hit_geo_regression|u64_autotriage_failed_lookup_not_keep_and_low_load_hit_geo_regression|u64_autotriage_failed_pipeline_lookup_not_keep|u64_autotriage_failed_pipeline_low_load_hit_geo_regression|u64_autotriage_failed_pipeline_lookup_not_keep_and_low_load_hit_geo_regression|u64_autotriage_failed_recheck_lookup_not_keep|u64_autotriage_failed_recheck_low_load_hit_geo_regression|u64_autotriage_failed_recheck_lookup_not_keep_and_low_load_hit_geo_regression (additional first-eval split promotion filters: u64_autotriage_first_eval_failed, u64_autotriage_first_eval_lookup_not_keep, u64_autotriage_first_eval_low_load_hit_geo_regression, u64_autotriage_first_eval_pipeline_failed, u64_autotriage_first_eval_pipeline_verdict_gate_early_stopped, u64_autotriage_first_eval_pipeline_verdict_gate_early_stopped_inconclusive, u64_autotriage_first_eval_pipeline_verdict_gate_early_stopped_reject, u64_autotriage_first_eval_pipeline_verdict_gate_early_stopped_unknown, u64_autotriage_first_eval_pipeline_verdict_gate_runs_completed_known, u64_autotriage_first_eval_pipeline_verdict_gate_runs_completed_unknown, u64_autotriage_first_eval_pipeline_verdict_gate_runs_completed_eq3, u64_autotriage_first_eval_pipeline_verdict_gate_runs_completed_eq4, u64_autotriage_first_eval_pipeline_verdict_gate_runs_completed_in_range, u64_autotriage_first_eval_recheck_pipeline_verdict_gate_early_stopped, u64_autotriage_first_eval_recheck_pipeline_verdict_gate_early_stopped_inconclusive, u64_autotriage_first_eval_recheck_pipeline_verdict_gate_early_stopped_reject, u64_autotriage_first_eval_recheck_pipeline_verdict_gate_early_stopped_unknown, u64_autotriage_first_eval_recheck_pipeline_verdict_gate_runs_completed_known, u64_autotriage_first_eval_recheck_pipeline_verdict_gate_runs_completed_unknown, u64_autotriage_first_eval_recheck_pipeline_verdict_gate_runs_completed_eq3, u64_autotriage_first_eval_recheck_pipeline_verdict_gate_runs_completed_eq4, u64_autotriage_first_eval_recheck_pipeline_verdict_gate_runs_completed_in_range, u64_autotriage_first_eval_actionable_recheck_failed_verdict_gate_early_stopped, u64_autotriage_first_eval_actionable_recheck_failed_verdict_gate_early_stopped_inconclusive, u64_autotriage_first_eval_actionable_recheck_failed_verdict_gate_early_stopped_reject, u64_autotriage_first_eval_actionable_recheck_failed_verdict_gate_early_stopped_unknown, u64_autotriage_first_eval_actionable_recheck_failed_verdict_gate_runs_completed_known, u64_autotriage_first_eval_actionable_recheck_failed_verdict_gate_runs_completed_unknown, u64_autotriage_first_eval_actionable_recheck_failed_verdict_gate_runs_completed_eq3, u64_autotriage_first_eval_actionable_recheck_failed_verdict_gate_runs_completed_eq4, u64_autotriage_first_eval_actionable_recheck_failed_verdict_gate_runs_completed_in_range, u64_autotriage_first_eval_actionable_then_failed_pipeline_verdict_gate_runs_completed_in_range, u64_autotriage_first_eval_actionable_then_failed_recheck_verdict_gate_runs_completed_in_range, u64_autotriage_first_eval_actionable_then_failed_pipeline_verdict_gate_early_stopped, u64_autotriage_first_eval_actionable_then_failed_pipeline_verdict_gate_early_stopped_inconclusive, u64_autotriage_first_eval_actionable_then_failed_pipeline_verdict_gate_early_stopped_reject, u64_autotriage_first_eval_actionable_then_failed_pipeline_verdict_gate_early_stopped_unknown, u64_autotriage_first_eval_actionable_then_failed_recheck_verdict_gate_early_stopped, u64_autotriage_first_eval_actionable_then_failed_recheck_verdict_gate_early_stopped_inconclusive, u64_autotriage_first_eval_actionable_then_failed_recheck_verdict_gate_early_stopped_reject, u64_autotriage_first_eval_actionable_then_failed_recheck_verdict_gate_early_stopped_unknown, u64_autotriage_first_eval_actionable_recheck_failed_lookup_not_keep_and_low_load_hit_geo_regression, and u64_autotriage_first_eval_pipeline_failed_lookup_not_keep_and_low_load_hit_geo_regression) (additional autotriage early-stop promotion
filters: u64_autotriage_pipeline_verdict_gate_early_stopped, u64_autotriage_pipeline_verdict_gate_early_stopped_inconclusive, u64_autotriage_pipeline_verdict_gate_early_stopped_reject, u64_autotriage_pipeline_verdict_gate_early_stopped_unknown, u64_autotriage_recheck_pipeline_verdict_gate_early_stopped, u64_autotriage_recheck_pipeline_verdict_gate_early_stopped_inconclusive, u64_autotriage_recheck_pipeline_verdict_gate_early_stopped_reject, u64_autotriage_recheck_pipeline_verdict_gate_early_stopped_unknown, u64_autotriage_pipeline_verdict_gate_runs_completed_known, u64_autotriage_pipeline_verdict_gate_runs_completed_unknown, u64_autotriage_pipeline_verdict_gate_runs_completed_eq3, u64_autotriage_pipeline_verdict_gate_runs_completed_eq4, u64_autotriage_pipeline_verdict_gate_runs_completed_in_range, u64_autotriage_recheck_pipeline_verdict_gate_runs_completed_known, u64_autotriage_recheck_pipeline_verdict_gate_runs_completed_unknown, u64_autotriage_recheck_pipeline_verdict_gate_runs_completed_eq3, u64_autotriage_recheck_pipeline_verdict_gate_runs_completed_eq4, and u64_autotriage_recheck_pipeline_verdict_gate_runs_completed_in_range) (filters labels by the latest *-u64-candidate-eval.tsv outcome metadata, including automatic actionable recheck outcomes; u64_eval_failed also includes actionable_unconfirmed and status_board_missing_label; the u64_eval_pipeline_* and u64_eval_recheck_pipeline_* filters classify eval rows by first-pass vs recheck pipeline reasons, using eval sidecar row-count columns (with final-message fallback). This includes u64_eval_pipeline_low_load_skipped and u64_eval_pipeline_low_load_skipped_lookup_not_keep, sourced from eval-side pipeline skip snapshot row counts, plus u64_eval_pipeline_verdict_gate_early_stopped* and u64_eval_recheck_pipeline_verdict_gate_early_stopped* filters sourced from eval-side verdict-gate metadata columns.
Additional scoped variants u64_eval_pipeline_failed_verdict_gate_early_stopped* and u64_eval_actionable_recheck_failed_verdict_gate_early_stopped* narrow those early-stop filters to pipeline_failed* and actionable_recheck_failed* outcomes respectively. Run-count scoped variants (u64_eval_pipeline_failed_verdict_gate_runs_completed_{known,unknown,eq3,eq4,in_range} and u64_eval_actionable_recheck_failed_verdict_gate_runs_completed_{known,unknown,eq3,eq4,in_range}) provide matching filters over eval-side verdict-gate run-count metadata; the _in_range variants require at least one corresponding bound flag: --only-eval-pipeline-failed-verdict-gate-runs-completed-{min,max} or --only-eval-actionable-recheck-failed-verdict-gate-runs-completed-{min,max}. Status-board rows now also include u64_autotriage_latest_confirmed_actionable_rows and u64_autotriage_latest_failure_family, and the u64_autotriage_failed_* filters (including pipeline-only and recheck-only variants) consult that family value so legacy rows are classified consistently. Rows also include u64_autotriage_confirmation_runs, u64_autotriage_confirmation_runs_requested, u64_autotriage_min_confirmation_runs_on_success, u64_autotriage_completed_eval_runs, u64_autotriage_last_eval_label, u64_autotriage_requested_candidate_label, and u64_autotriage_latest_candidate_label metadata from autotriage sidecars, plus run-sequence diagnostics: u64_autotriage_first_eval_label, u64_autotriage_first_eval_outcome, u64_autotriage_first_eval_exit_status, u64_autotriage_first_eval_actionable_then_failed, u64_autotriage_streak_actionable_before_failure, and u64_autotriage_run_outcome_sequence. Rows also include latest eval verdict-gate metadata from autotriage
sidecars: u64_autotriage_pipeline_lookup_verdict_gate_runs_completed, u64_autotriage_pipeline_lookup_verdict_gate_early_stopped, u64_autotriage_pipeline_lookup_verdict_gate_early_stop_verdict, u64_autotriage_recheck_pipeline_lookup_verdict_gate_runs_completed, u64_autotriage_recheck_pipeline_lookup_verdict_gate_early_stopped, and u64_autotriage_recheck_pipeline_lookup_verdict_gate_early_stop_verdict. The u64_autotriage_first_eval_actionable, u64_autotriage_first_eval_failed, u64_autotriage_first_eval_lookup_not_keep, u64_autotriage_first_eval_low_load_hit_geo_regression, u64_autotriage_first_eval_actionable_recheck_failed, u64_autotriage_first_eval_actionable_recheck_failed_lookup_not_keep, u64_autotriage_first_eval_actionable_recheck_failed_low_load_hit_geo_regression, u64_autotriage_first_eval_actionable_recheck_failed_lookup_not_keep_and_low_load_hit_geo_regression, u64_autotriage_first_eval_pipeline_failed, u64_autotriage_first_eval_pipeline_failed_lookup_not_keep, u64_autotriage_first_eval_pipeline_failed_lookup_not_keep_and_low_load_hit_geo_regression, and u64_autotriage_first_eval_actionable_then_failed promotion filters are useful for isolating streak-instability candidates where run 1 looked promotable but later confirmation runs failed. The first-eval pipeline/recheck verdict-gate early-stop filters provide the same view with explicit inconclusive vs reject splits while keeping first-eval scope constraints (pipeline_failed* vs actionable_recheck_failed*) intact. The first-eval run-count promotion filters provide quick known/eq/range splits for first-pass (pipeline_failed*) and recheck (actionable_recheck_failed*) verdict-gate metadata directly on status-board rows.
The _in_range variants require at least one matching min/max flag for the corresponding row filter family: --only-first-eval-pipeline-verdict-gate-runs-completed-{min,max}, --only-first-eval-recheck-verdict-gate-runs-completed-{min,max}, --only-first-eval-actionable-then-failed-pipeline-verdict-gate-runs-completed-{min,max}, or --only-first-eval-actionable-then-failed-recheck-verdict-gate-runs-completed-{min,max}. The _yes, _known, _unknown, and _no variants help separate first-eval instability rows from legacy rows lacking first-eval telemetry and from rows that are instrumented but stable. Additional u64_autotriage_first_eval_actionable_then_failed_{pipeline,recheck}_verdict_gate_runs_completed_{known,unknown,eq3,eq4,in_range} filters narrow that instability cohort to rows with numeric verdict-gate run-count metadata, explicit unknown metadata buckets, exact =3/=4 completion, or configured min/max ranges for first-pass vs recheck confirmation lanes. Additional u64_autotriage_first_eval_actionable_then_failed_{pipeline,recheck}_verdict_gate_early_stopped{,_inconclusive,_reject} filters narrow that same instability cohort to rows where pipeline/recheck verdict gates early-stopped, with optional explicit early-stop verdict splits. The u64_autotriage_*_verdict_gate_early_stopped* filters are useful for splitting those first-eval failures by recorded early-stop verdict class (inconclusive vs reject) without leaving status-board workflows. Rows also include a derived u64_autotriage_confirmation_state column (single_run_complete, single_run_incomplete, multi_run_complete, multi_run_incomplete) and u64_autotriage_confirmation_floor_applied (yes/no/unknown) (u64_autotriage_confirmation_incomplete catches rows where completed runs are fewer than requested confirmation runs))
- --max-age-hours <N> (optional recency filter based on sidecar timestamp prefixes)
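The row filters above mostly reduce to prefix, substring, and bounded-integer checks on status-board columns. The following Python sketch restates a few of them as predicates to make the semantics concrete. It is not the status-board script's code: the function names are hypothetical, and treating any non-numeric runs-completed value as unknown (and excluding it from range matches) is an assumption.

```python
from typing import Optional


def first_eval_failed(outcome: str) -> bool:
    # Documented rule for --only-first-eval-failed: prefix match on either family.
    return outcome.startswith(("pipeline_failed", "actionable_recheck_failed"))


def first_eval_lookup_not_keep(outcome: str) -> bool:
    # Documented rule for --only-first-eval-lookup-not-keep: substring match.
    return "lookup_not_keep" in outcome


def parse_runs_completed(raw: str) -> Optional[int]:
    # Assumption: anything that is not a plain non-negative integer is "unknown".
    raw = raw.strip()
    return int(raw) if raw.isdigit() else None


def runs_completed_in_range(raw: str,
                            lo: Optional[int] = None,
                            hi: Optional[int] = None) -> bool:
    # The _in_range variants require at least one bound (min and/or max).
    if lo is None and hi is None:
        raise ValueError("in_range needs at least one of min/max")
    n = parse_runs_completed(raw)
    if n is None:
        return False  # assumption: unknown metadata never matches a range
    if lo is not None and n < lo:
        return False
    if hi is not None and n > hi:
        return False
    return True
```

With these definitions, an outcome such as actionable_unconfirmed is not counted as a first-eval failure, because it matches neither documented prefix.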
To rank low-load regression candidates from status-board rows, use:
python3 scripts/bench_low_load_candidate_pick.py --limit 20
Useful controls:
- --promotion-filter all|cycle_low_load_picker_score_below|cycle_low_load_picker_score_unknown|cycle_low_load_picker_score_below_threshold|cycle_low_load_picker_score_unknown_or_below_threshold
- --cycle-low-load-picker-score-min <N> (used with score-threshold filters)
- --inject-labels <csv> (force-include specific labels, useful for ensuring the just-produced low-load label appears in ranking output)
- --strict / --no-strict (effective-load parity requirement)
- --exclude-label-substrings <csv>
- --max-age-hours <N>
scripts/bench_lookup_candidate_cycle.sh now forwards these controls through:
- HARRIER_BENCH_LOW_LOAD_PICKER_PROMOTION_FILTER (auto by default; auto resolves to cycle_low_load_picker_score_unknown_or_below_threshold when HARRIER_BENCH_CYCLE_FAIL_ON_LOW_LOAD_PICKER_SCORE_BELOW=1, else all)
- HARRIER_BENCH_CYCLE_LOW_LOAD_PICKER_SCORE_MIN
- HARRIER_BENCH_RUN_LOOKUP_GATE (1 by default; set to 0 to skip the lookup gate stage and run only downstream cycle stages such as the low-load curve and picker)
- HARRIER_BENCH_LOW_LOAD_PICKER_INJECT_CURRENT_LABEL (1 by default; when enabled, automatically injects the current low-load-post-<candidate> label into picker ranking output even if promotion filters would exclude it)
- HARRIER_BENCH_LOW_LOAD_PICKER_INJECT_LABELS (optional comma-separated additional labels to inject into picker ranking output)
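The auto resolution rule for the picker promotion filter can be written down in a few lines. This Python sketch only mirrors the rule as documented above; the function name is hypothetical and the cycle script's real implementation may differ.

```python
import os
from typing import Mapping, Optional


def resolve_picker_promotion_filter(env: Optional[Mapping[str, str]] = None) -> str:
    """Resolve the effective picker promotion filter from environment settings."""
    env = os.environ if env is None else env
    requested = env.get("HARRIER_BENCH_LOW_LOAD_PICKER_PROMOTION_FILTER", "auto")
    if requested != "auto":
        # An explicit value always wins over the auto rule.
        return requested
    if env.get("HARRIER_BENCH_CYCLE_FAIL_ON_LOW_LOAD_PICKER_SCORE_BELOW") == "1":
        return "cycle_low_load_picker_score_unknown_or_below_threshold"
    return "all"
```

Passing a plain dict instead of the process environment makes the rule easy to check without exporting any variables.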
For a u64-focused adaptive low-load cycle with these stricter picker defaults pre-wired, use:
bash scripts/bench_lookup_candidate_cycle_u64_low_load.sh
This wrapper defaults to:
- HARRIER_BENCH_LOOKUP_SCOPE=u64
- HARRIER_BENCH_CYCLE_ONLY_IMPL=harrier_u64,hashbrown
- HARRIER_BENCH_CYCLE_CONTINUE_ON_LOOKUP_FAILURE=1
- HARRIER_BENCH_CYCLE_FAIL_ON_LOOKUP_FAILURE=0
- HARRIER_BENCH_RUN_INSERT_MONITOR=0
- HARRIER_BENCH_RUN_LOW_LOAD_CURVE=1
- HARRIER_BENCH_RUN_LOW_LOAD_PICKER=1
- HARRIER_BENCH_LOW_LOAD_PICKER_PROMOTION_FILTER=auto
- HARRIER_BENCH_CYCLE_FAIL_ON_LOW_LOAD_PICKER_SCORE_BELOW=1
- HARRIER_BENCH_CYCLE_LOW_LOAD_PICKER_SCORE_MIN=15
- HARRIER_BENCH_CYCLE_FAIL_ON_LOW_LOAD_HIT_GEO_REGRESSION=1
- HARRIER_BENCH_CYCLE_LOW_LOAD_HIT_GEO_MIN=1.0
- HARRIER_BENCH_RUN_STATUS_BOARD_SUMMARY=0
All wrapper defaults remain overridable via environment variables.
To run a u64 candidate loop that first executes the strict u64 lookup gate and then runs the u64 low-load cycle with lookup reruns disabled, use:
bash scripts/bench_lookup_u64_candidate_pipeline.sh
This pipeline defaults to:
- HARRIER_BENCH_PIPELINE_RUN_LOOKUP_GATE=1
- HARRIER_BENCH_PIPELINE_RUN_LOW_LOAD_CYCLE=1
- HARRIER_BENCH_PIPELINE_REQUIRE_LOOKUP_PASS_FOR_LOW_LOAD=1
- HARRIER_BENCH_PIPELINE_REQUIRE_LOOKUP_KEEP=1
- HARRIER_BENCH_PIPELINE_SKIP_LOW_LOAD_WHEN_LOOKUP_NOT_KEEP=1 (when lookup gate passes but combined verdict is not keep, skip low-load cycle stage to fail fast in strict loops)
- HARRIER_BENCH_PIPELINE_FAIL_ON_LOW_LOAD_HIT_GEO_REGRESSION=1
- HARRIER_BENCH_PIPELINE_VALIDATE_PROMOTION_PARITY=1 (default; validates compact promotion-filter parity via scripts/check_autotriage_promotion_parity.py before pipeline execution; set to 0 to skip this preflight guard)
- low-load stage invocation via scripts/bench_lookup_candidate_cycle_u64_low_load.sh with HARRIER_BENCH_RUN_LOOKUP_GATE=0
- report output: <stamp>-<candidate>-u64-candidate-pipeline.tsv, including lookup verdict metadata and cycle low-load picker summary fields (low_load_skip_reason, lookup_verdict_gate_runs_completed, lookup_verdict_gate_early_stopped, lookup_verdict_gate_early_stop_verdict, cycle_low_load_picker_rows, top score, score-threshold flag, hit-geo ratio/regressed flag, etc.).
Note: when HARRIER_BENCH_PIPELINE_RUN_LOOKUP_GATE=0, the pipeline still runs
the low-load cycle stage (the "require lookup pass" check applies only when the
lookup gate stage is enabled).
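The interaction between the stage toggles and the lookup result is easier to see as a decision function. The Python sketch below encodes only what the defaults and the note above state. The class and function names are hypothetical, and two points are assumptions: the skip-on-not-keep rule is ignored when the lookup gate is disabled (no verdict exists), and HARRIER_BENCH_PIPELINE_REQUIRE_LOOKUP_KEEP is left out because its effect on stage ordering is not described here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PipelineConfig:
    # Field names mirror the HARRIER_BENCH_PIPELINE_* defaults listed above.
    run_lookup_gate: bool = True
    run_low_load_cycle: bool = True
    require_lookup_pass_for_low_load: bool = True
    skip_low_load_when_lookup_not_keep: bool = True


def should_run_low_load(cfg: PipelineConfig,
                        lookup_passed: Optional[bool],
                        lookup_verdict: Optional[str]) -> bool:
    """Decide whether the low-load cycle stage runs (illustrative only)."""
    if not cfg.run_low_load_cycle:
        return False
    if not cfg.run_lookup_gate:
        # Documented: with the lookup gate disabled, the low-load stage still runs.
        return True
    if cfg.require_lookup_pass_for_low_load and not lookup_passed:
        return False
    if (cfg.skip_low_load_when_lookup_not_keep
            and lookup_passed
            and lookup_verdict != "keep"):
        # Gate passed but the combined verdict is not keep: fail fast.
        return False
    return True
```

For example, the default configuration skips the low-load stage when the gate passes with a non-keep verdict, while clearing skip_low_load_when_lookup_not_keep lets it run for diagnostics.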
Diagnostic mode examples:
# Preview commands only
HARRIER_BENCH_PIPELINE_DRY_RUN=1 bash scripts/bench_lookup_u64_candidate_pipeline.sh
# Run low-load cycle even if lookup gate fails
HARRIER_BENCH_PIPELINE_REQUIRE_LOOKUP_PASS_FOR_LOW_LOAD=0 \
bash scripts/bench_lookup_u64_candidate_pipeline.sh
# Allow non-keep lookup verdicts to continue for diagnostics
HARRIER_BENCH_PIPELINE_REQUIRE_LOOKUP_KEEP=0 \
bash scripts/bench_lookup_u64_candidate_pipeline.sh
# Keep running low-load cycle even when lookup verdict is not keep
HARRIER_BENCH_PIPELINE_SKIP_LOW_LOAD_WHEN_LOOKUP_NOT_KEEP=0 \
bash scripts/bench_lookup_u64_candidate_pipeline.sh
To run the u64 pipeline plus automatic status-board triage snapshots for one candidate label, use:
bash scripts/bench_lookup_u64_candidate_eval.sh
This eval wrapper:
- runs bench_lookup_u64_candidate_pipeline.sh
- writes label-filtered status-board snapshots:
  - full label row(s)
  - u64_pipeline_failed filter
  - u64_pipeline_low_load_hit_geo_regressed filter
  - u64_pipeline_actionable filter
  - u64_pipeline_low_load_skipped filter
  - u64_pipeline_low_load_skipped_lookup_not_keep filter
- after writing its eval sidecar, also writes post-eval snapshots:
  - u64_eval_failed
  - u64_eval_actionable
  - u64_eval_actionable_recheck_failed
  - u64_pipeline_actionable_confirmed
  - u64_eval_pipeline_lookup_not_keep
  - u64_eval_pipeline_low_load_hit_geo_regression
  - u64_eval_recheck_pipeline_lookup_not_keep
  - u64_eval_recheck_pipeline_low_load_hit_geo_regression
- writes <stamp>-<candidate>-u64-candidate-eval.tsv linking all produced sidecars, row counts, and an eval_outcome classification (actionable, blocked_low_load_hit_geo_regressed, blocked_pipeline_failed, no_actionable_signal, pipeline_failed, pipeline_failed_lookup_not_keep, pipeline_failed_lookup_not_keep_and_low_load_hit_geo_regression, pipeline_failed_low_load_hit_geo_regression, actionable_recheck_failed, actionable_recheck_failed_lookup_not_keep, actionable_recheck_failed_lookup_not_keep_and_low_load_hit_geo_regression, actionable_recheck_failed_low_load_hit_geo_regression, actionable_unconfirmed, status_board_missing_label, or dry_run). The eval TSV also records pipeline/recheck final metadata fields (pipeline_final_status, pipeline_final_message, recheck_pipeline_final_status, recheck_pipeline_final_message) plus pipeline low-load skip snapshots/row counts (status_board_u64_pipeline_low_load_skipped_*) to speed triage. The metadata also includes candidate_label_base (defaults to the candidate label, but can be overridden by callers for grouped streak runs).
- defaults to exact label matching for status-board snapshots (HARRIER_BENCH_EVAL_LABEL_FILTER_EXACT=1), so <label>-recheck rows do not leak into base-label triage snapshots.
- by default (HARRIER_BENCH_EVAL_RECHECK_ON_ACTIONABLE=1), if the first pass classifies a candidate as actionable, the eval wrapper automatically runs a second pipeline pass with label suffix HARRIER_BENCH_EVAL_RECHECK_LABEL_SUFFIX (default -recheck) and downgrades the final result to actionable_recheck_failed if that recheck fails.
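The automatic recheck described in the last bullet can be sketched as a small control flow. In this Python sketch, run_pipeline is a hypothetical callable standing in for one pipeline pass that returns an outcome string, and counting any non-actionable recheck result as a failure is an assumption. The wrapper's finer-grained actionable_recheck_failed_* outcomes are not modeled.

```python
from typing import Callable


def evaluate_with_recheck(label: str,
                          run_pipeline: Callable[[str], str],
                          recheck_on_actionable: bool = True,
                          recheck_suffix: str = "-recheck") -> str:
    """Run one eval pass and, if it looks actionable, confirm it with a recheck."""
    outcome = run_pipeline(label)
    if outcome != "actionable" or not recheck_on_actionable:
        return outcome
    # Second pass under a suffixed label so its rows stay separate from the base label.
    recheck_outcome = run_pipeline(label + recheck_suffix)
    if recheck_outcome == "actionable":
        return "actionable"
    return "actionable_recheck_failed"
```

Because the recheck runs under a suffixed label, exact label matching keeps its rows out of the base-label snapshots.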
Useful controls:
- HARRIER_BENCH_EVAL_CONTINUE_ON_PIPELINE_FAILURE=1 (default)
- HARRIER_BENCH_EVAL_DRY_RUN=1
- HARRIER_BENCH_EVAL_RUN_STATUS_BOARD=0
- HARRIER_BENCH_EVAL_RUN_POST_EVAL_STATUS_BOARD=0 (skip post-eval u64_eval_* snapshots)
- HARRIER_BENCH_EVAL_REQUIRE_ACTIONABLE=1 (fail unless eval_outcome=actionable)
- HARRIER_BENCH_EVAL_REQUIRE_CONFIRMED_ACTIONABLE=1 (default; when outcome is actionable, additionally requires at least one u64_pipeline_actionable_confirmed row in post-eval snapshots; in live mode this also requires HARRIER_BENCH_EVAL_RUN_STATUS_BOARD=1 and HARRIER_BENCH_EVAL_RUN_POST_EVAL_STATUS_BOARD=1)
- HARRIER_BENCH_EVAL_REQUIRE_LABEL_ROW=1 (default; in live mode requires at least one exact-label status-board row, otherwise sets eval_outcome=status_board_missing_label; requires HARRIER_BENCH_EVAL_RUN_STATUS_BOARD=1)
- HARRIER_BENCH_EVAL_LABEL_FILTER_EXACT=0 (revert to substring label matching for snapshots)
- HARRIER_BENCH_EVAL_RECHECK_ON_ACTIONABLE=0 (disable automatic actionable recheck)
- HARRIER_BENCH_EVAL_RECHECK_LABEL_SUFFIX=-recheck
- HARRIER_BENCH_EVAL_VALIDATE_PROMOTION_PARITY=1 (default; validates compact promotion-filter parity via scripts/check_autotriage_promotion_parity.py before pipeline/eval execution; set to 0 to skip this preflight guard)
- HARRIER_BENCH_EVAL_CANDIDATE_LABEL_BASE=<label-base> (optional metadata override used by wrappers to keep grouped eval streaks under one base label)
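Several of these controls depend on each other in live mode, so a quick pre-check of the environment can catch combinations that cannot succeed. This Python sketch is a hypothetical helper, not part of the repository. It assumes that the status-board toggles default to 1 and dry-run defaults to 0 when unset, and it skips all checks in dry-run mode because the dependencies above are stated for live mode only.

```python
from typing import List, Mapping


def check_eval_env(env: Mapping[str, str]) -> List[str]:
    """Return human-readable problems for conflicting HARRIER_BENCH_EVAL_* settings."""

    def flag(name: str, default: str) -> bool:
        return env.get(name, default) == "1"

    # Assumption: the documented dependencies only apply outside dry-run mode.
    if flag("HARRIER_BENCH_EVAL_DRY_RUN", "0"):
        return []

    board = flag("HARRIER_BENCH_EVAL_RUN_STATUS_BOARD", "1")
    post_board = flag("HARRIER_BENCH_EVAL_RUN_POST_EVAL_STATUS_BOARD", "1")
    problems = []
    if flag("HARRIER_BENCH_EVAL_REQUIRE_CONFIRMED_ACTIONABLE", "1") and not (board and post_board):
        problems.append("REQUIRE_CONFIRMED_ACTIONABLE=1 needs RUN_STATUS_BOARD=1 "
                        "and RUN_POST_EVAL_STATUS_BOARD=1")
    if flag("HARRIER_BENCH_EVAL_REQUIRE_LABEL_ROW", "1") and not board:
        problems.append("REQUIRE_LABEL_ROW=1 needs RUN_STATUS_BOARD=1")
    return problems
```

Calling check_eval_env(dict(os.environ)) before launching the eval wrapper would report, for instance, that disabling the status board while leaving both require flags at their defaults cannot pass.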
To run strict eval + outcome/failure summary in one command, use:
bash scripts/bench_lookup_u64_candidate_autotriage.sh
You can also pass a label as the first positional argument (equivalent to
setting HARRIER_BENCH_CANDIDATE_LABEL):
bash scripts/bench_lookup_u64_candidate_autotriage.sh u64-my-candidateThis wrapper defaults to strict gates (require_actionable=1, confirmed
actionable, exact-label filtering, actionable recheck), runs the eval wrapper,
then prints:
- latest exact-label eval row
- exact-label attributed failure-reason summary
- optional verbose row tables toggle
  (`HARRIER_BENCH_AUTOTRIAGE_SHOW_ROW_SECTIONS=1`; set to `0` to keep summaries while skipping high-volume row sections in both base-history and global boards)
  - when row sections are disabled, compact candidate output still prints first-eval pipeline/recheck verdict-gate runs-completed status-board promotion slices (`*_known`, `*_unknown`, `*_eq3`, `*_eq4`, `*_in_range` with explicit `[3,4]` bounds) plus first-eval-actionable-then-failed pipeline/recheck runs-completed slices with the same compact split set, and still prints first-eval pipeline/recheck verdict-gate early-stop status-board promotion slices and first-eval-actionable-then-failed verdict-gate early-stop slices for pipeline/recheck (including `*_inconclusive`/`*_reject`/`*_unknown` subsets)
- optional extended summaries toggle
  (`HARRIER_BENCH_AUTOTRIAGE_SHOW_EXTENDED_SUMMARIES=1`; set to `0` to keep only compact summary subsets in base-history/global boards before row-table controls are applied)
  - when extended summaries are disabled, compact global output still prints first-eval pipeline/recheck verdict-gate runs-completed status-board promotion boards (`*_known`, `*_unknown`, `*_eq3`, `*_eq4`, `*_in_range` with explicit `[3,4]` bounds) plus matching first-eval-actionable-then-failed pipeline/recheck runs-completed promotion boards, in addition to compact early-stop boards (including `*_inconclusive`/`*_reject`/`*_unknown` subsets)
- compact-output mode is enabled by default
  (`HARRIER_BENCH_AUTOTRIAGE_COMPACT_OUTPUT=1`; shorthand that forces both `HARRIER_BENCH_AUTOTRIAGE_SHOW_EXTENDED_SUMMARIES=0` and `HARRIER_BENCH_AUTOTRIAGE_SHOW_ROW_SECTIONS=0`; set it to `0` to disable the compact default)
- promotion-filter parity validation is enabled by default before eval runs
  (`HARRIER_BENCH_AUTOTRIAGE_VALIDATE_PROMOTION_PARITY=1`; runs `python3 scripts/check_autotriage_promotion_parity.py` and aborts wrapper execution on parity drift; set it to `0` to skip this guardrail). The checker verifies compact first-eval/eval run-count + early-stop family coverage across candidate/global boards, status-board promotion-filter implementation, README family markers, and wrapper parity-guard wiring.
- run-count summaries default to known-metadata rows only
  (`HARRIER_BENCH_AUTOTRIAGE_RUN_COUNT_INCLUDE_UNKNOWN=0`; set to `1` to include legacy rows where verdict-gate run-count fields are missing and therefore summarized as `unknown`)
- optional eval failure-family run-count range overlays
  (`HARRIER_BENCH_AUTOTRIAGE_EVAL_PIPELINE_FAILED_PIPELINE_RUNS_MIN`, `..._PIPELINE_RUNS_MAX`, `HARRIER_BENCH_AUTOTRIAGE_EVAL_ACTIONABLE_RECHECK_FAILED_RECHECK_RUNS_MIN`, `..._RECHECK_RUNS_MAX`; when set, compact base-label/global summaries and candidate/global row sections include configured-range views for eval `pipeline_failed*` pipeline and `actionable_recheck_failed*` recheck verdict-gate run-count cohorts)
- optional first-eval-actionable-then-failed run-count range overlays
  (`HARRIER_BENCH_AUTOTRIAGE_FIRST_EVAL_ACTIONABLE_THEN_FAILED_PIPELINE_RUNS_MIN`, `..._PIPELINE_RUNS_MAX`, `..._RECHECK_RUNS_MIN`, `..._RECHECK_RUNS_MAX`; when set, compact base-label summaries and extended global summaries include extra configured-range boards for first-eval actionable-then-failed pipeline and recheck verdict-gate run-count distributions)
- optional first-eval failure-family run-count range overlays
  (`HARRIER_BENCH_AUTOTRIAGE_FIRST_EVAL_PIPELINE_FAILED_PIPELINE_RUNS_MIN`, `..._PIPELINE_RUNS_MAX`, `HARRIER_BENCH_AUTOTRIAGE_FIRST_EVAL_ACTIONABLE_RECHECK_FAILED_RECHECK_RUNS_MIN`, `..._RECHECK_RUNS_MAX`; when set, compact base-label summaries plus candidate/global row sections include configured-range views for first-eval `pipeline_failed*` pipeline and `actionable_recheck_failed*` recheck verdict-gate run-count cohorts)
- optional base-label history summary (`HARRIER_BENCH_AUTOTRIAGE_SHOW_BASE_HISTORY=1`)
  - prints eval-outcome counts and eval failure-family counts for the same normalized base label (failure-family summary is failed rows only)
  - additionally prints eval-side pipeline low-load-skipped failure-family counts for that base label (`--only-pipeline-low-load-skipped`)
  - additionally prints eval-side pipeline low-load-skipped lookup-not-keep failure-family counts for that base label (`--only-pipeline-low-load-skipped-lookup-not-keep`)
  - additionally prints eval-side pipeline verdict-gate early-stop verdict counts for that base label (`--only-pipeline-verdict-gate-early-stopped --summary-key pipeline_lookup_verdict_gate_early_stop_verdict`)
  - additionally prints eval-side pipeline verdict-gate runs-completed distributions for that base label (`--summary-key pipeline_lookup_verdict_gate_runs_completed` with `--only-pipeline-verdict-gate-runs-completed-known` by default; set `HARRIER_BENCH_AUTOTRIAGE_RUN_COUNT_INCLUDE_UNKNOWN=1` to include `unknown` buckets)
  - additionally prints eval-side recheck verdict-gate early-stop verdict counts for that base label (`--only-recheck-verdict-gate-early-stopped --summary-key recheck_pipeline_lookup_verdict_gate_early_stop_verdict`)
  - additionally prints eval-side recheck verdict-gate runs-completed distributions for that base label (`--summary-key recheck_pipeline_lookup_verdict_gate_runs_completed` with `--only-recheck-verdict-gate-runs-completed-known` by default; set `HARRIER_BENCH_AUTOTRIAGE_RUN_COUNT_INCLUDE_UNKNOWN=1` to include `unknown` buckets)
  - additionally prints eval-side `pipeline_failed*` verdict-gate early-stop verdict counts for that base label (`--only-pipeline-failed-verdict-gate-early-stopped --summary-key pipeline_lookup_verdict_gate_early_stop_verdict`)
  - additionally prints eval-side `pipeline_failed*` verdict-gate runs-completed distributions for that base label (`--only-pipeline-failed-verdict-gate-runs-completed-known --summary-key pipeline_lookup_verdict_gate_runs_completed`)
  - additionally prints eval-side `pipeline_failed*` verdict-gate runs-completed unknown-only summaries for that base label (`--only-pipeline-failed-verdict-gate-runs-completed-unknown`)
  - additionally prints eval-side `pipeline_failed*` verdict-gate runs-completed exact-3 and exact-4 summaries for that base label (`--pipeline-verdict-gate-runs-completed-eq {3,4}` scoped with `--only-pipeline-failed-verdict-gate-runs-completed-known`)
  - additionally prints eval-side `pipeline_failed*` verdict-gate runs-completed configured-range summaries for that base label when `HARRIER_BENCH_AUTOTRIAGE_EVAL_PIPELINE_FAILED_PIPELINE_RUNS_{MIN,MAX}` is set
  - additionally prints eval-side `pipeline_failed*` verdict-gate early-stop inconclusive/reject/unknown focused summaries for that base label (`--only-pipeline-failed-verdict-gate-early-stopped-{inconclusive,reject,unknown}`)
  - additionally prints eval-side `actionable_recheck_failed*` verdict-gate early-stop verdict counts for that base label (`--only-actionable-recheck-failed-verdict-gate-early-stopped --summary-key recheck_pipeline_lookup_verdict_gate_early_stop_verdict`)
  - additionally prints eval-side `actionable_recheck_failed*` recheck verdict-gate runs-completed distributions for that base label (`--only-actionable-recheck-failed-verdict-gate-runs-completed-known --summary-key recheck_pipeline_lookup_verdict_gate_runs_completed`)
  - additionally prints eval-side `actionable_recheck_failed*` recheck verdict-gate runs-completed unknown-only summaries for that base label (`--only-actionable-recheck-failed-verdict-gate-runs-completed-unknown`)
  - additionally prints eval-side `actionable_recheck_failed*` recheck verdict-gate runs-completed exact-3 and exact-4 summaries for that base label (`--recheck-verdict-gate-runs-completed-eq {3,4}` scoped with `--only-actionable-recheck-failed-verdict-gate-runs-completed-known`)
  - additionally prints eval-side `actionable_recheck_failed*` recheck verdict-gate runs-completed configured-range summaries for that base label when `HARRIER_BENCH_AUTOTRIAGE_EVAL_ACTIONABLE_RECHECK_FAILED_RECHECK_RUNS_{MIN,MAX}` is set
  - additionally prints eval-side `actionable_recheck_failed*` verdict-gate early-stop inconclusive/reject/unknown focused summaries for that base label (`--only-actionable-recheck-failed-verdict-gate-early-stopped-{inconclusive,reject,unknown}`)
  - additionally prints first-eval (autotriage sidecar) pipeline verdict-gate
    early-stop verdict counts for failed first-eval outcomes on that base label (`--only-first-eval-pipeline-verdict-gate-early-stopped --summary-key pipeline_lookup_verdict_gate_early_stop_verdict`)
  - additionally prints first-eval (autotriage sidecar) pipeline verdict-gate early-stop inconclusive/reject/unknown focused summaries for failed first-eval outcomes on that base label (`--only-first-eval-pipeline-verdict-gate-early-stopped-{inconclusive,reject,unknown}`)
  - additionally prints first-eval (autotriage sidecar) `pipeline_failed*` verdict-gate runs-completed distributions for that base label (`--only-first-eval-pipeline-verdict-gate-runs-completed-known --summary-key pipeline_lookup_verdict_gate_runs_completed`; known-only by default unless `HARRIER_BENCH_AUTOTRIAGE_RUN_COUNT_INCLUDE_UNKNOWN=1`)
  - additionally prints first-eval (autotriage sidecar) `pipeline_failed*` verdict-gate runs-completed unknown-only summaries for that base label (`--only-first-eval-pipeline-verdict-gate-runs-completed-unknown`)
  - additionally prints first-eval (autotriage sidecar) `pipeline_failed*` verdict-gate runs-completed exact-3 and exact-4 summaries for that base label (`--only-first-eval-pipeline-verdict-gate-runs-completed-eq {3,4}`)
  - additionally prints first-eval (autotriage sidecar) `pipeline_failed*` verdict-gate runs-completed configured-range summaries for that base label when `HARRIER_BENCH_AUTOTRIAGE_FIRST_EVAL_PIPELINE_FAILED_PIPELINE_RUNS_{MIN,MAX}` is set
  - additionally prints first-eval (autotriage sidecar) recheck verdict-gate early-stop verdict counts for actionable-recheck-failed first-eval outcomes on that base label (`--only-first-eval-recheck-verdict-gate-early-stopped --summary-key recheck_pipeline_lookup_verdict_gate_early_stop_verdict`)
  - additionally prints first-eval (autotriage sidecar) recheck verdict-gate early-stop inconclusive/reject/unknown focused summaries for actionable-recheck-failed first-eval outcomes on that base label (`--only-first-eval-recheck-verdict-gate-early-stopped-{inconclusive,reject,unknown}`)
  - additionally prints first-eval (autotriage sidecar) `actionable_recheck_failed*` recheck verdict-gate runs-completed distributions for that base label (`--only-first-eval-recheck-verdict-gate-runs-completed-known --summary-key recheck_pipeline_lookup_verdict_gate_runs_completed`; known-only by default unless `HARRIER_BENCH_AUTOTRIAGE_RUN_COUNT_INCLUDE_UNKNOWN=1`)
  - additionally prints first-eval (autotriage sidecar) `actionable_recheck_failed*` recheck verdict-gate runs-completed unknown-only summaries for that base label (`--only-first-eval-recheck-verdict-gate-runs-completed-unknown`)
  - additionally prints first-eval (autotriage sidecar) `actionable_recheck_failed*` recheck verdict-gate runs-completed exact-3 and exact-4 summaries for that base label (`--only-first-eval-recheck-verdict-gate-runs-completed-eq {3,4}`)
  - additionally prints first-eval (autotriage sidecar) `actionable_recheck_failed*` recheck verdict-gate runs-completed configured-range summaries for that base label when `HARRIER_BENCH_AUTOTRIAGE_FIRST_EVAL_ACTIONABLE_RECHECK_FAILED_RECHECK_RUNS_{MIN,MAX}` is set
  - additionally prints confirmed actionable eval rows for that base label
    (`eval_outcome=actionable` with confirmed actionable rows > 0)
  - additionally prints confirmed actionable autotriage rows for that base label (`latest_eval_outcome=actionable` with `latest_confirmed_actionable_rows > 0`)
  - additionally prints autotriage first-eval-actionable-then-failed distribution for that base label in compact mode (`--summary-key first_eval_actionable_then_failed`)
  - additionally prints autotriage first-eval-actionable-then-failed pipeline/recheck verdict-gate runs-completed distributions for that base label in compact mode (`--summary-key {pipeline_lookup_verdict_gate_runs_completed,recheck_pipeline_lookup_verdict_gate_runs_completed}`; defaults to known-only `--only-first-eval-actionable-then-failed-{pipeline,recheck}-verdict-gate-runs-completed-known`, and widens to `--only-first-eval-actionable-then-failed-yes` when `HARRIER_BENCH_AUTOTRIAGE_RUN_COUNT_INCLUDE_UNKNOWN=1`)
  - additionally prints autotriage first-eval-actionable-then-failed pipeline/recheck verdict-gate runs-completed unknown-only summaries for that base label in compact mode (`--only-first-eval-actionable-then-failed-{pipeline,recheck}-verdict-gate-runs-completed-unknown`)
  - additionally prints autotriage first-eval-actionable-then-failed pipeline/recheck verdict-gate runs-completed exact-3 and exact-4 summaries for that base label in compact mode (`--only-first-eval-actionable-then-failed-{pipeline,recheck}-verdict-gate-runs-completed-eq {3,4}`)
  - additionally prints autotriage first-eval-actionable-then-failed pipeline/recheck verdict-gate runs-completed configured-range summaries for that base label in compact mode when `HARRIER_BENCH_AUTOTRIAGE_FIRST_EVAL_ACTIONABLE_THEN_FAILED_{PIPELINE,RECHECK}_RUNS_{MIN,MAX}` is set
  - additionally prints autotriage first-eval-actionable-then-failed pipeline/recheck verdict-gate early-stop verdict distributions (plus explicit inconclusive/reject/unknown splits) for that base label in compact mode (`--only-first-eval-actionable-then-failed-{pipeline,recheck}-verdict-gate-early-stopped{,-inconclusive,-reject,-unknown}`)
  - additionally prints autotriage pipeline verdict-gate runs-completed distribution for that base label in compact mode (`--summary-key pipeline_lookup_verdict_gate_runs_completed`; default uses `--only-pipeline-verdict-gate-runs-completed-known`, and drops the known filter when `HARRIER_BENCH_AUTOTRIAGE_RUN_COUNT_INCLUDE_UNKNOWN=1`)
  - additionally prints autotriage recheck verdict-gate runs-completed distribution for that base label in compact mode (`--summary-key recheck_pipeline_lookup_verdict_gate_runs_completed`; default uses `--only-recheck-verdict-gate-runs-completed-known`, and drops the known filter when `HARRIER_BENCH_AUTOTRIAGE_RUN_COUNT_INCLUDE_UNKNOWN=1`)
  - additionally prints autotriage yes-only first-eval-actionable-then-failed run-outcome-sequence distribution for that base label in compact mode (`--only-first-eval-actionable-then-failed-yes --summary-key run_outcome_sequence`)
  - additionally prints autotriage confirmation-state counts for that base
    label (`single_run_complete`, `single_run_incomplete`, `multi_run_complete`, `multi_run_incomplete`)
  - additionally prints autotriage confirmation-floor counts for that base label (`confirmation_floor_applied=yes|no|unknown`)
  - additionally prints latest-failure-family pipeline-lookup-not-keep counts for that base label (`latest_failure_family=pipeline_lookup_not_keep`)
  - additionally prints latest-failure-family pipeline-low-load-hit-geo-regression counts for that base label (`latest_failure_family=pipeline_low_load_hit_geo_regression`)
  - additionally prints latest-failure-family pipeline-lookup-not-keep+low-load-hit-geo-regression counts for that base label (`latest_failure_family=pipeline_lookup_not_keep+pipeline_low_load_hit_geo_regression`)
  - additionally prints latest-failure-family recheck-lookup-not-keep counts for that base label (`latest_failure_family=recheck_lookup_not_keep`)
  - additionally prints latest-failure-family recheck-low-load-hit-geo-regression counts for that base label (`latest_failure_family=recheck_low_load_hit_geo_regression`)
  - additionally prints latest-failure-family recheck-lookup-not-keep+low-load-hit-geo-regression counts for that base label (`latest_failure_family=recheck_lookup_not_keep+low_load_hit_geo_regression`)
  - additionally prints autotriage streak-instability counts for that base label (`streak_actionable_before_failure=yes|no|unknown`)
  - additionally prints first-eval-outcome counts for that base label
    (`first_eval_outcome`)
  - additionally prints first-eval-failed counts for that base label (`first_eval_outcome` starts with `pipeline_failed` or `actionable_recheck_failed`)
  - additionally prints first-eval-lookup-not-keep counts for that base label (`first_eval_outcome` contains `lookup_not_keep`)
  - additionally prints first-eval-low-load-hit-geo-regression counts for that base label (`first_eval_outcome` contains `low_load_hit_geo_regression`)
  - additionally prints first-eval actionable counts for that base label (`first_eval_outcome=actionable`)
  - additionally prints first-eval actionable-recheck-failed counts for that base label (`first_eval_outcome=actionable_recheck_failed*`)
  - additionally prints first-eval actionable-recheck-failed-lookup-not-keep counts for that base label (`first_eval_outcome=actionable_recheck_failed_lookup_not_keep`)
  - additionally prints first-eval actionable-recheck-failed-low-load-hit-geo-regression counts for that base label (`first_eval_outcome=actionable_recheck_failed_low_load_hit_geo_regression`)
  - additionally prints first-eval actionable-recheck-failed lookup-not-keep+low-load-hit-geo-regression counts for that base label (`first_eval_outcome=actionable_recheck_failed_lookup_not_keep_and_low_load_hit_geo_regression`)
  - additionally prints first-eval pipeline-failed-lookup-not-keep counts for that base label (`first_eval_outcome=pipeline_failed_lookup_not_keep`)
  - additionally prints first-eval pipeline-failed counts for that base label (`first_eval_outcome=pipeline_failed*`)
  - additionally prints first-eval pipeline-failed-low-load-hit-geo-regression counts for that base label (`first_eval_outcome=pipeline_failed_low_load_hit_geo_regression`)
  - additionally prints first-eval pipeline-failed lookup-not-keep+low-load-hit-geo-regression counts for that base label (`first_eval_outcome=pipeline_failed_lookup_not_keep_and_low_load_hit_geo_regression`)
  - additionally prints first-eval-actionable-then-failed counts for that base label (derived from first-eval outcomes that prove a run reached actionable state and still failed overall: either `first_eval_outcome=actionable` with non-zero eval exit status, or `first_eval_outcome=actionable_recheck_failed*`)
  - additionally prints first-eval-actionable-then-failed known-only counts for that base label (excludes legacy `unknown` rows to focus on instrumented streak telemetry)
  - additionally prints first-eval-actionable-then-failed unknown-only counts for that base label (tracks telemetry-coverage lag where first-eval fields were not yet populated)
  - additionally prints first-eval-actionable-then-failed yes-only counts for that base label (instrumented rows where run 1 was actionable and later confirmation runs failed)
  - additionally prints first-eval-actionable-then-failed no-only counts for that base label (instrumented rows where run 1 was not actionable)
  - additionally prints eval `pipeline_failed*` pipeline verdict-gate runs-completed rows for that base label: scoped (known-only by default, widens to `--outcome-filter pipeline_failed_lookup_not_keep,pipeline_failed_low_load_hit_geo_regression,pipeline_failed_lookup_not_keep_and_low_load_hit_geo_regression` when `HARRIER_BENCH_AUTOTRIAGE_RUN_COUNT_INCLUDE_UNKNOWN=1`), explicit known-only, unknown-only, and exact-3/exact-4 splits
  - additionally prints configured-range eval `pipeline_failed*` pipeline verdict-gate runs-completed rows for that base label when `HARRIER_BENCH_AUTOTRIAGE_EVAL_PIPELINE_FAILED_PIPELINE_RUNS_{MIN,MAX}` is set
  - additionally prints eval `pipeline_failed*` pipeline verdict-gate early-stopped rows (plus inconclusive/reject splits) for that base label
  - additionally prints eval pipeline/recheck verdict-gate early-stop unknown-only rows for that base label (non-family-scoped unknown verdict coverage via `--only-{pipeline,recheck}-early-stop-unknown`)
  - additionally prints eval `actionable_recheck_failed*` recheck verdict-gate runs-completed rows for that base label: scoped (known-only by default, widens to `--outcome-filter actionable_recheck_failed_lookup_not_keep,actionable_recheck_failed_low_load_hit_geo_regression,actionable_recheck_failed_lookup_not_keep_and_low_load_hit_geo_regression` when `HARRIER_BENCH_AUTOTRIAGE_RUN_COUNT_INCLUDE_UNKNOWN=1`), explicit known-only, unknown-only, and exact-3/exact-4 splits
  - additionally prints configured-range eval `actionable_recheck_failed*` recheck verdict-gate runs-completed rows for that base label when `HARRIER_BENCH_AUTOTRIAGE_EVAL_ACTIONABLE_RECHECK_FAILED_RECHECK_RUNS_{MIN,MAX}` is set
  - additionally prints eval `actionable_recheck_failed*` recheck verdict-gate early-stopped rows (plus inconclusive/reject splits) for that base label
  - additionally prints autotriage pipeline/recheck verdict-gate early-stop unknown-only rows for that base label (non-family-scoped unknown verdict coverage via `--only-{pipeline,recheck}-early-stop-unknown`)
  - additionally prints streak-instability rows for that base label (`streak_actionable_before_failure=yes`)
  - additionally prints first-eval-actionable-then-failed rows for that base label (rows where first eval reached actionable state and still failed, including `actionable_recheck_failed*` first-eval outcomes)
  - additionally prints known-only first-eval-actionable-then-failed rows for that base label (all `yes|no` rows, excluding `unknown`)
  - additionally prints unknown-only first-eval-actionable-then-failed rows for that base label (`first_eval_actionable_then_failed=unknown`)
  - additionally prints yes-only first-eval-actionable-then-failed rows for that base label (`first_eval_actionable_then_failed=yes`)
  - additionally prints no-only first-eval-actionable-then-failed rows for that base label (`first_eval_actionable_then_failed=no`)
  - additionally prints first-eval-actionable-then-failed pipeline/recheck verdict-gate runs-completed rows for that base label: scoped (known-only by default, widens to yes-only scope when `HARRIER_BENCH_AUTOTRIAGE_RUN_COUNT_INCLUDE_UNKNOWN=1`), explicit known-only, unknown-only, and exact-3/exact-4 splits
  - additionally prints configured-range first-eval-actionable-then-failed pipeline/recheck verdict-gate runs-completed rows for that base label when `HARRIER_BENCH_AUTOTRIAGE_FIRST_EVAL_ACTIONABLE_THEN_FAILED_{PIPELINE,RECHECK}_RUNS_{MIN,MAX}` is set
  - additionally prints yes-only first-eval-actionable-then-failed rows for that base label where pipeline/recheck verdict-gate metadata early-stopped (plus explicit inconclusive/reject row splits)
  - additionally prints first-eval actionable-recheck-failed rows for that base label (`first_eval_outcome=actionable_recheck_failed*`)
  - additionally prints first-eval-failed rows for that base label (`first_eval_outcome` starts with `pipeline_failed` or `actionable_recheck_failed`)
  - additionally prints first-eval `pipeline_failed*` pipeline verdict-gate runs-completed rows for that base label: scoped (known-only by default, widens to `--first-eval-outcome-filter pipeline_failed` when `HARRIER_BENCH_AUTOTRIAGE_RUN_COUNT_INCLUDE_UNKNOWN=1`), explicit known-only, unknown-only, and exact-3/exact-4 splits
  - additionally prints configured-range first-eval `pipeline_failed*` pipeline verdict-gate runs-completed rows for that base label when `HARRIER_BENCH_AUTOTRIAGE_FIRST_EVAL_PIPELINE_FAILED_PIPELINE_RUNS_{MIN,MAX}` is set
  - additionally prints first-eval `actionable_recheck_failed*` recheck verdict-gate runs-completed rows for that base label: scoped (known-only by default, widens to `--first-eval-outcome-filter actionable_recheck_failed` when `HARRIER_BENCH_AUTOTRIAGE_RUN_COUNT_INCLUDE_UNKNOWN=1`), explicit known-only, unknown-only, and exact-3/exact-4 splits
  - additionally prints configured-range first-eval `actionable_recheck_failed*` recheck verdict-gate runs-completed rows for that base label when `HARRIER_BENCH_AUTOTRIAGE_FIRST_EVAL_ACTIONABLE_RECHECK_FAILED_RECHECK_RUNS_{MIN,MAX}` is set
  - additionally prints first-eval-lookup-not-keep rows for that base label (`first_eval_outcome` contains `lookup_not_keep`)
  - additionally prints first-eval-low-load-hit-geo-regression rows for that base label (`first_eval_outcome` contains `low_load_hit_geo_regression`)
  - additionally prints first-eval actionable-recheck-failed-lookup-not-keep rows for that base label (`first_eval_outcome=actionable_recheck_failed_lookup_not_keep`)
  - additionally prints first-eval actionable-recheck-failed-low-load-hit-geo-regression rows for that base label (`first_eval_outcome=actionable_recheck_failed_low_load_hit_geo_regression`)
  - additionally prints first-eval actionable-recheck-failed lookup-not-keep+low-load-hit-geo-regression rows for that base label (`first_eval_outcome=actionable_recheck_failed_lookup_not_keep_and_low_load_hit_geo_regression`)
  - additionally prints first-eval actionable rows for that base label (`first_eval_outcome=actionable`)
  - additionally prints first-eval pipeline-failed-lookup-not-keep rows for that base label (`first_eval_outcome=pipeline_failed_lookup_not_keep`)
  - additionally prints first-eval pipeline-failed rows for that base label (`first_eval_outcome=pipeline_failed*`)
  - additionally prints first-eval pipeline-failed-low-load-hit-geo-regression rows for that base label (`first_eval_outcome=pipeline_failed_low_load_hit_geo_regression`)
  - additionally prints first-eval pipeline-failed lookup-not-keep+low-load-hit-geo-regression rows for that base label (`first_eval_outcome=pipeline_failed_lookup_not_keep_and_low_load_hit_geo_regression`)
  - additionally prints latest-failure-family pipeline-lookup-not-keep rows
    for that base label (`latest_failure_family=pipeline_lookup_not_keep`)
  - additionally prints latest-failure-family pipeline-low-load-hit-geo-regression rows for that base label (`latest_failure_family=pipeline_low_load_hit_geo_regression`)
  - additionally prints latest-failure-family pipeline-lookup-not-keep+low-load-hit-geo-regression rows for that base label (`latest_failure_family=pipeline_lookup_not_keep+pipeline_low_load_hit_geo_regression`)
  - additionally prints latest-failure-family recheck-lookup-not-keep rows for that base label (`latest_failure_family=recheck_lookup_not_keep`)
  - additionally prints latest-failure-family recheck-low-load-hit-geo-regression rows for that base label (`latest_failure_family=recheck_low_load_hit_geo_regression`)
  - additionally prints latest-failure-family recheck-lookup-not-keep+low-load-hit-geo-regression rows for that base label (`latest_failure_family=recheck_lookup_not_keep+low_load_hit_geo_regression`)
  - additionally prints status-board pipeline/recheck verdict-gate early-stop
    unknown rows for that base label (via `bench_candidate_status_board.py --only-{pipeline,recheck}-early-stop-unknown`)
  - additionally prints status-board pipeline/recheck verdict-gate runs-completed known/unknown rows for that base label (via `bench_candidate_status_board.py --only-{pipeline,recheck}-verdict-gate-runs-completed-{known,unknown}`)
  - additionally prints status-board pipeline/recheck verdict-gate runs-completed exact-3/exact-4 rows for that base label (via `bench_candidate_status_board.py --only-{pipeline,recheck}-verdict-gate-runs-completed-eq {3,4}`)
  - additionally prints status-board pipeline/recheck verdict-gate runs-completed in-range [3,4] rows for that base label (via `bench_candidate_status_board.py --only-{pipeline,recheck}-verdict-gate-runs-completed-min 3 --only-{pipeline,recheck}-verdict-gate-runs-completed-max 4`)
  - additionally prints status-board pipeline lookup-gate runs-completed promotion rows for known/unknown/eq3/eq4 and in-range [3,4] slices for that base label (via `--promotion-filter u64_pipeline_lookup_gate_runs_completed_{known,unknown,eq3,eq4,in_range}` plus in-range bounds when using `_in_range`)
  - additionally prints status-board autotriage pipeline/recheck verdict-gate runs-completed promotion rows for known/unknown/eq3/eq4 and in-range [3,4] slices for that base label (via `--promotion-filter u64_autotriage_{pipeline,recheck}_verdict_gate_runs_completed_{known,unknown,eq3,eq4,in_range}` plus in-range bounds when using `_in_range`)
  - additionally prints status-board autotriage first-eval pipeline/recheck verdict-gate runs-completed promotion rows for known/unknown/eq3/eq4 and in-range [3,4] slices for that base label (via `--promotion-filter u64_autotriage_first_eval_pipeline_verdict_gate_runs_completed_{known,unknown,eq3,eq4,in_range}` or `u64_autotriage_first_eval_recheck_pipeline_verdict_gate_runs_completed_{known,unknown,eq3,eq4,in_range}` plus first-eval in-range bounds when using `_in_range`)
  - additionally prints status-board autotriage first-eval-actionable-recheck-failed recheck verdict-gate runs-completed promotion rows for known/unknown/eq3/eq4 and in-range [3,4] slices for that base label (via `--promotion-filter u64_autotriage_first_eval_actionable_recheck_failed_verdict_gate_runs_completed_{known,unknown,eq3,eq4,in_range}` plus first-eval recheck in-range bounds when using `_in_range`)
  - additionally prints status-board eval actionable-recheck-failed recheck verdict-gate runs-completed promotion rows for known/unknown/eq3/eq4 and in-range [3,4] slices for that base label (via `--promotion-filter u64_eval_actionable_recheck_failed_verdict_gate_runs_completed_{known,unknown,eq3,eq4,in_range}` plus eval actionable-recheck-failed recheck in-range bounds when using `_in_range`)
  - additionally prints status-board autotriage first-eval-actionable-then-failed pipeline/recheck verdict-gate runs-completed promotion rows for known/unknown/eq3/eq4 and in-range [3,4] slices for that base label (via `--promotion-filter u64_autotriage_first_eval_actionable_then_failed_{pipeline,recheck}_verdict_gate_runs_completed_{known,unknown,eq3,eq4,in_range}` plus first-eval-actionable-then-failed in-range bounds when using `_in_range`)
  - additionally prints status-board autotriage first-eval-actionable-then-failed pipeline/recheck verdict-gate early-stop promotion rows for `{early_stopped,inconclusive,reject,unknown}` slices for that base label (via `--promotion-filter u64_autotriage_first_eval_actionable_then_failed_{pipeline,recheck}_verdict_gate_early_stopped{,_inconclusive,_reject,_unknown}`)
  - additionally prints status-board autotriage first-eval pipeline/recheck verdict-gate early-stop promotion rows for `{early_stopped,inconclusive,reject,unknown}` slices for that base label (via `--promotion-filter u64_autotriage_first_eval_{pipeline,recheck_pipeline}_verdict_gate_early_stopped{,_inconclusive,_reject,_unknown}`)
  - additionally prints status-board autotriage first-eval-actionable-recheck-failed recheck verdict-gate early-stop promotion rows for `{early_stopped,inconclusive,reject,unknown}` slices for that base label (via `--promotion-filter u64_autotriage_first_eval_actionable_recheck_failed_verdict_gate_early_stopped{,_inconclusive,_reject,_unknown}`)
  - additionally prints status-board eval actionable-recheck-failed recheck verdict-gate early-stop promotion rows for `{early_stopped,inconclusive,reject,unknown}` slices for that base label (via `--promotion-filter u64_eval_actionable_recheck_failed_verdict_gate_early_stopped{,_inconclusive,_reject,_unknown}`)
  - additionally prints status-board promotion-filter rows for that base label across pipeline/eval/autotriage early-stop unknown cohorts (via `bench_candidate_status_board.py --promotion-filter u64_{pipeline_lookup_gate,eval_pipeline_verdict_gate,eval_recheck_pipeline_verdict_gate,autotriage_pipeline_verdict_gate,autotriage_recheck_pipeline_verdict_gate}_early_stopped_unknown`)
  - summaries are emitted after sidecar write, so base-history views include the just-produced autotriage row
- optional global failed-reason board over latest label-base rows
(
HARRIER_BENCH_AUTOTRIAGE_SHOW_GLOBAL_FAILED_REASON_BOARD=1)- prints eval-side failure reasons, autotriage failure families, plus autotriage confirmation-state, confirmation-floor, and streak-actionable-before-failure distributions
- additionally prints latest-failure-family pipeline-lookup-not-keep distribution (latest per base)
- additionally prints latest-failure-family pipeline-low-load-hit-geo-regression distribution (latest per base)
- additionally prints latest-failure-family pipeline-lookup-not-keep+low-load-hit-geo-regression distribution (latest per base)
- additionally prints latest-failure-family recheck-lookup-not-keep distribution (latest per base)
- additionally prints latest-failure-family recheck-low-load-hit-geo-regression distribution (latest per base)
- additionally prints latest-failure-family recheck-lookup-not-keep+low-load-hit-geo-regression distribution (latest per base)
- additionally prints first-eval-outcome distribution
- additionally prints first-eval-failed distribution (latest-label-base rows where first eval failed in pipeline stage or actionable recheck)
- additionally prints first-eval pipeline verdict-gate early-stop verdict distribution (latest-label-base rows where first eval failed and pipeline verdict-gate metadata reports early-stop)
- additionally prints first-eval pipeline verdict-gate early-stop inconclusive/reject/unknown focused distributions (latest-label-base rows where first eval failed and pipeline verdict-gate metadata reports those verdicts)
- additionally prints first-eval recheck verdict-gate early-stop verdict distribution (latest-label-base rows where first eval is `actionable_recheck_failed*` and recheck verdict-gate metadata reports early-stop)
- additionally prints first-eval recheck verdict-gate early-stop inconclusive/reject/unknown focused distributions (latest-label-base rows where first eval is `actionable_recheck_failed*` and recheck verdict-gate metadata reports those verdicts)
- additionally prints first-eval `pipeline_failed*` verdict-gate runs-completed distributions (latest-label-base rows filtered with `--only-first-eval-pipeline-verdict-gate-runs-completed-known`; known metadata by default, or include `unknown` buckets with `HARRIER_BENCH_AUTOTRIAGE_RUN_COUNT_INCLUDE_UNKNOWN=1`)
- additionally prints first-eval `pipeline_failed*` verdict-gate runs-completed unknown-only boards (latest-label-base rows filtered with `--only-first-eval-pipeline-verdict-gate-runs-completed-unknown`)
- additionally prints first-eval `pipeline_failed*` verdict-gate runs-completed exact-3 and exact-4 boards (latest-label-base rows filtered with `--only-first-eval-pipeline-verdict-gate-runs-completed-eq {3,4}`)
- additionally prints first-eval `actionable_recheck_failed*` recheck verdict-gate runs-completed distributions (latest-label-base rows filtered with `--only-first-eval-recheck-verdict-gate-runs-completed-known`; known metadata by default, or include `unknown` buckets with `HARRIER_BENCH_AUTOTRIAGE_RUN_COUNT_INCLUDE_UNKNOWN=1`)
- additionally prints first-eval `actionable_recheck_failed*` recheck verdict-gate runs-completed unknown-only boards (latest-label-base rows filtered with `--only-first-eval-recheck-verdict-gate-runs-completed-unknown`)
- additionally prints first-eval `actionable_recheck_failed*` recheck verdict-gate runs-completed exact-3 and exact-4 boards (latest-label-base rows filtered with `--only-first-eval-recheck-verdict-gate-runs-completed-eq {3,4}`)
- additionally prints eval-side `pipeline_failed*` verdict-gate early-stop verdict distribution (latest-label-base rows where eval outcome starts with `pipeline_failed` and pipeline verdict-gate metadata reports early-stop)
- additionally prints eval-side `pipeline_failed*` verdict-gate runs-completed distributions (latest-label-base rows where eval outcome starts with `pipeline_failed` and pipeline run-count metadata is known by default; set `HARRIER_BENCH_AUTOTRIAGE_RUN_COUNT_INCLUDE_UNKNOWN=1` to include `unknown` buckets)
- additionally prints eval-side `pipeline_failed*` verdict-gate runs-completed unknown-only boards (latest-label-base rows filtered with `--only-pipeline-failed-verdict-gate-runs-completed-unknown`)
- additionally prints eval-side `pipeline_failed*` verdict-gate runs-completed exact-3 and exact-4 boards (latest-label-base rows filtered with `--pipeline-verdict-gate-runs-completed-eq {3,4}`)
- additionally prints eval-side `pipeline_failed*` verdict-gate runs-completed configured-range boards (latest-label-base rows filtered by `HARRIER_BENCH_AUTOTRIAGE_EVAL_PIPELINE_FAILED_PIPELINE_RUNS_{MIN,MAX}` when set)
- additionally prints eval-side `pipeline_failed*` verdict-gate early-stop inconclusive/reject/unknown focused distributions (latest-label-base rows where eval outcome starts with `pipeline_failed` and pipeline verdict-gate metadata reports those verdict classes)
- additionally prints generic eval pipeline/recheck verdict-gate early-stop unknown-only boards (latest-label-base rows filtered with `--only-{pipeline,recheck}-early-stop-unknown`)
- additionally prints eval-side `actionable_recheck_failed*` verdict-gate early-stop verdict distribution (latest-label-base rows where eval outcome starts with `actionable_recheck_failed` and recheck verdict-gate metadata reports early-stop)
- additionally prints eval-side `actionable_recheck_failed*` recheck verdict-gate runs-completed distributions (latest-label-base rows where eval outcome starts with `actionable_recheck_failed` and recheck run-count metadata is known by default; set `HARRIER_BENCH_AUTOTRIAGE_RUN_COUNT_INCLUDE_UNKNOWN=1` to include `unknown` buckets)
- additionally prints eval-side `actionable_recheck_failed*` recheck verdict-gate runs-completed unknown-only boards (latest-label-base rows filtered with `--only-actionable-recheck-failed-verdict-gate-runs-completed-unknown`)
- additionally prints eval-side `actionable_recheck_failed*` recheck verdict-gate runs-completed exact-3 and exact-4 boards (latest-label-base rows filtered with `--recheck-verdict-gate-runs-completed-eq {3,4}`)
- additionally prints eval-side `actionable_recheck_failed*` recheck verdict-gate runs-completed configured-range boards (latest-label-base rows filtered by `HARRIER_BENCH_AUTOTRIAGE_EVAL_ACTIONABLE_RECHECK_FAILED_RECHECK_RUNS_{MIN,MAX}` when set)
- additionally prints eval-side `actionable_recheck_failed*` verdict-gate early-stop inconclusive/reject/unknown focused distributions (latest-label-base rows where eval outcome starts with `actionable_recheck_failed` and recheck verdict-gate metadata reports those verdict classes)
- additionally prints generic autotriage pipeline/recheck verdict-gate early-stop unknown-only boards (latest-label-base rows filtered with `--only-{pipeline,recheck}-early-stop-unknown`)
- additionally prints status-board promotion-filter unknown boards for pipeline/eval/autotriage verdict-gate cohorts (latest per label base)
- additionally prints status-board pipeline lookup-gate runs-completed promotion boards for known/unknown/eq3/eq4/in-range [3,4] slices (latest per label base, via `--promotion-filter u64_pipeline_lookup_gate_runs_completed_{known,unknown,eq3,eq4,in_range}` plus in-range bounds when using `_in_range`)
- additionally prints status-board pipeline/recheck verdict-gate runs-completed known/unknown boards (latest per label base, via `bench_candidate_status_board.py --only-{pipeline,recheck}-verdict-gate-runs-completed-{known,unknown}`)
- additionally prints status-board pipeline/recheck verdict-gate runs-completed exact-3/exact-4 boards (latest per label base, via `bench_candidate_status_board.py --only-{pipeline,recheck}-verdict-gate-runs-completed-eq {3,4}`)
- additionally prints status-board pipeline/recheck verdict-gate runs-completed in-range [3,4] boards (latest per label base, via `bench_candidate_status_board.py --only-{pipeline,recheck}-verdict-gate-runs-completed-min 3 --only-{pipeline,recheck}-verdict-gate-runs-completed-max 4`)
- additionally prints status-board autotriage pipeline/recheck verdict-gate runs-completed known/unknown promotion boards (latest per label base, via `bench_candidate_status_board.py --promotion-filter u64_autotriage_{pipeline,recheck}_verdict_gate_runs_completed_{known,unknown}`)
- additionally prints status-board autotriage pipeline/recheck verdict-gate runs-completed exact-3/exact-4 promotion boards (latest per label base, via `bench_candidate_status_board.py --promotion-filter u64_autotriage_{pipeline,recheck}_verdict_gate_runs_completed_{eq3,eq4}`)
- additionally prints status-board autotriage pipeline/recheck verdict-gate runs-completed in-range [3,4] promotion boards (latest per label base, via `bench_candidate_status_board.py --promotion-filter u64_autotriage_{pipeline,recheck}_verdict_gate_runs_completed_in_range --only-{pipeline,recheck}-verdict-gate-runs-completed-min 3 --only-{pipeline,recheck}-verdict-gate-runs-completed-max 4`)
- additionally prints status-board autotriage first-eval pipeline/recheck verdict-gate runs-completed known/unknown/eq3/eq4 and in-range [3,4] promotion boards (latest per label base, via `bench_candidate_status_board.py --promotion-filter u64_autotriage_first_eval_pipeline_verdict_gate_runs_completed_{known,unknown,eq3,eq4,in_range}` or `u64_autotriage_first_eval_recheck_pipeline_verdict_gate_runs_completed_{known,unknown,eq3,eq4,in_range}` plus `--only-first-eval-{pipeline,recheck}-verdict-gate-runs-completed-{min,max}` bounds when using `_in_range`)
- additionally prints status-board autotriage first-eval-actionable-recheck-failed recheck verdict-gate runs-completed known/unknown/eq3/eq4 and in-range [3,4] promotion boards (latest per label base, via `bench_candidate_status_board.py --promotion-filter u64_autotriage_first_eval_actionable_recheck_failed_verdict_gate_runs_completed_{known,unknown,eq3,eq4,in_range}` plus `--only-first-eval-recheck-verdict-gate-runs-completed-{min,max}` bounds when using `_in_range`)
- additionally prints status-board eval actionable-recheck-failed recheck verdict-gate runs-completed known/unknown/eq3/eq4 and in-range [3,4] promotion boards (latest per label base, via `bench_candidate_status_board.py --promotion-filter u64_eval_actionable_recheck_failed_verdict_gate_runs_completed_{known,unknown,eq3,eq4,in_range}` plus `--only-eval-actionable-recheck-failed-verdict-gate-runs-completed-{min,max}` bounds when using `_in_range`)
- additionally prints status-board autotriage first-eval-actionable-then-failed pipeline/recheck verdict-gate runs-completed known/unknown/eq3/eq4 and in-range [3,4] promotion boards (latest per label base, via `bench_candidate_status_board.py --promotion-filter u64_autotriage_first_eval_actionable_then_failed_{pipeline,recheck}_verdict_gate_runs_completed_{known,unknown,eq3,eq4,in_range}` plus `--only-first-eval-actionable-then-failed-{pipeline,recheck}-verdict-gate-runs-completed-{min,max}` bounds when using `_in_range`)
- additionally prints status-board autotriage first-eval-actionable-then-failed pipeline/recheck verdict-gate early-stop promotion boards (latest per label base) for `{early_stopped,inconclusive,reject,unknown}` slices (via `bench_candidate_status_board.py --promotion-filter u64_autotriage_first_eval_actionable_then_failed_{pipeline,recheck}_verdict_gate_early_stopped{,_inconclusive,_reject,_unknown}`)
- when `HARRIER_BENCH_AUTOTRIAGE_SHOW_GLOBAL_FAILED_REASON_BOARD=1` and `HARRIER_BENCH_AUTOTRIAGE_SHOW_EXTENDED_SUMMARIES=0`, compact global output still includes first-eval-actionable-then-failed early-stop status-board promotion boards for pipeline/recheck (plus their `*_unknown` slices), and first-eval pipeline/recheck early-stop promotion boards (plus their `*_unknown` slices)
- additionally prints status-board autotriage first-eval pipeline/recheck verdict-gate early-stop promotion boards (latest per label base) for `{early_stopped,inconclusive,reject,unknown}` slices (via `bench_candidate_status_board.py --promotion-filter u64_autotriage_first_eval_{pipeline,recheck_pipeline}_verdict_gate_early_stopped{,_inconclusive,_reject,_unknown}`)
- additionally prints status-board autotriage first-eval-actionable-recheck-failed recheck verdict-gate early-stop promotion boards (latest per label base) for `{early_stopped,inconclusive,reject,unknown}` slices (via `bench_candidate_status_board.py --promotion-filter u64_autotriage_first_eval_actionable_recheck_failed_verdict_gate_early_stopped{,_inconclusive,_reject,_unknown}`)
- additionally prints status-board eval actionable-recheck-failed recheck verdict-gate early-stop promotion boards (latest per label base) for `{early_stopped,inconclusive,reject,unknown}` slices (via `bench_candidate_status_board.py --promotion-filter u64_eval_actionable_recheck_failed_verdict_gate_early_stopped{,_inconclusive,_reject,_unknown}`)
- additionally prints first-eval-lookup-not-keep distribution (latest-label-base rows where first eval indicates lookup keep failure in pipeline or actionable recheck)
- additionally prints first-eval-low-load-hit-geo-regression distribution (latest-label-base rows where first eval indicates low-load hit-geo regression in pipeline or actionable recheck)
- additionally prints first-eval actionable distribution (latest-label-base rows where first eval reached actionable outcome)
- additionally prints first-eval actionable-recheck-failed distribution (latest-label-base rows where first eval already failed during actionable recheck)
- additionally prints first-eval actionable-recheck-failed-lookup-not-keep distribution (latest-label-base rows where first eval failed in actionable recheck due to lookup keep gate)
- additionally prints first-eval actionable-recheck-failed-low-load-hit-geo-regression distribution (latest-label-base rows where first eval failed in actionable recheck due to low-load hit-geo gate)
- additionally prints first-eval actionable-recheck-failed lookup-not-keep+low-load-hit-geo-regression distribution (latest-label-base rows where first eval failed in actionable recheck due to combined lookup-keep and low-load hit-geo gates)
- additionally prints first-eval pipeline-failed-lookup-not-keep distribution (latest-label-base rows where first eval failed lookup keep immediately)
- additionally prints first-eval pipeline-failed distribution (latest-label-base rows where first eval failed in pipeline stage)
- additionally prints first-eval pipeline-failed-low-load-hit-geo-regression distribution (latest-label-base rows where first eval failed on low-load hit-geo gate)
- additionally prints first-eval pipeline-failed lookup-not-keep+low-load-hit-geo-regression distribution (latest-label-base rows where first eval failed both lookup keep and low-load hit-geo gates)
- additionally prints first-eval-actionable-then-failed distribution
- additionally prints first-eval-actionable-then-failed known-only distribution (latest-label-base rows with `unknown` excluded)
- additionally prints first-eval-actionable-then-failed unknown-only distribution (latest-label-base rows where telemetry remains unknown)
- additionally prints first-eval-actionable-then-failed yes-only distribution (latest-label-base rows where first eval was actionable but later runs failed)
- additionally prints first-eval-actionable-then-failed yes-only pipeline/recheck verdict-gate runs-completed distributions (known-only by default; widens to include unknown run-count metadata when `HARRIER_BENCH_AUTOTRIAGE_RUN_COUNT_INCLUDE_UNKNOWN=1`)
- additionally prints first-eval-actionable-then-failed yes-only pipeline/recheck verdict-gate runs-completed known-only distributions
- additionally prints first-eval-actionable-then-failed yes-only pipeline/recheck verdict-gate runs-completed unknown-only distributions
- additionally prints first-eval-actionable-then-failed yes-only pipeline/recheck verdict-gate runs-completed exact-3 and exact-4 distributions
- additionally prints first-eval-actionable-then-failed yes-only pipeline/recheck verdict-gate runs-completed configured-range distributions when `HARRIER_BENCH_AUTOTRIAGE_FIRST_EVAL_ACTIONABLE_THEN_FAILED_{PIPELINE,RECHECK}_RUNS_{MIN,MAX}` is set
- additionally prints first-eval-actionable-then-failed yes-only pipeline/recheck verdict-gate early-stop distributions (plus explicit inconclusive and reject boards)
- additionally prints first-eval-actionable-then-failed no-only distribution (latest-label-base rows where first eval was instrumented but not actionable)
- additionally prints latest-label-base eval `pipeline_failed*` pipeline verdict-gate runs-completed rows: scoped (known-only by default, widened with `HARRIER_BENCH_AUTOTRIAGE_RUN_COUNT_INCLUDE_UNKNOWN=1`), explicit known-only, unknown-only, and exact-3/exact-4 splits
- additionally prints latest-label-base eval `pipeline_failed*` pipeline verdict-gate runs-completed configured-range rows when `HARRIER_BENCH_AUTOTRIAGE_EVAL_PIPELINE_FAILED_PIPELINE_RUNS_{MIN,MAX}` is set
- additionally prints latest-label-base eval `pipeline_failed*` pipeline verdict-gate early-stopped rows (plus inconclusive/reject splits)
- additionally prints latest-label-base eval pipeline/recheck verdict-gate early-stop unknown-only rows (filtered with `--only-{pipeline,recheck}-early-stop-unknown`)
- additionally prints latest-label-base eval `actionable_recheck_failed*` recheck verdict-gate runs-completed rows: scoped (known-only by default, widened with `HARRIER_BENCH_AUTOTRIAGE_RUN_COUNT_INCLUDE_UNKNOWN=1`), explicit known-only, unknown-only, and exact-3/exact-4 splits
- additionally prints latest-label-base eval `actionable_recheck_failed*` recheck verdict-gate runs-completed configured-range rows when `HARRIER_BENCH_AUTOTRIAGE_EVAL_ACTIONABLE_RECHECK_FAILED_RECHECK_RUNS_{MIN,MAX}` is set
- additionally prints latest-label-base eval `actionable_recheck_failed*` recheck verdict-gate early-stopped rows (plus inconclusive/reject splits)
- additionally prints latest-label-base streak-instability rows (`streak_actionable_before_failure=yes`) (latest per normalized base label)
- additionally prints latest-label-base first-eval-actionable-then-failed rows (`first_eval_outcome=actionable` and non-zero eval exit status)
- additionally prints latest-label-base known-only first-eval-actionable-then-failed rows (`yes|no`, excluding `unknown`)
- additionally prints latest-label-base unknown-only first-eval-actionable-then-failed rows (`first_eval_actionable_then_failed=unknown`)
- additionally prints latest-label-base yes-only first-eval-actionable-then-failed rows (`first_eval_actionable_then_failed=yes`)
- additionally prints latest-label-base no-only first-eval-actionable-then-failed rows (`first_eval_actionable_then_failed=no`)
- additionally prints latest-label-base first-eval-actionable-then-failed pipeline/recheck verdict-gate runs-completed rows: scoped (known-only by default, widens to yes-only scope when `HARRIER_BENCH_AUTOTRIAGE_RUN_COUNT_INCLUDE_UNKNOWN=1`), explicit known-only, unknown-only, and exact-3/exact-4 splits
- additionally prints latest-label-base configured-range first-eval-actionable-then-failed pipeline/recheck verdict-gate runs-completed rows when `HARRIER_BENCH_AUTOTRIAGE_FIRST_EVAL_ACTIONABLE_THEN_FAILED_{PIPELINE,RECHECK}_RUNS_{MIN,MAX}` is set
- additionally prints latest-label-base yes-only first-eval-actionable-then-failed rows where recheck verdict-gate metadata early-stopped (plus explicit inconclusive/reject row splits)
- additionally prints latest-label-base first-eval actionable-recheck-failed rows (`first_eval_outcome=actionable_recheck_failed*`)
- additionally prints latest-label-base first-eval-failed rows (`first_eval_outcome` starts with `pipeline_failed` or `actionable_recheck_failed`)
- additionally prints latest-label-base first-eval pipeline verdict-gate early-stopped rows (`first_eval_outcome` starts with `pipeline_failed` or `actionable_recheck_failed`, plus `pipeline_lookup_verdict_gate_early_stopped=yes`)
- additionally prints latest-label-base first-eval recheck verdict-gate early-stopped rows (`first_eval_outcome=actionable_recheck_failed*` plus `recheck_pipeline_lookup_verdict_gate_early_stopped=yes`)
- additionally prints latest-label-base autotriage pipeline/recheck verdict-gate early-stop unknown-only rows (filtered with `--only-{pipeline,recheck}-early-stop-unknown`)
- additionally prints latest-label-base status-board promotion-filter unknown rows for pipeline/eval/autotriage verdict-gate cohorts
- additionally prints latest-label-base status-board pipeline lookup-gate runs-completed promotion rows for known/unknown/eq3/eq4/in-range [3,4] slices (via `--promotion-filter u64_pipeline_lookup_gate_runs_completed_{known,unknown,eq3,eq4,in_range}` plus in-range bounds when using `_in_range`)
- additionally prints latest-label-base status-board pipeline/recheck verdict-gate runs-completed known/unknown rows (via `bench_candidate_status_board.py --only-{pipeline,recheck}-verdict-gate-runs-completed-{known,unknown}`)
- additionally prints latest-label-base status-board pipeline/recheck verdict-gate runs-completed exact-3/exact-4 rows (via `bench_candidate_status_board.py --only-{pipeline,recheck}-verdict-gate-runs-completed-eq {3,4}`)
- additionally prints latest-label-base status-board pipeline/recheck verdict-gate runs-completed in-range [3,4] rows (via `bench_candidate_status_board.py --only-{pipeline,recheck}-verdict-gate-runs-completed-min 3 --only-{pipeline,recheck}-verdict-gate-runs-completed-max 4`)
- additionally prints latest-label-base status-board autotriage pipeline/recheck verdict-gate runs-completed known/unknown promotion rows (via `bench_candidate_status_board.py --promotion-filter u64_autotriage_{pipeline,recheck}_verdict_gate_runs_completed_{known,unknown}`)
- additionally prints latest-label-base status-board autotriage pipeline/recheck verdict-gate runs-completed exact-3/exact-4 promotion rows (via `bench_candidate_status_board.py --promotion-filter u64_autotriage_{pipeline,recheck}_verdict_gate_runs_completed_{eq3,eq4}`)
- additionally prints latest-label-base status-board autotriage pipeline/recheck verdict-gate runs-completed in-range [3,4] promotion rows (via `bench_candidate_status_board.py --promotion-filter u64_autotriage_{pipeline,recheck}_verdict_gate_runs_completed_in_range --only-{pipeline,recheck}-verdict-gate-runs-completed-min 3 --only-{pipeline,recheck}-verdict-gate-runs-completed-max 4`)
- additionally prints latest-label-base status-board autotriage first-eval pipeline/recheck verdict-gate runs-completed known/unknown/eq3/eq4 and in-range [3,4] promotion rows (via `bench_candidate_status_board.py --promotion-filter u64_autotriage_first_eval_pipeline_verdict_gate_runs_completed_{known,unknown,eq3,eq4,in_range}` or `u64_autotriage_first_eval_recheck_pipeline_verdict_gate_runs_completed_{known,unknown,eq3,eq4,in_range}` plus `--only-first-eval-{pipeline,recheck}-verdict-gate-runs-completed-{min,max}` bounds when using `_in_range`)
- additionally prints latest-label-base status-board autotriage first-eval-actionable-then-failed pipeline/recheck verdict-gate runs-completed known/unknown/eq3/eq4 and in-range [3,4] promotion rows (via `bench_candidate_status_board.py --promotion-filter u64_autotriage_first_eval_actionable_then_failed_{pipeline,recheck}_verdict_gate_runs_completed_{known,unknown,eq3,eq4,in_range}` plus `--only-first-eval-actionable-then-failed-{pipeline,recheck}-verdict-gate-runs-completed-{min,max}` bounds when using `_in_range`)
- additionally prints latest-label-base status-board autotriage first-eval-actionable-then-failed pipeline/recheck verdict-gate early-stop promotion rows for `{early_stopped,inconclusive,reject,unknown}` slices (via `bench_candidate_status_board.py --promotion-filter u64_autotriage_first_eval_actionable_then_failed_{pipeline,recheck}_verdict_gate_early_stopped{,_inconclusive,_reject,_unknown}`)
- additionally prints latest-label-base status-board autotriage first-eval pipeline/recheck verdict-gate early-stop promotion rows for `{early_stopped,inconclusive,reject,unknown}` slices (via `bench_candidate_status_board.py --promotion-filter u64_autotriage_first_eval_{pipeline,recheck_pipeline}_verdict_gate_early_stopped{,_inconclusive,_reject,_unknown}`)
- additionally prints latest-label-base first-eval-lookup-not-keep rows (`first_eval_outcome` contains `lookup_not_keep`)
- additionally prints latest-label-base first-eval-low-load-hit-geo-regression rows (`first_eval_outcome` contains `low_load_hit_geo_regression`)
- additionally prints latest-label-base first-eval actionable-recheck-failed-lookup-not-keep rows (`first_eval_outcome=actionable_recheck_failed_lookup_not_keep`)
- additionally prints latest-label-base first-eval actionable-recheck-failed-low-load-hit-geo-regression rows (`first_eval_outcome=actionable_recheck_failed_low_load_hit_geo_regression`)
- additionally prints latest-label-base first-eval actionable-recheck-failed lookup-not-keep+low-load-hit-geo-regression rows (`first_eval_outcome=actionable_recheck_failed_lookup_not_keep_and_low_load_hit_geo_regression`)
- additionally prints latest-label-base first-eval actionable rows (`first_eval_outcome=actionable`)
- additionally prints latest-label-base first-eval pipeline-failed-lookup-not-keep rows (`first_eval_outcome=pipeline_failed_lookup_not_keep`)
- additionally prints latest-label-base first-eval pipeline-failed rows (`first_eval_outcome=pipeline_failed*`)
- additionally prints latest-label-base first-eval pipeline-failed-low-load-hit-geo-regression rows (`first_eval_outcome=pipeline_failed_low_load_hit_geo_regression`)
- additionally prints latest-label-base first-eval pipeline-failed lookup-not-keep+low-load-hit-geo-regression rows (`first_eval_outcome=pipeline_failed_lookup_not_keep_and_low_load_hit_geo_regression`)
- additionally prints latest-label-base latest-failure-family pipeline-lookup-not-keep rows (`latest_failure_family=pipeline_lookup_not_keep`)
- additionally prints latest-label-base latest-failure-family pipeline-low-load-hit-geo-regression rows (`latest_failure_family=pipeline_low_load_hit_geo_regression`)
- additionally prints latest-label-base latest-failure-family pipeline-lookup-not-keep+low-load-hit-geo-regression rows (`latest_failure_family=pipeline_lookup_not_keep+pipeline_low_load_hit_geo_regression`)
- additionally prints latest-label-base latest-failure-family recheck-lookup-not-keep rows (`latest_failure_family=recheck_lookup_not_keep`)
- additionally prints latest-label-base latest-failure-family recheck-low-load-hit-geo-regression rows (`latest_failure_family=recheck_low_load_hit_geo_regression`)
- additionally prints latest-label-base latest-failure-family recheck-lookup-not-keep+low-load-hit-geo-regression rows (`latest_failure_family=recheck_lookup_not_keep+low_load_hit_geo_regression`)
- by default appends a timestamp suffix to the provided label to avoid history collisions (`HARRIER_BENCH_AUTOTRIAGE_APPEND_STAMP=1`)
- supports optional multi-run confirmation via `HARRIER_BENCH_AUTOTRIAGE_CONFIRMATION_RUNS` (default `1`), with an actionable-success floor controlled by `HARRIER_BENCH_AUTOTRIAGE_MIN_CONFIRMATION_RUNS_ON_SUCCESS` (default `2`):
  - effective run target is `max(confirmation_runs, min_confirmation_runs_on_success)`
  - run `1` uses the stamped candidate label
  - runs `2+` append `-streakN` suffixes
  - wrapper stops at the first failed eval run and reports that run's label/outcome
  - sidecar `candidate_label_base` remains anchored to the requested base label, while the executed streak label is recorded via `last_eval_label`
  - sidecar metadata includes `confirmation_runs` (effective target), `confirmation_runs_requested`, `min_confirmation_runs_on_success`, and `confirmation_floor_applied` (`yes` when the minimum-success floor raised effective confirmation runs above requested runs)
- by default refuses to run when the git working tree is dirty (excluding untracked files) so candidate labels always map to committed code; override with `HARRIER_BENCH_AUTOTRIAGE_ALLOW_DIRTY_TREE=1` when intentionally benchmarking uncommitted edits
- optionally writes an autotriage sidecar report (`HARRIER_BENCH_AUTOTRIAGE_WRITE_REPORT=1`) to `HARRIER_BENCH_AUTOTRIAGE_SUMMARY_DIR` (default: `benchmarks/results`), containing eval exit status and latest exact-label outcome fields, including `latest_confirmed_actionable_rows`, `latest_failure_reason`, and `latest_failure_family`, plus run-sequence diagnostics:
  - `first_eval_label`, `first_eval_outcome`, `first_eval_exit_status`
  - `streak_actionable_before_failure` (flags unstable streaks where an earlier run was actionable before a later failure)
  - `run_outcome_sequence` (`label:outcome:exit_status` fragments joined by `;`)

  and metadata fields for `last_eval_label`, `confirmation_runs`, and `completed_eval_runs`. Sidecar rows include both:
  - `candidate_label` (the latest executed eval label, e.g. `-streak2`)
  - `requested_candidate_label` (the originally requested autotriage label)

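The confirmation rules above (effective run target, `-streakN` labels, stop at first failure, and the `run_outcome_sequence` encoding) can be sketched in a few lines of Python. This is an illustrative model only, not the wrapper's actual code: `run_confirmation` and the `run_eval` callback are hypothetical names, a non-zero exit status is assumed to mean a failed eval run, and the yes/no sidecar flags are modeled as booleans.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical callback: takes an eval label, returns (outcome, exit_status).
EvalFn = Callable[[str], Tuple[str, int]]


def run_confirmation(stamped_label: str, requested: int, min_on_success: int,
                     run_eval: EvalFn) -> Dict[str, object]:
    """Model the confirmation loop and return sidecar-style fields."""
    # Effective run target is the larger of the requested runs and the floor.
    effective = max(requested, min_on_success)
    fragments: List[str] = []
    seen_actionable = False
    unstable = False
    last_label = stamped_label

    for n in range(1, effective + 1):
        # Run 1 uses the stamped label; runs 2+ append a -streakN suffix.
        label = stamped_label if n == 1 else f"{stamped_label}-streak{n}"
        outcome, exit_status = run_eval(label)
        fragments.append(f"{label}:{outcome}:{exit_status}")
        last_label = label
        if exit_status != 0:
            # Assumption: stop at the first failed run; the streak is unstable
            # if an earlier run in it was actionable.
            unstable = seen_actionable
            break
        seen_actionable = seen_actionable or outcome == "actionable"

    return {
        "confirmation_runs": effective,
        "confirmation_runs_requested": requested,
        "min_confirmation_runs_on_success": min_on_success,
        "confirmation_floor_applied": effective > requested,
        "completed_eval_runs": len(fragments),
        "last_eval_label": last_label,
        "streak_actionable_before_failure": unstable,
        "run_outcome_sequence": ";".join(fragments),
    }
```

With one requested run and the default floor of 2, a candidate whose first run is actionable and whose second run fails would be reported with two completed runs, the `-streak2` label as `last_eval_label`, and the streak-instability flag set.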
Override any eval env var as needed, for example:

```bash
HARRIER_BENCH_CANDIDATE_LABEL=u64-my-candidate \
HARRIER_BENCH_EVAL_DRY_RUN=1 \
bash scripts/bench_lookup_u64_candidate_autotriage.sh
```

Disable timestamp suffixing for fixed-label reruns:

```bash
HARRIER_BENCH_CANDIDATE_LABEL=u64-my-candidate \
HARRIER_BENCH_AUTOTRIAGE_APPEND_STAMP=0 \
bash scripts/bench_lookup_u64_candidate_autotriage.sh
```

To summarize candidate recheck reports (confirmation-window artifacts), use:
```bash
python3 scripts/bench_candidate_recheck_outcomes.py --kind all --limit 20
```

Useful filters:

- `--kind lookup|insert|all`
- `--label-filter <substring>`
- `--limit <N>` (`-1` keeps all rows)

To summarize u64 eval outcomes (including pipeline/recheck reason row counts), use:
python3 scripts/bench_u64_eval_outcomes.py --limit 20Useful filters:
- `--label-filter <substring>`
- `--label-filter-exact` (requires `--label-filter`; exact label match)
- `--label-base-filter <substring>`
- `--label-base-filter-exact` (requires `--label-base-filter`)
- `--outcome-filter <csv>`
- `--failure-family-filter <substring>`
- `--failure-family-filter-exact` (requires `--failure-family-filter`; combined pipeline/recheck lookup+low-load families accept either canonical `...+low_load_hit_geo_regression` names or legacy `...+pipeline_low_load_hit_geo_regression`/`...+recheck_low_load_hit_geo_regression` aliases)
- `--pipeline-early-stop-verdict-filter <substring>`
- `--pipeline-early-stop-verdict-filter-exact` (requires `--pipeline-early-stop-verdict-filter`)
- `--recheck-early-stop-verdict-filter <substring>`
- `--recheck-early-stop-verdict-filter-exact` (requires `--recheck-early-stop-verdict-filter`)
- `--pipeline-verdict-gate-runs-completed-eq <N>` (known numeric pipeline run-count rows with value exactly `N`)
- `--pipeline-verdict-gate-runs-completed-min <N>` (known numeric pipeline run-count rows with value >= `N`)
- `--pipeline-verdict-gate-runs-completed-max <N>` (known numeric pipeline run-count rows with value <= `N`)
- `--recheck-verdict-gate-runs-completed-eq <N>` (known numeric recheck run-count rows with value exactly `N`)
- `--recheck-verdict-gate-runs-completed-min <N>` (known numeric recheck run-count rows with value >= `N`)
- `--recheck-verdict-gate-runs-completed-max <N>` (known numeric recheck run-count rows with value <= `N`)
- `--only-failed`
- `--only-pipeline-low-load-skipped` (rows where eval-side status-board snapshots report low-load stage skipped)
- `--only-pipeline-low-load-skipped-lookup-not-keep` (subset where the skip was specifically due to lookup verdict not keep)
- `--only-pipeline-verdict-gate-early-stopped` (rows whose first-pass pipeline verdict metadata reports early-stop)
- `--only-recheck-verdict-gate-early-stopped` (rows whose recheck pipeline verdict metadata reports early-stop)
- `--only-pipeline-verdict-gate-runs-completed-known` (rows whose first-pass pipeline verdict-gate run-count metadata is numeric/present)
- `--only-recheck-verdict-gate-runs-completed-known` (rows whose recheck verdict-gate run-count metadata is numeric/present)
- `--only-pipeline-early-stop-reject`
- `--only-pipeline-early-stop-inconclusive`
- `--only-pipeline-early-stop-unknown`
- `--only-recheck-early-stop-reject`
- `--only-recheck-early-stop-inconclusive`
- `--only-recheck-early-stop-unknown`
- `--only-pipeline-failed-verdict-gate-early-stopped` (rows where `eval_outcome` starts with `pipeline_failed` and first-pass verdict-gate metadata reports early-stop)
- `--only-pipeline-failed-verdict-gate-early-stopped-inconclusive`
- `--only-pipeline-failed-verdict-gate-early-stopped-reject`
- `--only-pipeline-failed-verdict-gate-early-stopped-unknown`
- `--only-pipeline-failed-verdict-gate-runs-completed-known` (rows where `eval_outcome` starts with `pipeline_failed` and first-pass verdict-gate run-count metadata is numeric/present)
- `--only-pipeline-failed-verdict-gate-runs-completed-unknown` (rows where `eval_outcome` starts with `pipeline_failed` and first-pass verdict-gate run-count metadata is missing/unknown)
- `--only-actionable-recheck-failed-verdict-gate-early-stopped` (rows where `eval_outcome` starts with `actionable_recheck_failed` and recheck verdict-gate metadata reports early-stop)
- `--only-actionable-recheck-failed-verdict-gate-early-stopped-inconclusive`
- `--only-actionable-recheck-failed-verdict-gate-early-stopped-reject`
- `--only-actionable-recheck-failed-verdict-gate-early-stopped-unknown`
- `--only-actionable-recheck-failed-verdict-gate-runs-completed-known` (rows where `eval_outcome` starts with `actionable_recheck_failed` and recheck verdict-gate run-count metadata is numeric/present)
- `--only-actionable-recheck-failed-verdict-gate-runs-completed-unknown` (rows where `eval_outcome` starts with `actionable_recheck_failed` and recheck verdict-gate run-count metadata is missing/unknown)
- `--only-actionable-confirmed` (rows with `eval_outcome=actionable` and `confirmed_actionable_rows > 0`)
- `--only-attributed-failures` (rows with non-empty derived `failure_reason`)
- `--latest-label` (collapse to newest eval report per candidate label)
- `--latest-label-base` (collapse to newest eval report per derived base label)
- `--summary` (aggregate counts by `eval_outcome`)
- `--summary-key eval_outcome|pipeline_final_message|recheck_pipeline_final_message|failure_reason|candidate_label_base|latest_failure_family|pipeline_lookup_verdict_gate_runs_completed|pipeline_lookup_verdict_gate_early_stopped|pipeline_lookup_verdict_gate_early_stop_verdict|recheck_pipeline_lookup_verdict_gate_runs_completed|recheck_pipeline_lookup_verdict_gate_early_stopped|recheck_pipeline_lookup_verdict_gate_early_stop_verdict`
- `--limit <N>` (`-1` keeps all rows)

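
The `<substring>` / `-exact` flag pairs listed above can be read as a single predicate per column. The sketch below models that reading in Rust for illustration only; the reporting scripts themselves are Python, `matches_filter` is a hypothetical name, and case-sensitive comparison is an assumption.

```rust
/// Hypothetical helper (not part of the repository's scripts).
/// `exact = false` models `--label-filter <substring>`;
/// `exact = true` models additionally passing `--label-filter-exact`.
fn matches_filter(value: &str, filter: Option<&str>, exact: bool) -> bool {
    match filter {
        // No filter given: every row passes.
        None => true,
        // Exact mode: the whole value must equal the filter text.
        Some(f) if exact => value == f,
        // Default mode: the filter text only needs to occur somewhere.
        Some(f) => value.contains(f),
    }
}
```

Under this model, a filter of `cand_a` keeps a row labelled `cand_a_v2` in substring mode but drops it in exact mode.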

To summarize u64 autotriage sidecars, use:

    python3 scripts/bench_u64_autotriage_outcomes.py --limit 20

Useful filters:

- `--label-filter <substring>`
- `--label-filter-exact` (requires `--label-filter`)
- `--label-base-filter <substring>`
- `--label-base-filter-exact` (requires `--label-base-filter`)
- `--requested-label-filter <substring>`
- `--requested-label-filter-exact` (requires `--requested-label-filter`)
- `--last-eval-label-filter <substring>`
- `--last-eval-label-filter-exact` (requires `--last-eval-label-filter`)
- `--only-failed`
- `--only-confirmation-multi-run` (rows with `confirmation_runs > 1`)
- `--only-confirmation-incomplete` (rows where `completed_eval_runs < confirmation_runs`)
- `--only-confirmation-complete` (rows where `completed_eval_runs >= confirmation_runs`)
- `--only-confirmation-floor-applied` (rows where effective `confirmation_runs > confirmation_runs_requested`)
- `--only-confirmation-floor-not-applied` (rows where effective `confirmation_runs == confirmation_runs_requested`)
- `--only-streak-actionable-before-failure` (rows where `streak_actionable_before_failure=yes`)
- `--only-first-eval-actionable` (rows where `first_eval_outcome=actionable`)
- `--only-first-eval-lookup-not-keep` (rows where `first_eval_outcome` contains `lookup_not_keep`)
- `--only-first-eval-low-load-hit-geo-regression` (rows where `first_eval_outcome` contains `low_load_hit_geo_regression`)
- `--only-first-eval-failed` (rows where `first_eval_outcome` starts with `pipeline_failed` or `actionable_recheck_failed`)
- `--only-first-eval-actionable-recheck-failed` (rows where `first_eval_outcome` starts with `actionable_recheck_failed`)
- `--only-first-eval-actionable-recheck-failed-lookup-not-keep` (rows where `first_eval_outcome=actionable_recheck_failed_lookup_not_keep`)
- `--only-first-eval-actionable-recheck-failed-low-load-hit-geo-regression` (rows where `first_eval_outcome=actionable_recheck_failed_low_load_hit_geo_regression`)
- `--only-first-eval-actionable-recheck-failed-lookup-not-keep-and-low-load-hit-geo-regression` (rows where `first_eval_outcome=actionable_recheck_failed_lookup_not_keep_and_low_load_hit_geo_regression`)
- `--only-first-eval-pipeline-failed-lookup-not-keep` (rows where `first_eval_outcome=pipeline_failed_lookup_not_keep`)
- `--only-first-eval-pipeline-failed` (rows where `first_eval_outcome` starts with `pipeline_failed`)
- `--only-first-eval-pipeline-failed-low-load-hit-geo-regression` (rows where `first_eval_outcome=pipeline_failed_low_load_hit_geo_regression`)
- `--only-first-eval-pipeline-failed-lookup-not-keep-and-low-load-hit-geo-regression` (rows where `first_eval_outcome=pipeline_failed_lookup_not_keep_and_low_load_hit_geo_regression`)
- `--only-first-eval-pipeline-verdict-gate-early-stopped` (rows where `first_eval_outcome` starts with `pipeline_failed` and pipeline verdict-gate metadata reports early-stop)
- `--only-first-eval-pipeline-verdict-gate-early-stopped-inconclusive` (rows where `first_eval_outcome` starts with `pipeline_failed` and pipeline verdict-gate early-stop verdict is `inconclusive`)
- `--only-first-eval-pipeline-verdict-gate-early-stopped-reject` (rows where `first_eval_outcome` starts with `pipeline_failed` and pipeline verdict-gate early-stop verdict is `reject`)
- `--only-first-eval-pipeline-verdict-gate-early-stopped-unknown` (rows where `first_eval_outcome` starts with `pipeline_failed`, pipeline verdict-gate metadata reports early-stop, and the early-stop verdict is missing/unknown)
- `--only-first-eval-recheck-verdict-gate-early-stopped` (rows where `first_eval_outcome` starts with `actionable_recheck_failed` and recheck verdict-gate metadata reports early-stop)
- `--only-first-eval-recheck-verdict-gate-early-stopped-inconclusive` (rows where `first_eval_outcome` starts with `actionable_recheck_failed` and recheck verdict-gate early-stop verdict is `inconclusive`)
- `--only-first-eval-recheck-verdict-gate-early-stopped-reject` (rows where `first_eval_outcome` starts with `actionable_recheck_failed` and recheck verdict-gate early-stop verdict is `reject`)
- `--only-first-eval-recheck-verdict-gate-early-stopped-unknown` (rows where `first_eval_outcome` starts with `actionable_recheck_failed`, recheck verdict-gate metadata reports early-stop, and the verdict is missing/unknown)
- `--only-first-eval-pipeline-verdict-gate-runs-completed-known` (rows where `first_eval_outcome` starts with `pipeline_failed` and pipeline verdict-gate runs-completed metadata is numeric)
- `--only-first-eval-pipeline-verdict-gate-runs-completed-unknown` (rows where `first_eval_outcome` starts with `pipeline_failed` and pipeline verdict-gate runs-completed metadata is unknown)
- `--only-first-eval-recheck-verdict-gate-runs-completed-known` (rows where `first_eval_outcome` starts with `actionable_recheck_failed` and recheck verdict-gate runs-completed metadata is numeric)
- `--only-first-eval-recheck-verdict-gate-runs-completed-unknown` (rows where `first_eval_outcome` starts with `actionable_recheck_failed` and recheck verdict-gate runs-completed metadata is unknown)
- `--only-first-eval-pipeline-verdict-gate-runs-completed-eq <N>` (rows where `first_eval_outcome` starts with `pipeline_failed` and pipeline verdict-gate runs-completed equals `<N>`)
- `--only-first-eval-pipeline-verdict-gate-runs-completed-min <N>` (rows where `first_eval_outcome` starts with `pipeline_failed` and pipeline verdict-gate runs-completed is `>= <N>`)
- `--only-first-eval-pipeline-verdict-gate-runs-completed-max <N>` (rows where `first_eval_outcome` starts with `pipeline_failed` and pipeline verdict-gate runs-completed is `<= <N>`)
- `--only-first-eval-recheck-verdict-gate-runs-completed-eq <N>` (rows where `first_eval_outcome` starts with `actionable_recheck_failed` and recheck verdict-gate runs-completed equals `<N>`)
- `--only-first-eval-recheck-verdict-gate-runs-completed-min <N>` (rows where `first_eval_outcome` starts with `actionable_recheck_failed` and recheck verdict-gate runs-completed is `>= <N>`)
- `--only-first-eval-recheck-verdict-gate-runs-completed-max <N>` (rows where `first_eval_outcome` starts with `actionable_recheck_failed` and recheck verdict-gate runs-completed is `<= <N>`)
- `--only-first-eval-actionable-then-failed` (rows where `first_eval_actionable_then_failed=yes`)
- `--only-first-eval-actionable-then-failed-yes` (rows where `first_eval_actionable_then_failed=yes`; explicit alias)
- `--only-first-eval-actionable-then-failed-known` (rows where `first_eval_actionable_then_failed` is `yes` or `no`)
- `--only-first-eval-actionable-then-failed-unknown` (rows where `first_eval_actionable_then_failed=unknown`)
- `--only-first-eval-actionable-then-failed-no` (rows where `first_eval_actionable_then_failed=no`)
- `--only-first-eval-actionable-then-failed-pipeline-verdict-gate-early-stopped` (rows where `first_eval_actionable_then_failed=yes` and pipeline verdict-gate metadata reports early-stop)
- `--only-first-eval-actionable-then-failed-pipeline-verdict-gate-early-stopped-inconclusive` (rows where `first_eval_actionable_then_failed=yes` and pipeline verdict-gate early-stop verdict is `inconclusive`)
- `--only-first-eval-actionable-then-failed-pipeline-verdict-gate-early-stopped-reject` (rows where `first_eval_actionable_then_failed=yes` and pipeline verdict-gate early-stop verdict is `reject`)
- `--only-first-eval-actionable-then-failed-pipeline-verdict-gate-early-stopped-unknown` (rows where `first_eval_actionable_then_failed=yes`, pipeline verdict-gate reports early-stop, and the verdict is missing/unknown)
- `--only-first-eval-actionable-then-failed-recheck-verdict-gate-early-stopped` (rows where `first_eval_actionable_then_failed=yes` and recheck verdict-gate metadata reports early-stop)
- `--only-first-eval-actionable-then-failed-recheck-verdict-gate-early-stopped-inconclusive` (rows where `first_eval_actionable_then_failed=yes` and recheck verdict-gate early-stop verdict is `inconclusive`)
- `--only-first-eval-actionable-then-failed-recheck-verdict-gate-early-stopped-reject` (rows where `first_eval_actionable_then_failed=yes` and recheck verdict-gate early-stop verdict is `reject`)
- `--only-first-eval-actionable-then-failed-recheck-verdict-gate-early-stopped-unknown` (rows where `first_eval_actionable_then_failed=yes`, recheck verdict-gate reports early-stop, and the verdict is missing/unknown)
- `--only-first-eval-actionable-then-failed-pipeline-verdict-gate-runs-completed-known` (rows where `first_eval_actionable_then_failed=yes` and pipeline verdict-gate runs-completed metadata is numeric)
- `--only-first-eval-actionable-then-failed-pipeline-verdict-gate-runs-completed-unknown` (rows where `first_eval_actionable_then_failed=yes` and pipeline verdict-gate runs-completed metadata is unknown)
- `--only-first-eval-actionable-then-failed-pipeline-verdict-gate-runs-completed-eq <N>` (rows where `first_eval_actionable_then_failed=yes` and pipeline verdict-gate runs-completed equals `<N>`)
- `--only-first-eval-actionable-then-failed-pipeline-verdict-gate-runs-completed-min <N>` (rows where `first_eval_actionable_then_failed=yes` and pipeline verdict-gate runs-completed is `>= <N>`)
- `--only-first-eval-actionable-then-failed-pipeline-verdict-gate-runs-completed-max <N>` (rows where `first_eval_actionable_then_failed=yes` and pipeline verdict-gate runs-completed is `<= <N>`)
- `--only-first-eval-actionable-then-failed-recheck-verdict-gate-runs-completed-known` (rows where `first_eval_actionable_then_failed=yes` and recheck verdict-gate runs-completed metadata is numeric)
- `--only-first-eval-actionable-then-failed-recheck-verdict-gate-runs-completed-unknown` (rows where `first_eval_actionable_then_failed=yes` and recheck verdict-gate runs-completed metadata is unknown)
- `--only-first-eval-actionable-then-failed-recheck-verdict-gate-runs-completed-eq <N>` (rows where `first_eval_actionable_then_failed=yes` and recheck verdict-gate runs-completed equals `<N>`)
- `--only-first-eval-actionable-then-failed-recheck-verdict-gate-runs-completed-min <N>` (rows where `first_eval_actionable_then_failed=yes` and recheck verdict-gate runs-completed is `>= <N>`)
- `--only-first-eval-actionable-then-failed-recheck-verdict-gate-runs-completed-max <N>` (rows where `first_eval_actionable_then_failed=yes` and recheck verdict-gate runs-completed is `<= <N>`)
- `--only-streak-no-actionable-before-failure` (rows where `streak_actionable_before_failure=no`)
- `--confirmation-state-filter <substring>`
- `--confirmation-state-filter-exact` (requires `--confirmation-state-filter`)
- `--confirmation-floor-filter <substring>`
- `--confirmation-floor-filter-exact` (requires `--confirmation-floor-filter`)
- `--first-eval-outcome-filter <substring>`
- `--first-eval-outcome-filter-exact` (requires `--first-eval-outcome-filter`)
- `--streak-actionable-before-failure-filter <substring>`
- `--streak-actionable-before-failure-filter-exact` (requires `--streak-actionable-before-failure-filter`)
- `--first-eval-actionable-then-failed-filter <substring>`
- `--first-eval-actionable-then-failed-filter-exact` (requires `--first-eval-actionable-then-failed-filter`)
- `--only-actionable-confirmed` (rows with `latest_eval_outcome=actionable` and `latest_confirmed_actionable_rows > 0`)
- `--failure-family-filter <substring>`
- `--failure-family-filter-exact` (requires `--failure-family-filter`)
- `--pipeline-early-stop-verdict-filter <substring>`
- `--pipeline-early-stop-verdict-filter-exact` (requires `--pipeline-early-stop-verdict-filter`)
- `--recheck-early-stop-verdict-filter <substring>`
- `--recheck-early-stop-verdict-filter-exact` (requires `--recheck-early-stop-verdict-filter`)
- `--pipeline-verdict-gate-runs-completed-eq <N>`
- `--pipeline-verdict-gate-runs-completed-min <N>`
- `--pipeline-verdict-gate-runs-completed-max <N>`
- `--recheck-verdict-gate-runs-completed-eq <N>`
- `--recheck-verdict-gate-runs-completed-min <N>`
- `--recheck-verdict-gate-runs-completed-max <N>`
- `--only-pipeline-verdict-gate-early-stopped`
- `--only-recheck-verdict-gate-early-stopped`
- `--only-pipeline-verdict-gate-runs-completed-known`
- `--only-recheck-verdict-gate-runs-completed-known`
- `--only-pipeline-early-stop-reject`
- `--only-pipeline-early-stop-inconclusive`
- `--only-pipeline-early-stop-unknown`
- `--only-recheck-early-stop-reject`
- `--only-recheck-early-stop-inconclusive`
- `--only-recheck-early-stop-unknown`
- `--only-latest-failure-family-pipeline-lookup-not-keep` (rows with pipeline lookup-not-keep latest failure families, including combined pipeline lookup+low-load entries)
- `--only-latest-failure-family-pipeline-low-load-hit-geo-regression` (rows with pipeline low-load hit-geo latest failure families, including combined pipeline lookup+low-load entries)
- `--only-latest-failure-family-pipeline-lookup-not-keep-and-low-load-hit-geo-regression` (rows with combined pipeline lookup-not-keep + low-load-hit-geo latest failure families)
- `--only-latest-failure-family-recheck-lookup-not-keep` (rows with recheck lookup-not-keep latest failure families, including combined recheck lookup+low-load entries)
- `--only-latest-failure-family-recheck-low-load-hit-geo-regression` (rows with recheck low-load hit-geo latest failure families, including combined recheck lookup+low-load entries)
- `--only-latest-failure-family-recheck-lookup-not-keep-and-low-load-hit-geo-regression` (rows with combined recheck lookup-not-keep + low-load-hit-geo latest failure families)
- `--latest-label`
- `--latest-label-base`
- `--summary`
- `--summary-key latest_eval_outcome|latest_failure_reason|latest_failure_family|candidate_label_base|eval_exit_status|confirmation_runs|confirmation_runs_requested|min_confirmation_runs_on_success|confirmation_floor_applied|completed_eval_runs|requested_candidate_label|confirmation_state|last_eval_label|first_eval_outcome|streak_actionable_before_failure|run_outcome_sequence|first_eval_actionable_then_failed|pipeline_lookup_verdict_gate_runs_completed|pipeline_lookup_verdict_gate_early_stopped|pipeline_lookup_verdict_gate_early_stop_verdict|recheck_pipeline_lookup_verdict_gate_runs_completed|recheck_pipeline_lookup_verdict_gate_early_stopped|recheck_pipeline_lookup_verdict_gate_early_stop_verdict`
- `--limit <N>` (`-1` keeps all rows)

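
The `...-verdict-gate-runs-completed-eq|min|max <N>` filters apply only to rows whose run count is a known number. A minimal Rust sketch of that behaviour follows. It is illustrative only: `RunCountFilter` is a hypothetical type, and treating an empty or non-numeric cell as unknown is an assumption about how the sidecar data is encoded.

```rust
/// Hypothetical model of the `-eq`, `-min` and `-max` run-count bounds.
#[derive(Default)]
struct RunCountFilter {
    eq: Option<u64>,
    min: Option<u64>,
    max: Option<u64>,
}

impl RunCountFilter {
    /// Returns true when the raw cell value passes every configured bound.
    fn matches(&self, raw: &str) -> bool {
        // With no bound configured, the filter is inactive and keeps all rows.
        if self.eq.is_none() && self.min.is_none() && self.max.is_none() {
            return true;
        }
        // Assumption: only values that parse as integers count as "known numeric".
        let n = match raw.trim().parse::<u64>() {
            Ok(n) => n,
            Err(_) => return false,
        };
        self.eq.map_or(true, |e| n == e)
            && self.min.map_or(true, |lo| n >= lo)
            && self.max.map_or(true, |hi| n <= hi)
    }
}
```

Combining `-min` and `-max` in this model selects an inclusive range, and a row with an unknown run count is dropped as soon as any bound is set.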

Per-row output now also includes a derived
`first_eval_actionable_then_failed` column (`yes|no|unknown`) so downstream
tools can filter/aggregate the instability signal without recomputing it.
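
As an illustration of downstream aggregation, the sketch below tallies the three values of that column in Rust. The names `Tri`, `parse_tri`, and `tally` are hypothetical, and mapping any unrecognised cell to `Unknown` is an assumption; how the report rows are loaded is left out.

```rust
use std::collections::BTreeMap;

/// Tri-state value of the `first_eval_actionable_then_failed` column.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Tri {
    Yes,
    No,
    Unknown,
}

/// Parse one cell of the column.
fn parse_tri(raw: &str) -> Tri {
    match raw.trim() {
        "yes" => Tri::Yes,
        "no" => Tri::No,
        // Assumption: anything else, including an empty cell, is unknown.
        _ => Tri::Unknown,
    }
}

/// Count how many rows fall into each state.
fn tally<'a>(cells: impl IntoIterator<Item = &'a str>) -> BTreeMap<Tri, usize> {
    let mut counts = BTreeMap::new();
    for cell in cells {
        *counts.entry(parse_tri(cell)).or_insert(0) += 1;
    }
    counts
}
```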

Autotriage sidecars also propagate latest eval verdict-gate metadata columns
(`pipeline_lookup_verdict_gate_*` and `recheck_pipeline_lookup_verdict_gate_*`)
so reject vs inconclusive early-stop exits can be analyzed directly from
autotriage reports.

To run a focused subset from the helper script (single op / impl / case), pass the same selectors supported by the bench binary:

    HARRIER_BENCH_ONLY_OP=insert_new \
    HARRIER_BENCH_ONLY_IMPL=harrier \
    HARRIER_BENCH_ONLY_N=786432 \
    HARRIER_BENCH_ONLY_LOAD=0.75 \
    HARRIER_BENCH_ISOLATE_CASES=1 \
    HARRIER_BENCH_ISOLATE_OPS=1 \
    HARRIER_BENCH_ISOLATE_IMPLS=1 \
    bash scripts/bench_iter.sh

`HARRIER_BENCH_ONLY_OP` also accepts a comma-separated list (for example:
`find_hit,find_hit_prehashed,find_miss,find_miss_prehashed`) when you want to
capture related operations in a single timestamped result file. With
`HARRIER_BENCH_ISOLATE_OPS=1`, comma-separated op lists are split and each op
is executed in its own isolated process.

`HARRIER_BENCH_ONLY_IMPL` also accepts comma-separated implementations (for
example: `harrier,hashbrown`). With `HARRIER_BENCH_ISOLATE_IMPLS=1`, each
implementation is run in its own isolated process.
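
The splitting behaviour described above can be sketched in Rust as follows. This is an illustration under stated assumptions, not the helper script's actual logic: `split_selector` and `plan_processes` are hypothetical names, whitespace trimming and skipping empty entries are assumptions, and so is the idea that enabling both isolation flags yields one process per (op, implementation) pair.

```rust
/// Split a comma-separated selector such as `find_hit,find_miss`.
/// Assumption: surrounding whitespace and empty entries are ignored.
fn split_selector(raw: &str) -> Vec<String> {
    raw.split(',')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(str::to_owned)
        .collect()
}

/// Hypothetical planner: one `(ops, impls)` selector pair per process.
/// With an isolation flag off, the unsplit list is passed to a single process.
fn plan_processes(
    ops: &str,
    impls: &str,
    isolate_ops: bool,
    isolate_impls: bool,
) -> Vec<(String, String)> {
    let op_groups = if isolate_ops {
        split_selector(ops)
    } else {
        vec![ops.to_owned()]
    };
    let impl_groups = if isolate_impls {
        split_selector(impls)
    } else {
        vec![impls.to_owned()]
    };

    // Assumption: isolating both dimensions gives the cross product.
    let mut plan = Vec::new();
    for op in &op_groups {
        for imp in &impl_groups {
            plan.push((op.clone(), imp.clone()));
        }
    }
    plan
}
```

In this model, `find_hit,find_miss` with `harrier,hashbrown` and both isolation flags set produces four process plans, while leaving both flags off produces one.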

You can also run a single implementation directly from the bench binary:

    HARRIER_BENCH_ONLY_IMPL=harrier_u64 cargo run --release --bin bench