Stage 2: offline pilot of the SHS evaluation pipeline #6
Merged
Adds a small-scale, --mock (no API) end-to-end exercise of the
campaign -> postprocess -> kofta-stats pipeline against the toy smoke
target, so the orchestration/extraction/stats wiring is validated before
the expensive Stage 3 campaigns spend real compute + API budget.
Surfaced and fixed three real wiring bugs:
* campaign.launch_one treated every non-zero exit as failure, but a
`timeout {duration}` cap returns 124 at the normal end of a run, so
every run was marked failed and post-processing skipped. Accept 124.
* post-processing wrote opts.csv into the cov run dir, but the undoc
loader reads undoc/<target>/<config>/<run>/opts.csv -- mirror it there
(same pattern as the cost record), so tab:undoc can populate.
* kofta-stats hardcoded the paper's 9 eval binaries with no override, so
any other target rendered as all-placeholder rows. Add --targets.
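The first fix can be illustrated with a minimal sketch. This is not the actual `campaign.launch_one` code: `OK_EXIT_CODES`, `run_finished_normally`, and `fake_run` are hypothetical names, and the only facts taken from the commit are that exit codes 0 and 124 are both accepted and that `timeout` returns 124 when the cap is reached.

```python
import subprocess
import sys

# GNU coreutils `timeout` exits with 124 when it stops the command at the cap.
TIMEOUT_EXIT = 124
# Hypothetical name: exit codes that count as a normally finished run.
OK_EXIT_CODES = frozenset({0, TIMEOUT_EXIT})


def run_finished_normally(cmd):
    """Run cmd and report whether its exit code is one of the accepted ones."""
    proc = subprocess.run(cmd)
    return proc.returncode in OK_EXIT_CODES


def fake_run(exit_code):
    """Stand-in for a capped fuzzer run: a child Python that just exits."""
    return [sys.executable, "-c", f"import sys; sys.exit({exit_code})"]


if __name__ == "__main__":
    for code in (0, 124, 1):
        print(code, run_finished_normally(fake_run(code)))
```

Keeping the accepted codes in one named set means any other non-zero exit is still treated as a failure, so a genuine crash does not slip through.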
New pieces:
* shs/campaign.pilot.json -- pilot spec (kofta/kshs/kshsng x smoke x 2).
* docker/run-pilot.sh -- builds + instruments the toy target, runs
the pilot matrix, asserts kofta-stats emits a populated smoke row.
* .github/workflows/pilot-linux.yml -- runs it on a native x86_64
ubuntu:20.04 runner (the forkserver needs glibc<=2.33 AND real, non-
emulated x86_64), plus a host-agnostic orchestrator wiring test.
* shs/tests/test_campaign.py -- drives shs.campaign with fake fuzzer/
showmap/opts commands (no forkserver) to regression-gate the three
fixes on any host.
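To show how a fake command can exercise the wiring without a forkserver, here is a rough sketch aimed at the opts.csv fix. It does not use the real `shs.campaign` API: `fake_opts_command`, `mirror_opts`, and `demo` are hypothetical helpers, the CSV content is a placeholder, and the `cov/<target>/<config>/<run>` layout is an assumption. Only the `undoc/<target>/<config>/<run>/opts.csv` destination comes from the commit message.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def fake_opts_command(out_csv):
    """Stand-in for the real opts extractor: writes a tiny placeholder CSV."""
    script = "import sys; open(sys.argv[1], 'w').write('flag\\n--example\\n')"
    return [sys.executable, "-c", script, str(out_csv)]


def mirror_opts(cov_run_dir, root, target, config, run):
    """Copy opts.csv from the cov run dir to undoc/<target>/<config>/<run>/."""
    dst = Path(root) / "undoc" / target / config / run / "opts.csv"
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(Path(cov_run_dir) / "opts.csv", dst)
    return dst


def demo():
    """Run the fake command, mirror its output, and return the mirrored text."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Assumed layout for the cov side; only the undoc path is from the PR.
        cov_run_dir = root / "cov" / "smoke" / "kofta" / "run0"
        cov_run_dir.mkdir(parents=True)
        subprocess.run(fake_opts_command(cov_run_dir / "opts.csv"), check=True)
        dst = mirror_opts(cov_run_dir, root, "smoke", "kofta", "run0")
        return dst.relative_to(root).as_posix(), dst.read_text()
```

Because the fake command is just a child Python process, a check like this runs on any host, which is the property the orchestrator wiring test relies on.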
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…sertion
Three fixes found while validating the Stage 2 pilot on a real Ubuntu 20.04 (glibc 2.31, system python3.8) image:
- kofta-opts / kofta-stats: add `from __future__ import annotations` so the PEP 585 `list[...]`/`dict[...]` hints parse on python3.8 (Ubuntu 20.04's default), which otherwise raises "TypeError: 'type' object is not subscriptable" on import. This is the exact environment the forkserver requires.
- docker/run-pilot.sh: kofta-stats emits every table row + facts line via debug.psay(), which writes to STDERR. The assertion grep'd `pilot-stats.out` produced by `... | tee`, which only captures stdout -- so the file was empty and every assertion failed spuriously despite a fully populated pipeline. Fold stderr into the pipe with `2>&1`.
Verified: pilot now reports "PASS: pilot pipeline produced real tables from real artifacts" on emulated x86_64 / Ubuntu 20.04 (smoke cov row 7 & 7 edges).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
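The stdout/stderr mix-up in the second item can be reproduced in a few lines. This is only an illustration under stated assumptions: `CHILD` is a made-up stand-in for a tool that prints its rows to stderr, and `capture` is a hypothetical helper. Passing `stderr=subprocess.STDOUT` plays the role of `2>&1`; the unmerged case discards stderr here to keep the demo quiet, whereas a shell pipe would leave it on the terminal.

```python
import subprocess
import sys

# Made-up child process: one line on stdout, the "table row" on stderr.
CHILD = "import sys; print('to stdout'); print('table row', file=sys.stderr)"


def capture(merge_stderr):
    """Return what a pipe reading the child's stdout would see."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD],
        stdout=subprocess.PIPE,
        # STDOUT folds stderr into the same pipe, like `2>&1` before `| tee`.
        stderr=subprocess.STDOUT if merge_stderr else subprocess.DEVNULL,
        text=True,
    )
    return proc.stdout


if __name__ == "__main__":
    print("row captured without merge:", "table row" in capture(False))
    print("row captured with merge:", "table row" in capture(True))
```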
Summary
Stage 2 of the SHS evaluation: a small-scale, offline (`--mock`, no API) end-to-end exercise of the `kofta-campaign` → post-process → `kofta-stats` pipeline against the toy `docker/smoke.c` target. The goal is to validate the orchestration / extraction / stats wiring before the expensive Stage 3 campaigns spend real compute + API budget.
Three real wiring bugs surfaced and fixed
* `launch_one` rejected timeout exit 124. A `timeout {duration}` cap returns 124 at the normal end of a run, so every run was marked failed and post-processing was skipped. Now accepts `0` and `124`.
* `opts.csv` written to the wrong place. Post-processing wrote it into the `cov/` run dir, but the undoc loader reads `undoc/<target>/<config>/<run>/opts.csv`. Now mirrored there (same pattern as the cost record).
* `kofta-stats` had no target override. It hardcoded the paper's 9 eval binaries, so any other target rendered as all-placeholder rows. Added `--targets`.
New pieces
* `shs/campaign.pilot.json` -- pilot spec (`kofta`/`kshs`/`kshsng` × `smoke` × 2 runs, 25 s each). Only KOFTA-buildable configs; `weifuzz`/`llmonly` need an external `wei-fuzz` not in this repo.
* `docker/run-pilot.sh` -- builds + instruments the toy target, lays out the campaign input tree, runs the pilot matrix, and asserts `kofta-stats` emits a populated `smoke` row.
* `.github/workflows/pilot-linux.yml` -- runs the pilot on a native x86_64 `ubuntu:20.04` runner. The forkserver needs glibc ≤2.33 and real (non-emulated) x86_64; QEMU-emulated x86_64 (e.g. Colima on Apple Silicon) breaks it the same way glibc 2.34+ does, so the real pilot can only run in CI.
* `shs/tests/test_campaign.py` -- drives `shs.campaign` with fake fuzzer/showmap/opts commands (no forkserver) to regression-gate the three fixes on any host.
Test plan
* `python3 -m pytest shs/tests/` -- 19/19 pass locally (incl. 3 new orchestrator wiring tests).
* `pilot-linux` CI job green (real end-to-end on native x86_64; first real exercise of the pipeline).
🤖 Generated with Claude Code