Stage 2: offline pilot of the SHS evaluation pipeline by skhuang · Pull Request #6 · AlaRduTP/KOFTA

skhuang · 2026-05-31T11:28:42Z

Summary

Stage 2 of the SHS evaluation: a small-scale, offline (--mock, no API) end-to-end exercise of the kofta-campaign → post-process → kofta-stats pipeline against the toy docker/smoke.c target. The goal is to validate the orchestration / extraction / stats wiring before the expensive Stage 3 campaigns spend real compute + API budget.

Three real wiring bugs surfaced and fixed

launch_one rejected timeout exit 124. A timeout {duration} cap returns 124 at the normal end of a run, so every run was marked failed and post-processing was skipped. Now accepts 0 and 124.
opts.csv written to the wrong place. Post-processing wrote it into the cov/ run dir, but the undoc loader reads undoc/<target>/<config>/<run>/opts.csv. Now mirrored there (same pattern as the cost record).
kofta-stats had no target override. It hardcoded the paper's 9 eval binaries, so any other target rendered as all-placeholder rows. Added --targets.
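The exit-status fix in the first item can be sketched as below. This is a minimal illustration, not the code in launch_one: the names `run_succeeded`, `TIMEOUT_EXIT` and `ACCEPTED_EXIT_CODES` are hypothetical. It relies only on the fact that GNU `timeout` exits with 124 when the duration cap expires.

```python
# Hypothetical helper: the real check lives in launch_one, which is not shown here.
TIMEOUT_EXIT = 124  # status GNU `timeout` returns when the duration cap expires

# A run counts as completed if it exited cleanly or was stopped by the cap.
ACCEPTED_EXIT_CODES = frozenset({0, TIMEOUT_EXIT})


def run_succeeded(returncode: int) -> bool:
    """Return True when the exit status means the run reached its normal end."""
    return returncode in ACCEPTED_EXIT_CODES


if __name__ == "__main__":
    # 1 is a generic failure and 137 a SIGKILL; both must still count as failed.
    for code in (0, 124, 1, 137):
        print(code, run_succeeded(code))
```

Keeping the accepted set explicit, rather than treating any non-zero status as acceptable, means genuine crashes still skip post-processing.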

New pieces

shs/campaign.pilot.json — pilot spec (kofta/kshs/kshsng × smoke × 2 runs, 25 s each). Only KOFTA-buildable configs; weifuzz/llmonly need an external wei-fuzz not in this repo.
docker/run-pilot.sh — builds + instruments the toy target, lays out the campaign input tree, runs the pilot matrix, and asserts kofta-stats emits a populated smoke row.
.github/workflows/pilot-linux.yml — runs the pilot on a native x86_64 ubuntu:20.04 runner. The forkserver needs glibc ≤2.33 and real (non-emulated) x86_64; QEMU-emulated x86_64 (e.g. Colima on Apple Silicon) breaks it the same way glibc 2.34+ does, so the real pilot can only run in CI.
shs/tests/test_campaign.py — drives shs.campaign with fake fuzzer/showmap/opts commands (no forkserver) to regression-gate the three fixes on any host.
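The fake-command approach used for the wiring test could look roughly like the sketch below. It is an assumption-laden illustration, not the contents of shs/tests/test_campaign.py: `run_fake_fuzzer`, `mirror_opts` and `pilot_once` are hypothetical names, the `cov/<target>/<config>/<run>` layout and the `kofta`/`run0` labels are assumed for the example, and the CSV contents are placeholders. Only the `undoc/<target>/<config>/<run>/opts.csv` destination comes from the description above.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def run_fake_fuzzer(exit_code: int) -> int:
    """Stand-in for a fuzzer: a child process that just exits with exit_code."""
    cmd = [sys.executable, "-c", f"import sys; sys.exit({exit_code})"]
    return subprocess.run(cmd, check=False).returncode


def mirror_opts(cov_run_dir: Path, undoc_root: Path,
                target: str, config: str, run: str) -> Path:
    """Copy opts.csv to undoc/<target>/<config>/<run>/opts.csv."""
    dest_dir = undoc_root / target / config / run
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / "opts.csv"
    shutil.copyfile(cov_run_dir / "opts.csv", dest)
    return dest


def pilot_once(root: Path):
    """Run one fake campaign cell; return the mirrored path or None on failure."""
    if run_fake_fuzzer(124) not in (0, 124):  # 124 = timeout cap reached
        return None
    # Assumed layout for the coverage run dir; placeholder CSV contents.
    cov_run_dir = root / "cov" / "smoke" / "kofta" / "run0"
    cov_run_dir.mkdir(parents=True)
    (cov_run_dir / "opts.csv").write_text("option\n--example\n")
    return mirror_opts(cov_run_dir, root / "undoc", "smoke", "kofta", "run0")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(pilot_once(Path(tmp)))
```

Because the child is an ordinary Python process, nothing here depends on a forkserver, which is what lets this style of test run on any host.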

Test plan

python3 -m pytest shs/tests/ — 19/19 pass locally (incl. 3 new orchestrator wiring tests).
pilot-linux CI job green (real end-to-end on native x86_64 — first real exercise of the pipeline).

🤖 Generated with Claude Code

Adds a small-scale, --mock (no API) end-to-end exercise of the campaign -> postprocess -> kofta-stats pipeline against the toy smoke target, so the orchestration/extraction/stats wiring is validated before the expensive Stage 3 campaigns spend real compute + API budget.

Surfaced and fixed three real wiring bugs:

* campaign.launch_one treated every non-zero exit as failure, but a `timeout {duration}` cap returns 124 at the normal end of a run, so every run was marked failed and post-processing skipped. Accept 124.
* post-processing wrote opts.csv into the cov run dir, but the undoc loader reads undoc/<target>/<config>/<run>/opts.csv -- mirror it there (same pattern as the cost record), so tab:undoc can populate.
* kofta-stats hardcoded the paper's 9 eval binaries with no override, so any other target rendered as all-placeholder rows. Add --targets.

New pieces:

* shs/campaign.pilot.json -- pilot spec (kofta/kshs/kshsng x smoke x 2).
* docker/run-pilot.sh -- builds + instruments the toy target, runs the pilot matrix, asserts kofta-stats emits a populated smoke row.
* .github/workflows/pilot-linux.yml -- runs it on a native x86_64 ubuntu:20.04 runner (the forkserver needs glibc<=2.33 AND real, non-emulated x86_64), plus a host-agnostic orchestrator wiring test.
* shs/tests/test_campaign.py -- drives shs.campaign with fake fuzzer/showmap/opts commands (no forkserver) to regression-gate the three fixes on any host.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…sertion

Three fixes found while validating the Stage 2 pilot on a real Ubuntu 20.04 (glibc 2.31, system python3.8) image:

- kofta-opts / kofta-stats: add `from __future__ import annotations` so the PEP 585 `list[...]`/`dict[...]` hints parse on python3.8 (Ubuntu 20.04's default), which otherwise raises "TypeError: 'type' object is not subscriptable" on import. This is the exact environment the forkserver requires.
- docker/run-pilot.sh: kofta-stats emits every table row + facts line via debug.psay(), which writes to STDERR. The assertion grep'd `pilot-stats.out` produced by `... | tee`, which only captures stdout -- so the file was empty and every assertion failed spuriously despite a fully populated pipeline. Fold stderr into the pipe with `2>&1`.

Verified: pilot now reports "PASS: pilot pipeline produced real tables from real artifacts" on emulated x86_64 / Ubuntu 20.04 (smoke cov row 7 & 7 edges).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
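The stdout/stderr mix-up behind the run-pilot.sh fix can be reproduced in a few lines. The sketch below is illustrative only: the child process is a stand-in that prints one line to stderr the way the commit says debug.psay() does, and `capture` is a hypothetical helper. Folding stderr into stdout with `subprocess.STDOUT` is the Python counterpart of the shell's `2>&1`.

```python
import subprocess
import sys

# Stand-in child: writes its only output line to stderr, not stdout.
CHILD = [sys.executable, "-c", "import sys; print('smoke row', file=sys.stderr)"]


def capture(fold_stderr: bool) -> str:
    """Return what a stdout-only consumer sees, optionally with stderr folded in."""
    result = subprocess.run(
        CHILD,
        stdout=subprocess.PIPE,
        # STDOUT merges stderr into the captured pipe (like `2>&1`);
        # DEVNULL simply keeps the demo quiet when stderr is not folded in.
        stderr=subprocess.STDOUT if fold_stderr else subprocess.DEVNULL,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    print(repr(capture(fold_stderr=False)))  # nothing reaches the stdout pipe
    print(repr(capture(fold_stderr=True)))   # the stderr line is now captured
```

A grep over the first capture finds nothing even though the child produced its row, which matches the spurious assertion failures described above.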

skhuang and others added 2 commits May 31, 2026 19:27

skhuang merged commit 3ae5d07 into main May 31, 2026
6 checks passed

skhuang deleted the skhuang/stage2-pilot branch May 31, 2026 11:55
