Stage 2: offline pilot of the SHS evaluation pipeline #6

Merged
skhuang merged 2 commits into main from skhuang/stage2-pilot
May 31, 2026
@skhuang skhuang commented May 31, 2026

Summary

Stage 2 of the SHS evaluation: a small-scale, offline (--mock, no API) end-to-end exercise of the kofta-campaign → post-process → kofta-stats pipeline against the toy docker/smoke.c target. The goal is to validate the orchestration / extraction / stats wiring before the expensive Stage 3 campaigns spend real compute + API budget.

Three real wiring bugs surfaced and fixed

  • launch_one treated timeout's exit code 124 as a failure. A timeout {duration} cap returns 124 at the normal end of a run, so every run was marked failed and post-processing was skipped. launch_one now accepts both 0 and 124.
  • opts.csv was written to the wrong place. Post-processing wrote it into the cov/ run dir, but the undoc loader reads undoc/<target>/<config>/<run>/opts.csv. The file is now mirrored there (same pattern as the cost record).
  • kofta-stats had no target override. It hardcoded the paper's 9 eval binaries, so any other target rendered as all-placeholder rows. Added a --targets option.
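
To illustrate the --targets override from the last item, here is a minimal sketch of how such an option might be wired with argparse. Only the flag name comes from this PR; the `DEFAULT_TARGETS` placeholder list, the `parse_targets` helper, and the space-separated value syntax are assumptions, not the actual kofta-stats code.

```python
import argparse

# Placeholder names only: the real list of nine eval binaries is not shown in this PR.
DEFAULT_TARGETS = ["binary_a", "binary_b"]


def parse_targets(argv):
    """Return the targets to render: the --targets override if given, else the defaults."""
    parser = argparse.ArgumentParser(prog="kofta-stats")
    parser.add_argument(
        "--targets",
        nargs="+",       # assumed syntax: --targets smoke other
        default=None,
        help="targets to render instead of the built-in list",
    )
    args = parser.parse_args(argv)
    return args.targets if args.targets else list(DEFAULT_TARGETS)
```

With this shape, `parse_targets(["--targets", "smoke"])` yields only the pilot target, while an empty argument list falls back to the built-in set.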

New pieces

  • shs/campaign.pilot.json — pilot spec (kofta/kshs/kshsng × smoke × 2 runs, 25 s each). Only KOFTA-buildable configs; weifuzz/llmonly need an external wei-fuzz not in this repo.
  • docker/run-pilot.sh — builds + instruments the toy target, lays out the campaign input tree, runs the pilot matrix, and asserts kofta-stats emits a populated smoke row.
  • .github/workflows/pilot-linux.yml — runs the pilot on a native x86_64 ubuntu:20.04 runner. The forkserver needs glibc ≤2.33 and real (non-emulated) x86_64; QEMU-emulated x86_64 (e.g. Colima on Apple Silicon) breaks it the same way glibc 2.34+ does, so the real pilot can only run in CI.
  • shs/tests/test_campaign.py — drives shs.campaign with fake fuzzer/showmap/opts commands (no forkserver) to regression-gate the three fixes on any host.
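
As a rough picture of the fake-command approach, the sketch below regression-tests the exit-code fix without a forkserver by substituting a tiny Python child process for the fuzzer. The `launch` and `fake_command` helpers are hypothetical stand-ins, not the real shs.campaign API.

```python
import subprocess
import sys

# 0 is a clean exit; 124 is what GNU timeout returns when its time cap expires.
ACCEPTED_EXIT_CODES = {0, 124}


def launch(cmd):
    """Hypothetical stand-in for campaign.launch_one: run cmd and report success."""
    result = subprocess.run(cmd)
    return result.returncode in ACCEPTED_EXIT_CODES


def fake_command(exit_code):
    """Build a command that exits with the given code, with no fuzzer involved."""
    return [sys.executable, "-c", f"import sys; sys.exit({exit_code})"]


def test_timeout_exit_is_success():
    assert launch(fake_command(0))        # normal exit
    assert launch(fake_command(124))      # time cap reached, still a good run
    assert not launch(fake_command(1))    # any other code is a real failure
```

Because the fake command is just the running interpreter, a test of this kind runs on any host, regardless of glibc version or CPU emulation.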

Test plan

  • python3 -m pytest shs/tests/ — 19/19 pass locally (incl. 3 new orchestrator wiring tests).
  • pilot-linux CI job green (real end-to-end on native x86_64 — first real exercise of the pipeline).

🤖 Generated with Claude Code

skhuang and others added 2 commits May 31, 2026 19:27
Adds a small-scale, --mock (no API) end-to-end exercise of the
campaign -> postprocess -> kofta-stats pipeline against the toy smoke
target, so the orchestration/extraction/stats wiring is validated before
the expensive Stage 3 campaigns spend real compute + API budget.

Surfaced and fixed three real wiring bugs:
  * campaign.launch_one treated every non-zero exit as failure, but a
    `timeout {duration}` cap returns 124 at the normal end of a run, so
    every run was marked failed and post-processing skipped. Accept 124.
  * post-processing wrote opts.csv into the cov run dir, but the undoc
    loader reads undoc/<target>/<config>/<run>/opts.csv -- mirror it there
    (same pattern as the cost record), so tab:undoc can populate.
  * kofta-stats hardcoded the paper's 9 eval binaries with no override, so
    any other target rendered as all-placeholder rows. Add --targets.
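
The opts.csv mirroring described in the second item can be sketched as follows. The path layout undoc/<target>/<config>/<run>/opts.csv is taken from the message above; the `mirror_opts` helper, its parameters, and the run-directory naming are assumptions made for illustration.

```python
import shutil
from pathlib import Path


def mirror_opts(cov_run_dir, root, target, config, run):
    """Copy opts.csv from the cov run dir to <root>/undoc/<target>/<config>/<run>/."""
    src = Path(cov_run_dir) / "opts.csv"
    dst_dir = Path(root) / "undoc" / target / config / str(run)
    dst_dir.mkdir(parents=True, exist_ok=True)   # the undoc tree may not exist yet
    dst = dst_dir / "opts.csv"
    shutil.copy2(src, dst)                       # keep the original in place, add a copy
    return dst
```

Copying rather than moving leaves the cov run dir untouched, so anything that already reads opts.csv from there keeps working.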

New pieces:
  * shs/campaign.pilot.json -- pilot spec (kofta/kshs/kshsng x smoke x 2).
  * docker/run-pilot.sh      -- builds + instruments the toy target, runs
    the pilot matrix, asserts kofta-stats emits a populated smoke row.
  * .github/workflows/pilot-linux.yml -- runs it on a native x86_64
    ubuntu:20.04 runner (the forkserver needs glibc<=2.33 AND real, non-
    emulated x86_64), plus a host-agnostic orchestrator wiring test.
  * shs/tests/test_campaign.py -- drives shs.campaign with fake fuzzer/
    showmap/opts commands (no forkserver) to regression-gate the three
    fixes on any host.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…sertion

Three fixes found while validating the Stage 2 pilot on a real Ubuntu 20.04
(glibc 2.31, system python3.8) image:

- kofta-opts / kofta-stats: add `from __future__ import annotations` so the
  PEP 585 `list[...]`/`dict[...]` hints parse on python3.8 (Ubuntu 20.04's
  default), which otherwise raises "TypeError: 'type' object is not subscriptable"
  on import. This is the exact environment the forkserver requires.

- docker/run-pilot.sh: kofta-stats emits every table row + facts line via
  debug.psay(), which writes to STDERR. The assertion grep'd `pilot-stats.out`
  produced by `... | tee`, which only captures stdout -- so the file was empty
  and every assertion failed spuriously despite a fully populated pipeline.
  Fold stderr into the pipe with `2>&1`.
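
The stdout/stderr mix-up above can be reproduced in a few lines. In this sketch a child process writes only to stderr, as debug.psay() is described as doing; the `capture` helper and the sample text are illustrative and not part of the repo.

```python
import subprocess
import sys

# Child process that writes its only output to stderr.
CHILD = [sys.executable, "-c", "import sys; print('smoke row', file=sys.stderr)"]


def capture(cmd, fold_stderr):
    """Return what a stdout-only pipe sees, optionally folding stderr into it."""
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        # STDOUT plays the role of the shell's 2>&1; DEVNULL just keeps the demo quiet.
        stderr=subprocess.STDOUT if fold_stderr else subprocess.DEVNULL,
        text=True,
    )
    return result.stdout
```

Without folding, the captured text is empty even though the child printed a row, which is the same symptom the assertion hit.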

Verified: pilot now reports "PASS: pilot pipeline produced real tables from
real artifacts" on emulated x86_64 / Ubuntu 20.04 (smoke cov row 7 & 7 edges).
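
For the python3.8 fix, the effect of the `from __future__ import annotations` line can be seen in a small standalone module. The `summarize` function below is a made-up example, not code from kofta-opts or kofta-stats.

```python
from __future__ import annotations  # must precede all other code in the module


def summarize(rows: list[dict[str, int]]) -> dict[str, int]:
    """Sum integer columns across rows; the hints are stored as strings, not evaluated."""
    totals = {}
    for row in rows:
        for key, value in row.items():
            totals[key] = totals.get(key, 0) + value
    return totals
```

Without the first line, python3.8 evaluates `list[...]` when the function is defined and raises the TypeError quoted above. Note that this only defers annotations; a runtime expression such as `x = list[int]` would still fail on 3.8.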

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@skhuang skhuang merged commit 3ae5d07 into main May 31, 2026
6 checks passed
@skhuang skhuang deleted the skhuang/stage2-pilot branch May 31, 2026 11:55