test(producer): add parallel capture regression test#1088
Conversation
Add a regression fixture that forces workers: 2, ensuring the parallel capture code path (browser-per-worker in BeginFrame mode) is exercised in CI. All existing fixtures pin workers: 1, so this is the first test that would catch a regression in the multi-worker pool isolation fix from PR #1087. The composition is 5s @ 30fps (150 frames), which exceeds both MIN_FRAMES_PER_WORKER * 2 (60) and minParallelFrames (120), so the parallel coordinator will always split work across workers. Baseline output/output.mp4 must be generated inside Dockerfile.test before the fixture can run in CI.
Matches realistic auto-mode worker counts (4-6 on typical machines), not just the minimum (2) that triggers the bug.
Generated inside Dockerfile.test on amd64 Linux (Docker image hyperframes-producer:test) to match the CI rendering environment.
vanceingalls
left a comment
There was a problem hiding this comment.
Suggested status: Request changes.
Findings:
- Blocking: the new fixture is never run by CI. The regression workflow invokes the harness with explicit shard args (
.github/workflows/regression.yml:67-82), andparallel-capture-regressionis not listed in any shard. Becauseregression-harness.tstreats CLI names as a filter, this PR currently adds files but does not exercise the multi-worker path in CI, which is the stated purpose of the change. Please add the fixture to one shard (or change the workflow to discover all fixtures) and keep an eye on shard balance since this is a 150-frame, 4-worker render.
I did static review plus git diff --check; I did not run the Docker regression locally.
jrusso1020
left a comment
There was a problem hiding this comment.
Good instinct closing the multi-worker coverage gap, and the fixture is clean. But I have one load-bearing concern about whether this test actually catches the #1087 bug — surfaced by the prod investigation happening in parallel.
The effectiveness problem: CPU CI can't reproduce a GPU-contention crash
Dockerfile.test runs on node:22-bookworm-slim with no GPU → SwiftShader (software) rendering. It installs chrome-headless-shell@148... so BeginFrame mode is active, and the fixture's workers: 4 will drive 4 capture workers sharing the pooled browser. So far so good — the multi-worker BeginFrame path is exercised.
But the #1087 crash appears to be GPU-compositor-contention-specific. Evidence gathered today:
- Prod (
service:producer) runs multi-worker + shared-pool + BeginFrame on CPU-onlym6a.12xlargenodes, and has zeroTarget closed/ compositor crashes over 30 days. - The repro that motivated #1087 was on a
--browser-gpu(hardware) setup; the PR root-cause itself notes the failure is worst "on hosts with hardware GPU + EGL." - Confirmed in the review thread: prod doesn't hit it "because we are using only cpu."
So on SwiftShader/CPU — which is exactly what this Docker test runs — the compositor race doesn't fire. This fixture will very likely render fine at 4 workers even without #1087's per-worker isolation, which means it would pass on un-fixed code and provide no regression protection for the bug it's named after.
The decisive check (free, already running): this PR is branched off main, which does not contain #1087 yet. If the new parallel-capture-regression shard goes green in this PR's own CI, that's proof the test does not catch the crash (it passed on un-fixed code). I'd hold the PR until that shard reports — if green, the fixture needs rework before it's meaningful. (Shards were still in-progress when I looked.)
What would make it actually guard the regression:
- A GPU CI runner for this fixture — the only environment where the race reproduces. May not exist in the hyperframes fleet; if not, that's a real constraint to call out rather than ship a test that can't fail.
- Failing that, be honest about scope: rename/reframe it as "multi-worker parallel-capture path coverage" (it does guard frame-ordering / reorder-buffer / PSNR regressions in
executeParallelCapture, which is genuinely useful) — but it does not guard the BeginFrame compositor crash from #1087. Two different things.
Smaller notes
- Doc drift: PR body says "forces
workers: 2" and "assign work to both workers," butmeta.jsonhas"workers": 4. Harmless but reconcile so the next reader isn't misled. - Dead CSS: the
@keyframes count-up { content: "1" ... }block doesn't do anything —contentdoesn't apply to a regular<div>, only::before/::after. The actual counter is the rAF script (el.textContent = ...), which is fine. Drop the dead keyframes or move the counter to a pseudo-element. - Determinism: the rAF counter uses
(ts - startTime)/1000. In BeginFrame modetsis virtual so it's deterministic; just noting it relies on that.minPsnr: 30is appropriately lenient for a crash/smoke fixture.
Net: the fixture is well-built and worth having for parallel-path coverage, but as currently wired (CPU CI) it almost certainly does not catch the #1087 crash. Confirm via the shard result; if green, it needs a GPU runner or an honest rescope.
— Rames Jusso (hyperframes)
- Add fixture to shard-5 in regression.yml so CI actually runs it - Reframe description: multi-worker path coverage (frame distribution, reorder buffer, per-worker browser lifecycle), not GPU-specific crash guard — SwiftShader CI can't reproduce the hardware compositor race - Remove dead @Keyframes count-up (content doesn't apply to div) - Remove unused CSS animation reference on .counter - Regenerate golden baseline with cleaned-up HTML
|
Addressed both reviews in bd93ae9: Vance — added Russo — spot-on analysis on the SwiftShader/CPU limitation. Updated:
On the broader question: this test guards against regressions in the |
jrusso1020
left a comment
There was a problem hiding this comment.
Re-reviewed bd93ae91 (delta since my last pass on 8f888aa6). Every prior finding is resolved — clean.
Prior findings — all addressed:
-
Critical — rescope from "crash guard" to "path coverage" → ✅ resolved in
bd93ae91.meta.jsondescription now reads "Multi-worker parallel capture path coverage… Note: the GPU-specific BeginFrame compositor race (#1087) only reproduces on hardware-GPU hosts, not in SwiftShader CI." PR body has the matching Scope note. This is exactly the honest framing — the fixture guardsexecuteParallelCaptureframe-distribution / reorder-buffer / per-worker-lifecycle / PSNR, and explicitly does not claim to catch the GPU crash. -
Fixture wasn't wired into a shard → ✅ resolved.
regression.ymlshard-5 args now includeparallel-capture-regression(name matches the fixture dir). Without this it wouldn't have run in CI at all. -
Doc drift (workers 2 vs 4) → ✅ resolved.
meta.jsonworkers: 4and the PR body both say 4 consistently now. -
Dead
@keyframes count-up+ unused.counteranimation → ✅ resolved. Both removed fromsrc/index.htmlandoutput/compiled.html; the rAF script remains as the actual counter driver. Baseline regenerated (output.mp4oid updated) to match the cleaned HTML — correct, since the CSS removal changes rendered output.
Nothing re-raised; no new concerns in the delta (it's purely the reframe + shard wiring + dead-CSS removal + baseline regen).
One note on what "green" means now: this PR is branched off main without #1087, and the bug is GPU-only — so shard-5 passing on CPU CI is the intended success signal (it proves the 4-worker parallel path renders correctly and the baseline matches), not a sign of ineffectiveness. That's the correct outcome for a path-coverage test. The test plan's last unchecked box (CI shard-5 passes) is the only remaining gate — CI is re-running on this SHA now.
Determinism holds: the rAF counter uses virtual ts under BeginFrame → deterministic, and minPsnr: 30 gives margin.
Net: all feedback incorporated, scope is now honest, shard wiring correct. Good to merge once shard-5 reports green.
— Rames Jusso (hyperframes)
jrusso1020
left a comment
There was a problem hiding this comment.
Approving — all review feedback from the prior passes is addressed in bd93ae91: rescoped to multi-worker path coverage (with the explicit SwiftShader-can't-repro-the-GPU-crash note), wired into regression shard-5, workers:4 doc drift fixed, dead @keyframes count-up removed, baseline regenerated to match. Scope is now honest and the fixture meaningfully covers the executeParallelCapture path. Last gate is shard-5 reporting green on this SHA before merge.
vanceingalls
left a comment
There was a problem hiding this comment.
Suggested status: Approve from code review; wait for regression shards to finish, especially shard-5.
The previous blocker is fixed: parallel-capture-regression is now included in .github/workflows/regression.yml shard-5, so CI actually runs the new fixture.
No blocking issues found in the PR-scoped diff.
Non-blocking notes:
- The fixture uses raw
requestAnimationFrame, so the compiler routes it through screenshot capture mode. That is fine for multi-worker capture coverage, and the PR body now correctly scopes out the hardware-GPU BeginFrame race from #1087. - The PR body/meta still overstate streaming-encoder coverage:
shouldUseStreamingEncode(..., workerCount=4)returns false, so this fixture covers disk-based parallel capture, not the FFmpeg stdin reorder buffer / streaming ordered-write path. I would tighten that wording before merge to avoid future confusion. git diff --checkreports trailing whitespace in generatedoutput/compiled.htmlline 114; CI format is green, so I am treating this as generated-file noise rather than a blocker.
…ure test The composition used requestAnimationFrame for a frame counter and CSS @Keyframes for animations, which triggered screenshot capture mode (non-deterministic across workers) and caused 29 PSNR failures in CI. All animations now use the GSAP timeline, keeping the render in deterministic BeginFrame mode. Baseline regenerated in Docker.
Summary
Adds a regression test that exercises the multi-worker parallel capture path (
workers: 4), which previously had zero CI coverage. Every existing test either pinnedworkers: 1or used compositions too short for parallel to activate.This test guards against regressions in:
executeParallelCaptureframe distribution across 4 workersScope note: The GPU-specific BeginFrame compositor race (#1087) only reproduces on hardware-GPU hosts. SwiftShader CI exercises the parallel code path but cannot trigger the compositor contention. A hardware-GPU CI runner would be needed to guard that specific crash — tracked separately.
Changes
packages/producer/tests/parallel-capture-regression/— new fixture with 5s CSS animation composition (150 frames at 30fps),workers: 4.github/workflows/regression.yml— fixture added to shard-5Dockerfile.teston amd64 LinuxTest plan