Benchmark Guard: Block PRs When Direct Dependencies Change
Summary
When a nexus package (e.g. terratorch, tokamind, bmfm-targets) is updated in
pyproject.toml, benchmark results may change — the models may behave differently under
the new package version. PRs that introduce such changes must be blocked from merging until
benchmarks have been re-run and their results reviewed.
The mechanism works in three parts:
- Detection: the PR CI pipeline detects whether the PR changes any direct nexus package
dependency across the three variants (ecosystem, candidate, product).
- Blocking: when a change is detected, the CI adds a pending-benchmarks label to the PR and
fails with a clear error message. Every subsequent pipeline run also checks for the label,
keeping the PR blocked until the label is removed.
- Clearance: the benchmark status checker removes the pending-benchmarks label automatically
when all benchmark runs for that PR have completed successfully. If any run failed, the
label stays.
Sub-Tasks
1 — Script: detect nexus package dependency changes
Detect whether the PR changes any direct dependency in pyproject.toml compared to main.
The script compares resolved requirements per variant using uv export. If the resolved
packages for a variant differ between main and the PR branch, that variant is flagged.
Expected outcomes:
- Exits 0 when no nexus-related dependency changes are detected.
- Exits 1 and prints a clear summary (which packages changed, in which variants) when changes
are found.
- Self-contained and reusable from the PR pipeline.
Scope note: Detection is based on resolved requirements (via uv export), not raw
pyproject.toml text. A transitive update that reaches a nexus package also triggers the guard —
this is intentional and conservative.
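A minimal sketch of the comparison described above, assuming the uv export output for main and for the PR branch has already been captured as requirements-style text, once per variant. The function names, the NEXUS_PACKAGES watch list, and the restriction to name==version pins are illustrative assumptions, not the actual script; packages pinned by URL or git reference would need extra handling.

```python
import re

# Hypothetical watch list, taken from the example package names above.
NEXUS_PACKAGES = {"terratorch", "tokamind", "bmfm-targets"}

# Matches `name==version`, with optional extras such as `name[extra]==version`.
PIN = re.compile(r"^([A-Za-z0-9][A-Za-z0-9._-]*)(?:\[[^\]]*\])?==(\S+)$")


def normalize(name):
    """Lowercase a package name and collapse runs of '-', '_' and '.' into '-'."""
    return re.sub(r"[-_.]+", "-", name).lower()


def parse_pins(requirements_text):
    """Map normalized package name -> version for every `name==version` line."""
    pins = {}
    for raw in requirements_text.splitlines():
        # Drop environment markers, comments, and trailing line continuations.
        line = raw.split(";")[0].split("#")[0].strip().rstrip("\\").strip()
        match = PIN.match(line)
        if match:
            pins[normalize(match.group(1))] = match.group(2)
    return pins


def diff_variant(main_text, pr_text, watched=NEXUS_PACKAGES):
    """Return {package: (version_on_main, version_on_pr)} for watched packages that differ."""
    before, after = parse_pins(main_text), parse_pins(pr_text)
    return {
        name: (before.get(name), after.get(name))
        for name in sorted(normalize(w) for w in watched)
        if before.get(name) != after.get(name)
    }


def report(changes_by_variant):
    """Print a summary and return the exit code: 1 if any variant changed, else 0."""
    changed = {v: c for v, c in changes_by_variant.items() if c}
    for variant, changes in changed.items():
        for name, (old, new) in changes.items():
            print(f"[{variant}] {name}: {old} -> {new}")
    return 1 if changed else 0
```

A caller would build one entry per variant, for example report({"ecosystem": diff_variant(main_text, pr_text)}), and pass the return value to sys.exit. Comparing parsed pins rather than raw text keeps hash or comment differences from triggering the guard.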
2 — Script: manage the pending-benchmarks label
Encapsulate GitHub label management in a dedicated script with three modes:
- apply: adds the pending-benchmarks label to the open PR.
- check: checks whether the label is present; exits 1 (blocks the pipeline) if it is.
- remove: removes the label once benchmarks have passed.
The script receives the PR number and repository URL via environment variables so it is portable
across both the PR pipeline and the benchmark status checker contexts.
Expected outcomes:
- apply: label added, exits 0.
- check: exits 1 with a clear message if the label is present; exits 0 otherwise.
- remove: label removed, exits 0.
- Invalid argument: prints usage and exits 1.
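The three modes could be sketched as follows, assuming the GitHub CLI (gh) is available in both contexts. The environment variable names PR_NUMBER and REPO_URL, the script name in the usage string, and the injectable run parameter are hypothetical; REPO_URL is assumed to hold a value that the --repo flag of gh accepts, such as OWNER/REPO.

```python
import os
import subprocess
import sys

LABEL = "pending-benchmarks"
USAGE = "usage: label_guard.py {apply|check|remove}"  # hypothetical script name


def build_command(mode, pr_number, repo):
    """Return the gh argv for a mode, or None if the mode is unknown."""
    if mode == "apply":
        return ["gh", "pr", "edit", pr_number, "--repo", repo, "--add-label", LABEL]
    if mode == "remove":
        return ["gh", "pr", "edit", pr_number, "--repo", repo, "--remove-label", LABEL]
    if mode == "check":
        # Prints one label name per line.
        return ["gh", "pr", "view", pr_number, "--repo", repo,
                "--json", "labels", "--jq", ".labels[].name"]
    return None


def main(argv, env=os.environ, run=subprocess.run):
    """Run one mode and return the process exit code."""
    if len(argv) != 1 or build_command(argv[0], "", "") is None:
        print(USAGE, file=sys.stderr)
        return 1
    mode = argv[0]
    # PR_NUMBER and REPO_URL are assumed names for the environment variables.
    command = build_command(mode, env["PR_NUMBER"], env["REPO_URL"])
    result = run(command, capture_output=True, text=True, check=True)
    if mode == "check" and LABEL in result.stdout.splitlines():
        print(f"Blocked: the '{LABEL}' label is present; benchmarks must pass first.",
              file=sys.stderr)
        return 1
    return 0

# A script entry point would end with: sys.exit(main(sys.argv[1:]))
```

Passing run as a parameter keeps the mode logic testable without calling GitHub. With check=True, a failing gh call raises an exception instead of being silently treated as "label absent".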
3 — Wire detection + labelling into the PR pipeline
Extend the PR pipeline's unit-test step to:
- Run the dependency-change detection script. If changes are found, apply the
pending-benchmarks label and fail.
- Run the label-check script to block the PR if the label is already present (e.g. from a prior
run that detected changes).
A PR cannot merge as long as the label is present, regardless of how many times the pipeline
reruns.
Expected outcomes:
- A PR that changes a nexus package dependency has the label applied and the pipeline fails with
a clear, actionable message.
- A PR that already carries the label also fails, even if no new changes are detected in the
current run.
- Existing CI checks continue to run after the guard.
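The decision the guard step makes can be written as a small pure function. This is a sketch only: the GuardDecision type, its field names, and the message wording are assumptions, and the two inputs stand for the results of the detection script and the label check described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuardDecision:
    apply_label: bool  # whether the step should add the pending-benchmarks label
    exit_code: int     # 0 lets the pipeline continue, 1 fails the step
    message: str


def guard_decision(changes_detected, label_present):
    """Combine the detection result and the label state into one outcome."""
    if changes_detected:
        return GuardDecision(
            apply_label=not label_present,
            exit_code=1,
            message="Nexus package dependencies changed; benchmarks must be "
                    "re-run before this PR can merge.",
        )
    if label_present:
        return GuardDecision(
            apply_label=False,
            exit_code=1,
            message="The pending-benchmarks label is still present; waiting "
                    "for benchmarks to pass.",
        )
    return GuardDecision(False, 0, "No nexus package dependency changes detected.")
```

Only the case with no detected change and no label returns 0, which matches the rule that a labelled PR stays blocked however often the pipeline reruns.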
4 — Remove the label automatically when benchmarks pass
Extend the benchmark status checker to remove the pending-benchmarks label from the PR after
all Ray benchmark jobs complete successfully. If any job failed, the label is left in place.
Expected outcomes:
- All jobs successful → pending-benchmarks label removed → PR can merge.
- Any job failed → label remains → PR stays blocked, author is notified.
- The existing PR comment update and notification steps are unchanged.
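The clearance rule can be sketched as below, assuming the status checker has each job's final status as a string. The status value SUCCEEDED, the job names, and the function name are illustrative assumptions.

```python
def clearance(job_statuses):
    """Decide whether the label may be removed.

    job_statuses maps a job name to its final status string. Returns
    (remove_label, unsuccessful_jobs). The label is removed only when at
    least one job ran and every job succeeded.
    """
    unsuccessful = sorted(
        name for name, status in job_statuses.items() if status != "SUCCEEDED"
    )
    # An empty mapping must not clear the label: all() over nothing is True.
    remove_label = bool(job_statuses) and not unsuccessful
    return remove_label, unsuccessful
```

Treating every status other than success as blocking keeps stopped or still-running jobs from clearing the label, and the returned job names can feed the notification to the author.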
5 — Document the pending-benchmarks workflow
Add documentation to the repository explaining:
- What the pending-benchmarks label means and what triggers it.
- Why it exists (benchmark results may change when a package version changes).
- The end-to-end flow: dependency change detected → label applied → PR blocked → benchmarks
triggered → if all pass, label removed → PR can merge.
- What happens if benchmarks fail: label remains, PR stays blocked.
Expected outcomes:
- A new document in docs/contributing/ covers the benchmark guard workflow end-to-end.
- The document is linked from docs/contributing/add_new_nexus_package.md.
- The new page is added to the documentation site navigation.
Related: #148
Prerequisites
- The pending-benchmarks label must exist in the algorithm-nexus GitHub repository before
the scripts can use it. This is a one-time manual step (e.g. gh label create pending-benchmarks --color …).
- Autoupdate PRs: the autoupdate workflow opens PRs automatically. The pipeline checks those
PRs too, so if an autoupdate bumps a nexus package, the pending-benchmarks label is applied
as for any other PR.