feat(ci): require running benchmarks when PRs update nexus packages #148

Description

@AlessandroPomponio

Benchmark Guard: Block PRs When Direct Dependencies Change

Summary

When a nexus package (e.g. terratorch, tokamind, bmfm-targets) is updated in
pyproject.toml, benchmark results may change — the models may behave differently under
the new package version. PRs that introduce such changes must be blocked from merging until
benchmarks have been re-run and their results reviewed.

The mechanism works in three parts:

  1. Detection — the PR CI pipeline detects whether the PR changes any direct nexus package
    dependency across the three variants (ecosystem, candidate, product).
  2. Blocking — when a change is detected, the CI adds a pending-benchmarks label to the PR
    and fails with a clear error message. Every subsequent pipeline run also checks for the label,
    keeping the PR blocked until the label is removed.
  3. Clearance — the benchmark status checker removes the pending-benchmarks label
    automatically when all benchmark runs for that PR have completed successfully. If any run
    failed, the label stays.

Sub-Tasks

1 — Script: detect nexus package dependency changes

Detect whether the PR changes any direct nexus package dependency declared in pyproject.toml, compared to main.

The script compares resolved requirements per variant using uv export. If the resolved
packages for a variant differ between main and the PR branch, that variant is flagged.

Expected outcomes:

  • Exits 0 when no nexus-related dependency changes are detected.
  • Exits 1 and prints a clear summary (which packages changed, in which variants) when changes
    are found.
  • Self-contained and reusable from the PR pipeline.

Scope note: Detection is based on resolved requirements (via uv export), not raw
pyproject.toml text. A transitive update that reaches a nexus package also triggers the guard —
this is intentional and conservative.
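The comparison step could be sketched as follows. This is a minimal illustration, not the project's script: it assumes the resolved requirements for each variant have already been exported as requirements.txt-style text (for example with uv export, once on main and once on the PR branch), and it only compares the pins of the nexus packages. The NEXUS_PACKAGES set and the function names are hypothetical, and how each variant maps to uv export options is not stated here, so the sketch takes the exported text as input.

```python
import re

# Hypothetical watch list; the issue names these three packages as examples.
NEXUS_PACKAGES = {"terratorch", "tokamind", "bmfm-targets"}


def normalize(name: str) -> str:
    """Normalize a package name (case and '-', '_', '.' runs), as in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def parse_requirements(text: str) -> dict[str, str]:
    """Map package name -> pinned spec from requirements.txt-style text."""
    pins: dict[str, str] = {}
    for raw in text.splitlines():
        # Drop line continuations left over from hash-annotated exports.
        line = raw.strip().rstrip("\\").strip()
        # Skip blanks, comments, and option lines such as --hash or -e.
        if not line or line.startswith(("#", "-")):
            continue
        line = line.split(";", 1)[0].strip()  # drop environment markers
        match = re.match(r"([A-Za-z0-9][A-Za-z0-9._-]*)\s*(.*)", line)
        if match:
            pins[normalize(match.group(1))] = match.group(2).strip()
    return pins


def changed_nexus_packages(main_text: str, pr_text: str) -> dict[str, tuple]:
    """Return {package: (main_spec, pr_spec)} for nexus packages that differ."""
    main_pins = parse_requirements(main_text)
    pr_pins = parse_requirements(pr_text)
    changes = {}
    for name in sorted(normalize(n) for n in NEXUS_PACKAGES):
        before, after = main_pins.get(name), pr_pins.get(name)
        if before != after:
            changes[name] = (before, after)
    return changes


def detect(exports_by_variant: dict[str, tuple[str, str]]) -> int:
    """Print a per-variant summary; return 1 if anything changed, else 0."""
    flagged = False
    for variant, (main_text, pr_text) in exports_by_variant.items():
        changes = changed_nexus_packages(main_text, pr_text)
        for name, (before, after) in changes.items():
            flagged = True
            print(f"[{variant}] {name}: {before or 'absent'} -> {after or 'absent'}")
    return 1 if flagged else 0
```

A caller would pass one (main, PR) pair of exported text per variant, e.g. detect({"ecosystem": (main_txt, pr_txt)}), and use the return value as the process exit code. Restricting the comparison to the watched packages is a design choice of this sketch; flagging any difference in the resolved set would be the more conservative alternative.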


2 — Script: manage the pending-benchmarks label

Encapsulate GitHub label management in a dedicated script with three modes:

  • apply — adds the pending-benchmarks label to the open PR.
  • check — checks whether the label is present; exits 1 (blocks the pipeline) if it is.
  • remove — removes the label once benchmarks have passed.

The script receives the PR number and repository URL via environment variables so it is portable
across both the PR pipeline and the benchmark status checker contexts.

Expected outcomes:

  • apply: label added, exits 0.
  • check: exits 1 with a clear message if the label is present; exits 0 otherwise.
  • remove: label removed, exits 0.
  • Invalid argument: prints usage and exits 1.
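The mode dispatch and exit codes listed above might look like the sketch below. It is an illustration under assumptions: LabelClient is a hypothetical interface standing in for whatever talks to GitHub (the REST API or the gh CLI), InMemoryClient exists only so the logic can be exercised without network access, and the script name in the usage string is made up. Reading the PR number and repository URL from environment variables would happen where the real client is constructed and is omitted here.

```python
import sys
from typing import Protocol

LABEL = "pending-benchmarks"
MODES = ("apply", "check", "remove")


class LabelClient(Protocol):
    """Hypothetical interface; a real client would call GitHub for one PR."""

    def has_label(self, label: str) -> bool: ...
    def add_label(self, label: str) -> None: ...
    def remove_label(self, label: str) -> None: ...


class InMemoryClient:
    """Stand-in that keeps labels in a set, used only to exercise the logic."""

    def __init__(self, labels=()):
        self.labels = set(labels)

    def has_label(self, label: str) -> bool:
        return label in self.labels

    def add_label(self, label: str) -> None:
        self.labels.add(label)

    def remove_label(self, label: str) -> None:
        self.labels.discard(label)


def run(argv: list[str], client: LabelClient) -> int:
    """Dispatch on the mode argument and return the process exit code."""
    if len(argv) != 1 or argv[0] not in MODES:
        # Hypothetical script name, for illustration only.
        print("usage: manage_label.py {apply|check|remove}", file=sys.stderr)
        return 1
    mode = argv[0]
    if mode == "apply":
        client.add_label(LABEL)
        return 0
    if mode == "remove":
        client.remove_label(LABEL)
        return 0
    # mode == "check": a present label blocks the pipeline.
    if client.has_label(LABEL):
        print(f"Blocked: the '{LABEL}' label is present on this PR.", file=sys.stderr)
        return 1
    return 0
```

Keeping the GitHub access behind a small interface is one way to reuse the same dispatch code from both the PR pipeline and the benchmark status checker.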

3 — Wire detection + labelling into the PR pipeline

Extend the PR pipeline's unit-test step to:

  1. Run the dependency-change detection script. If changes are found, apply the
    pending-benchmarks label and fail.
  2. Run the label-check script to block the PR if the label is already present (e.g. from a prior
    run that detected changes).

A PR cannot merge as long as the label is present, regardless of how many times the pipeline
reruns.

Expected outcomes:

  • A PR that changes a nexus package dependency has the label applied and the pipeline fails with
    a clear, actionable message.
  • A PR that already carries the label also fails, even if no new changes are detected in the
    current run.
  • Existing CI checks continue to run after the guard.
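The guard's decision can be reduced to a small pure function, sketched here under assumptions: GuardResult, evaluate_guard, and the message texts are hypothetical names and wording, and the two boolean inputs stand for the outcomes of the detection script and the label check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuardResult:
    exit_code: int      # 0 lets the pipeline pass, 1 blocks the PR
    apply_label: bool   # whether the pending-benchmarks label should be added
    message: str        # text shown to the PR author


def evaluate_guard(changes_detected: bool, label_present: bool) -> GuardResult:
    """Combine the detection result and the label state into one outcome."""
    if changes_detected:
        # Applying the label again is harmless if it is already present.
        return GuardResult(
            1, True,
            "Nexus package dependencies changed; benchmarks must be re-run "
            "and reviewed before this PR can merge.",
        )
    if label_present:
        return GuardResult(
            1, False,
            "The pending-benchmarks label is still present; the PR stays "
            "blocked until benchmarks pass.",
        )
    return GuardResult(0, False, "Benchmark guard passed.")
```

The second branch is what keeps a PR blocked across reruns even when the current run detects no new changes.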

4 — Remove the label automatically when benchmarks pass

Extend the benchmark status checker to remove the pending-benchmarks label from the PR after
all Ray benchmark jobs complete successfully. If any job failed, the label is left in place.

Expected outcomes:

  • All jobs successful → pending-benchmarks label removed → PR can merge.
  • Any job failed → label remains → PR stays blocked, author is notified.
  • The existing PR comment update and notification steps are unchanged.
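The clearance rule can be sketched as a single predicate. The status string "SUCCEEDED" and the function names are assumptions for illustration; the actual values depend on how the benchmark status checker reports Ray job results.

```python
def should_clear_label(job_statuses: list[str]) -> bool:
    """Return True only when there is at least one job and all succeeded."""
    # all() is True for an empty list, so an empty result set must be
    # rejected explicitly; otherwise a PR with no recorded runs would clear.
    return bool(job_statuses) and all(s == "SUCCEEDED" for s in job_statuses)


def failed_jobs(statuses_by_job: dict[str, str]) -> list[str]:
    """List the job names that did not succeed, e.g. for the notification."""
    return sorted(name for name, s in statuses_by_job.items() if s != "SUCCEEDED")
```

Treating "no jobs found" as not cleared is a conservative choice of this sketch: the label is removed only on positive evidence that every run passed.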

5 — Document the pending-benchmarks workflow

Add documentation to the repository explaining:

  • What the pending-benchmarks label means and what triggers it.
  • Why it exists (benchmark results may change when a package version changes).
  • The end-to-end flow: dependency change detected → label applied → PR blocked → benchmarks
    triggered → if all pass, label removed → PR can merge.
  • What happens if benchmarks fail: label remains, PR stays blocked.

Expected outcomes:

  • A new document in docs/contributing/ covers the benchmark guard workflow end-to-end.
  • The document is linked from docs/contributing/add_new_nexus_package.md.
  • The new page is added to the documentation site navigation.

Related: #148


Prerequisites

  • The pending-benchmarks label must exist in the algorithm-nexus GitHub repository before
    the scripts can use it. This is a one-time manual step (e.g. gh label create pending-benchmarks --color …).
  • autoupdate PRs: the autoupdate workflow opens PRs automatically. Those PRs go through the
    same pipeline, so an auto-update that bumps a nexus package receives the pending-benchmarks
    label and is blocked like any other PR.
