ARCHVTEAMS-1583 serialize k8s-training CI and fix kuberay test variable #858

Closed
aaronbfagan wants to merge 1 commit into main from ARCHVTEAMS-1583/k8s-training-serialize-only

@aaronbfagan

Release Notes (Mandatory Description)

This PR introduces a minimal CI hardening change for k8s-training only:

  1. Serialize k8s-training Terraform Plan/Test runs across PRs.
  2. Fix a stale KubeRay test variable name in k8s-training tests.

Problem

k8s-training runs in a shared, GPU-constrained test environment. Parallel PR runs can contend for the same limited resources and produce flaky/non-deterministic failures.

Additionally, the KubeRay test suite set a stale variable name (enable_kuberay) that the module no longer declares. Terraform ignores values for undeclared variables, so the run emitted warnings and the test did not set the input it was meant to set.

Changes

1) CI serialization for k8s-training only

Updated the concurrency settings of the terraform matrix job in .github/workflows/terraform.yml:

  • k8s-training uses a fixed concurrency group: k8s-training-gpu-ci
  • Non-k8s-training solutions keep per-run unique concurrency groups
  • cancel-in-progress: false ensures queued runs wait instead of canceling active runs

This keeps the change narrowly scoped to the path with known shared-capacity contention.
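
As a rough illustration of the concurrency rule described above, the job-level configuration could look something like the following. This is a sketch, not the actual contents of terraform.yml: the matrix key `solution`, the second matrix entry, and the format of the per-run group name are assumptions made for the example. Only the group name k8s-training-gpu-ci and cancel-in-progress: false come from this PR.

```yaml
# Sketch only: the matrix key `solution`, the `example-solution` entry, and the
# per-run group format are assumptions, not the contents of terraform.yml.
name: terraform
on: pull_request

jobs:
  terraform:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        solution: [k8s-training, example-solution]
    concurrency:
      # k8s-training shares one fixed group across all PRs; every other
      # solution gets a group that is unique to this workflow run.
      group: ${{ matrix.solution == 'k8s-training' && 'k8s-training-gpu-ci' || format('{0}-{1}', matrix.solution, github.run_id) }}
      # Never cancel a run that already holds the group.
      cancel-in-progress: false
    steps:
      - run: echo "plan/test ${{ matrix.solution }}"
```

One behavior worth keeping in mind: GitHub Actions holds at most one pending job per concurrency group. If several PRs queue behind an active k8s-training run, a newer queued run can replace an older queued one, so the replaced run may need to be re-triggered.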

2) KubeRay test variable fix

Updated k8s-training/tests/k8s-training-kuberay.tftest.hcl:

  • Replaced enable_kuberay = true with enable_kuberay_cluster = true

This aligns the test with current module inputs and removes undeclared-variable warnings.

Why this approach

This is intentionally a small, self-contained change set to reduce risk:

  • No shared module behavior changes
  • No cleanup behavior changes
  • No resource naming changes
  • No impact on non-k8s-training test paths, which keep their existing per-run concurrency groups

Expected outcome

  • k8s-training Plan/Test jobs no longer execute concurrently across PRs.
  • The KubeRay test no longer emits the undeclared-variable warning for enable_kuberay.

Validation performed

  • Confirmed diff scope is limited to:
    • .github/workflows/terraform.yml
    • k8s-training/tests/k8s-training-kuberay.tftest.hcl
  • Verified workflow expression targets only k8s-training for global serialization.
  • Verified test variable now matches declared module variable name.

DoD alignment

  • k8s-training CI execution is serialized across PRs (at most one Plan/Test run at a time).
  • Test configuration reflects current module variable interface.
  • Changes are minimal and isolated to avoid regressions in unrelated paths.

@aaronbfagan aaronbfagan had a problem deploying to project-e00pjzzrtk1fs3yavy March 6, 2026 15:55 — with GitHub Actions Error
@aaronbfagan aaronbfagan had a problem deploying to project-e00pjzzrtk1fs3yavy March 6, 2026 15:57 — with GitHub Actions Failure