ARCHVTEAMS-1597/kuberay s3 version cleanup #860

Closed
aaronbfagan wants to merge 3 commits into main from ARCHVTEAMS-1597/kuberay-s3-version-cleanup

Conversation

@aaronbfagan
Collaborator

Release Notes (Mandatory Description)

This PR hardens k8s-training CI behavior for a shared, GPU-constrained test environment and fixes the kuberay test input mismatch introduced by the current test configuration.

The main change is in the k8s-training Terraform test path: instead of allowing the test to wait indefinitely when GPU provisioning stalls, the workflow now detects known fatal GPU node-group events and fails early. The cleanup flow for storage buckets is also simplified to use Nebius-native bucket deletion and purge behavior rather than recursive object/version cleanup logic in CI.

Problem

k8s-training CI runs against a shared Nebius project with limited GPU capacity. In failure scenarios, the Terraform test could remain blocked for a long time while the GPU node group stayed in PROVISIONING, even after Nebius had already surfaced terminal signals such as:

  • ComputeInstanceCreationFailed
  • NodeNotFound with a message containing STOPPED for longer than ...

At the same time, the previous cleanup approach for storage buckets was more complex than necessary, relying on explicit recursive object/version cleanup logic instead of Nebius bucket lifecycle operations.

The branch also includes a fix for the kuberay Terraform test input name so the test configuration matches the module variable actually declared by k8s-training.

Changes

  • Fixed the k8s-training kuberay test input name:
    • changed enable_kuberay to enable_kuberay_cluster
  • Updated Terraform Test behavior in CI:
    • non-k8s-training solutions still run plain terraform test
    • k8s-training now runs terraform test under a watcher
    • the watcher polls the GPU node group and fails early on:
      • ComputeInstanceCreationFailed
      • NodeNotFound where the message contains STOPPED for longer than
  • Added an outer Terraform Test safety timeout:
    • timeout-minutes: 150
  • Simplified shared bucket cleanup in cleanup-infra:
    • use nebius storage bucket delete --ttl 0
    • follow with best-effort nebius storage bucket purge
    • remove storage buckets from the generic forced-cleanup loop
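The watcher's fail-early condition can be sketched as a small shell helper. The event feed below is stubbed and the `is_fatal_event` function name is illustrative (the real watcher polls the GPU node group via the Nebius API); only the two failure signatures are taken from this PR.

```shell
#!/usr/bin/env bash
# Sketch of the watcher's fatal-event matching, with stubbed events.
set -euo pipefail

# Returns 0 (fatal) when an event matches one of the terminal GPU
# node-group failure signatures this PR fails early on.
is_fatal_event() {
  local code="$1" message="$2"
  if [ "$code" = "ComputeInstanceCreationFailed" ]; then
    return 0
  fi
  if [ "$code" = "NodeNotFound" ] && [[ "$message" == *"STOPPED for longer than"* ]]; then
    return 0
  fi
  return 1
}

# Stubbed events, one per line: code<TAB>message
events=$'NodeReady\tnode is ready\nNodeNotFound\tinstance STOPPED for longer than 10m'

while IFS=$'\t' read -r code message; do
  if is_fatal_event "$code" "$message"; then
    echo "FATAL: $code"
  fi
done <<< "$events"
```

In CI, a match would terminate the `terraform test` run instead of letting it block until the outer timeout.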

Why this approach

This keeps the failure logic targeted to the actual k8s-training GPU failure mode instead of relying only on a blunt wall-clock timeout. Healthy long-running tests can continue, while clearly broken GPU provisioning states fail fast and allow cleanup to begin sooner.

For bucket cleanup, the change aligns the workflow with Nebius-native deletion behavior and removes unnecessary complexity from CI.
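The simplified cleanup step can be sketched as a small wrapper. The two `nebius` commands and their flags (`--id`, `--ttl 0`) are the ones quoted in this PR; the `cleanup_bucket` function name is illustrative.

```shell
# Illustrative wrapper around the Nebius-native cleanup used in cleanup-infra.
cleanup_bucket() {
  local bucket_id="$1"
  # Delete the bucket with a zero TTL so it is removed immediately.
  nebius storage bucket delete --id "$bucket_id" --ttl 0
  # Best-effort purge: tolerate failure so the rest of cleanup continues.
  nebius storage bucket purge --id "$bucket_id" || true
}
```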

Validation

Validated outside the pipeline by querying a real stuck GPU node group and confirming the watcher condition matches the live fatal event:

  • code == "NodeNotFound"
  • message contains STOPPED for longer than

Also confirmed the bucket cleanup behavior manually using Nebius CLI commands:

  • nebius storage bucket delete --id <bucket_id> --ttl 0
  • nebius storage bucket purge --id <bucket_id>

Expected outcome

  • k8s-training CI no longer waits indefinitely on a GPU node group that has already entered a terminal provisioning failure state
  • cleanup can begin sooner after fatal GPU provisioning failures
  • storage bucket cleanup is simpler and uses Nebius-native lifecycle behavior
  • the kuberay test configuration matches the module input and avoids the undeclared variable failure
