ARCHVTEAMS-1597/kuberay s3 version cleanup #860

Closed
aaronbfagan wants to merge 3 commits into main from ARCHVTEAMS-1597/kuberay-s3-version-cleanup

Conversation

@aaronbfagan
Collaborator

Release Notes (Mandatory Description)

This PR hardens k8s-training CI behavior for a shared, GPU-constrained test environment and fixes the kuberay test input mismatch introduced by the current test configuration.

The main change is in the k8s-training Terraform test path: instead of allowing the test to wait indefinitely when GPU provisioning stalls, the workflow now detects known fatal GPU node-group events and fails early. The cleanup flow for storage buckets is also simplified to use Nebius-native bucket deletion and purge behavior rather than recursive object/version cleanup logic in CI.

Problem

k8s-training CI runs against a shared Nebius project with limited GPU capacity. In failure scenarios, the Terraform test could remain blocked for a long time while the GPU node group stayed in PROVISIONING, even after Nebius had already surfaced terminal signals such as:

  • ComputeInstanceCreationFailed
  • NodeNotFound with a message containing STOPPED for longer than ...

At the same time, the previous cleanup approach for storage buckets was more complex than necessary, relying on explicit recursive object/version cleanup logic instead of Nebius bucket lifecycle operations.

The branch also includes a fix for the kuberay Terraform test input name so the test configuration matches the module variable actually declared by k8s-training.

Changes

  • Fixed the k8s-training kuberay test input name:
    • changed enable_kuberay to enable_kuberay_cluster
  • Updated Terraform Test behavior in CI:
    • non-k8s-training solutions still run plain terraform test
    • k8s-training now runs terraform test under a watcher
    • the watcher polls the GPU node group and fails early on:
      • ComputeInstanceCreationFailed
      • NodeNotFound where the message contains STOPPED for longer than
  • Added an outer Terraform Test safety timeout:
    • timeout-minutes: 150
  • Simplified shared bucket cleanup in cleanup-infra:
    • use nebius storage bucket delete --ttl 0
    • follow with best-effort nebius storage bucket purge
    • remove storage buckets from the generic forced-cleanup loop
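The watcher's fail-early condition can be sketched as a small shell helper. The event feed below is stubbed and the `is_fatal_event` function name is illustrative (the real watcher polls the GPU node group via the Nebius API); only the two failure signatures are taken from this PR.

```shell
#!/usr/bin/env bash
# Sketch of the watcher's fatal-event matching, with stubbed events.
set -euo pipefail

# Returns 0 (fatal) when an event matches one of the terminal GPU
# node-group failure signatures this PR fails early on.
is_fatal_event() {
  local code="$1" message="$2"
  if [ "$code" = "ComputeInstanceCreationFailed" ]; then
    return 0
  fi
  if [ "$code" = "NodeNotFound" ] && [[ "$message" == *"STOPPED for longer than"* ]]; then
    return 0
  fi
  return 1
}

# Stubbed events, one per line: code<TAB>message
events=$'NodeReady\tnode is ready\nNodeNotFound\tinstance STOPPED for longer than 10m'

while IFS=$'\t' read -r code message; do
  if is_fatal_event "$code" "$message"; then
    echo "FATAL: $code"
  fi
done <<< "$events"
```

In CI, a match would terminate the `terraform test` run instead of letting it block until the outer timeout.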

Why this approach

This keeps the failure logic targeted to the actual k8s-training GPU failure mode instead of relying only on a blunt wall-clock timeout. Healthy long-running tests can continue, while clearly broken GPU provisioning states fail fast and allow cleanup to begin sooner.

For bucket cleanup, the change aligns the workflow with Nebius-native deletion behavior and removes unnecessary complexity from CI.
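The simplified cleanup step can be sketched as a small wrapper. The two `nebius` commands and their flags (`--id`, `--ttl 0`) are the ones quoted in this PR; the `cleanup_bucket` function name is illustrative.

```shell
# Illustrative wrapper around the Nebius-native cleanup used in cleanup-infra.
cleanup_bucket() {
  local bucket_id="$1"
  # Delete the bucket with a zero TTL so it is removed immediately.
  nebius storage bucket delete --id "$bucket_id" --ttl 0
  # Best-effort purge: tolerate failure so the rest of cleanup continues.
  nebius storage bucket purge --id "$bucket_id" || true
}
```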

Validation

Validated outside the pipeline by querying a real stuck GPU node group and confirming the watcher condition matches the live fatal event:

  • code == "NodeNotFound"
  • message contains STOPPED for longer than

Also confirmed the bucket cleanup behavior manually using Nebius CLI commands:

  • nebius storage bucket delete --id <bucket_id> --ttl 0
  • nebius storage bucket purge --id <bucket_id>

Expected outcome

  • k8s-training CI no longer waits indefinitely on a GPU node group that has already entered a terminal provisioning failure state
  • cleanup can begin sooner after fatal GPU provisioning failures
  • storage bucket cleanup is simpler and uses Nebius-native lifecycle behavior
  • the kuberay test configuration matches the module input and avoids the undeclared variable failure
