Archvteams 1597/kuberay s3 version cleanup #860
Closed
aaronbfagan wants to merge 3 commits into main
Release Notes (Mandatory Description)
This PR hardens `k8s-training` CI behavior for a shared, GPU-constrained test environment and fixes the `kuberay` test input mismatch introduced by the current test configuration.

The main change is in the `k8s-training` Terraform test path: instead of allowing the test to wait indefinitely when GPU provisioning stalls, the workflow now detects known fatal GPU node-group events and fails early. The cleanup flow for storage buckets is also simplified to use Nebius-native bucket deletion and purge behavior rather than recursive object/version cleanup logic in CI.

Problem
`k8s-training` CI runs against a shared Nebius project with limited GPU capacity. In failure scenarios, the Terraform test could remain blocked for a long time while the GPU node group stayed in `PROVISIONING`, even after Nebius had already surfaced terminal signals such as:

- `ComputeInstanceCreationFailed`
- `NodeNotFound` with `STOPPED for longer than ...`

At the same time, the previous cleanup approach for storage buckets was more complex than necessary, relying on explicit recursive object/version cleanup logic instead of Nebius bucket lifecycle operations.

The branch also includes a fix for the `kuberay` Terraform test input name so the test configuration matches the module variable actually declared by `k8s-training`.

Changes
- Fixed the `k8s-training` `kuberay` test input name:
  - `enable_kuberay` to `enable_kuberay_cluster`
- Changed `Terraform Test` behavior in CI:
  - non-`k8s-training` solutions still run plain `terraform test`
  - `k8s-training` now runs `terraform test` under a watcher
  - the watcher fails early on these GPU node-group events:
    - `ComputeInstanceCreationFailed`
    - `NodeNotFound` where the message contains `STOPPED for longer than`
- `Terraform Test` safety timeout:
  - `timeout-minutes: 150`
- `cleanup-infra` bucket cleanup now uses:
  - `nebius storage bucket delete --ttl 0`
  - `nebius storage bucket purge`

Why this approach
This keeps the failure logic targeted to the actual `k8s-training` GPU failure mode instead of relying only on a blunt wall-clock timeout. Healthy long-running tests can continue, while clearly broken GPU provisioning states fail fast and allow cleanup to begin sooner.

For bucket cleanup, the change aligns the workflow with Nebius-native deletion behavior and removes unnecessary complexity from CI.
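
To make the targeted failure condition concrete, here is a minimal Python sketch of the match rule described in this PR. It is not the watcher code from the branch. The event shape (a mapping with `code` and `message` keys), the function names, and the sample messages are assumptions for illustration. Only the two event codes and the `STOPPED for longer than` fragment come from the description.

```python
from typing import Iterable, Mapping, Optional

# Event codes and message fragment named in this PR description.
ALWAYS_FATAL_CODES = {"ComputeInstanceCreationFailed"}
NODE_NOT_FOUND = "NodeNotFound"
STOPPED_FRAGMENT = "STOPPED for longer than"


def is_fatal_event(event: Mapping[str, str]) -> bool:
    """Return True if the event matches a known fatal GPU node-group condition.

    Assumption: an event is a mapping with "code" and "message" keys.
    """
    code = event.get("code", "")
    message = event.get("message", "")
    if code in ALWAYS_FATAL_CODES:
        return True
    # NodeNotFound alone is not treated as fatal; the message must also match.
    return code == NODE_NOT_FOUND and STOPPED_FRAGMENT in message


def first_fatal_event(
    events: Iterable[Mapping[str, str]],
) -> Optional[Mapping[str, str]]:
    """Return the first fatal event in the sequence, or None if there is none."""
    for event in events:
        if is_fatal_event(event):
            return event
    return None
```

Requiring both the code and the message fragment for `NodeNotFound` keeps the check narrow, so a healthy but slow provisioning run is not failed early.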
Validation
Validated outside the pipeline by querying a real stuck GPU node group and confirming the watcher condition matches the live fatal event:
- `code == "NodeNotFound"`
- message contains `STOPPED for longer than`

Also confirmed the bucket cleanup behavior manually using Nebius CLI commands:

- `nebius storage bucket delete --id <bucket_id> --ttl 0`
- `nebius storage bucket purge --id <bucket_id>`

Expected outcome
- `k8s-training` CI no longer waits indefinitely on a GPU node group that has already entered a terminal provisioning failure state
- `kuberay` test configuration matches the module input and avoids the undeclared variable failure
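
As an illustration of the first outcome, the following Python sketch shows one way a command can be run under a watcher that stops it early. It is a sketch under assumptions, not the workflow code from this branch. `run_under_watcher`, `poll_fatal`, and the polling interval are hypothetical names and values. In CI the command would be `terraform test`, and `poll_fatal` would query the GPU node group for fatal events.

```python
import subprocess
from typing import Callable, Optional, Sequence, Tuple


def run_under_watcher(
    cmd: Sequence[str],
    poll_fatal: Callable[[], Optional[str]],
    interval_s: float = 30.0,
) -> Tuple[int, Optional[str]]:
    """Run cmd and stop it early if poll_fatal reports a fatal condition.

    poll_fatal (hypothetical) returns a short reason string when a fatal
    event has been seen, or None while everything still looks healthy.
    Returns (exit_code, reason); reason is None if cmd finished on its own.
    """
    proc = subprocess.Popen(cmd)
    while True:
        try:
            # Wait up to one interval for the command to finish by itself.
            return proc.wait(timeout=interval_s), None
        except subprocess.TimeoutExpired:
            reason = poll_fatal()
            if reason is None:
                continue  # still healthy, keep waiting
            # Fatal condition: ask the command to stop, then force it.
            proc.terminate()
            try:
                proc.wait(timeout=10)
            except subprocess.TimeoutExpired:
                proc.kill()
                proc.wait()
            return 1, reason
```

A healthy run is never interrupted because the watcher only acts when `poll_fatal` returns a reason. A wall-clock limit such as `timeout-minutes: 150` remains a separate backstop outside this loop.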