[develop][E2E Test] Add E2E test for Slurm 25.11 expedited requeue mode feature #7211
Open
hehe7318 wants to merge 5 commits into aws:develop
Extend test_fast_capacity_failover to validate the new --requeue=expedite option introduced in Slurm 25.11.2. This feature allows batch jobs to automatically requeue on node failure with highest priority.
…eue jobs are treated as highest priority.
- Change job commands from simple 'sleep 30' to output hostname and timestamps, making it easier to verify job execution in output files
- Add --prefer option to job2 targeting the same compute resource as job1
- Increase job2 node request from 1 to 2 nodes to prevent it from immediately running on another CR before job1 requeues
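The submission changes listed above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the helper `build_sbatch_args`, the feature name `ice-cr`, and the exact START/END output format are hypothetical, and placing `--requeue=expedite` on job1 is an assumption drawn from the description. Only the sbatch options themselves (`--requeue=expedite`, `--prefer`, `--nodes`, `--wrap`) come from the PR text or from sbatch.

```python
import shlex


def build_sbatch_args(command, nodes=1, prefer=None, requeue_expedite=False):
    """Build an sbatch argument list (hypothetical helper, not from the PR)."""
    args = ["sbatch", f"--nodes={nodes}"]
    if prefer:
        # Soft feature preference, used to steer job2 toward job1's compute resource.
        args.append(f"--prefer={prefer}")
    if requeue_expedite:
        # Option under test, introduced in Slurm 25.11.2 according to the PR.
        args.append("--requeue=expedite")
    args += ["--wrap", command]
    return args


# Assumed output format: hostname plus epoch timestamps around the sleep.
JOB_COMMAND = (
    'echo "START $(hostname) $(date +%s)"; sleep 30; '
    'echo "END $(hostname) $(date +%s)"'
)

# job1: assumed to be the job that hits ICE and is requeued in expedited mode.
job1_args = build_sbatch_args(JOB_COMMAND, nodes=1, requeue_expedite=True)
# job2: prefers the same compute resource and asks for 2 nodes.
job2_args = build_sbatch_args(JOB_COMMAND, nodes=2, prefer="ice-cr")

print(shlex.join(job1_args))
print(shlex.join(job2_args))
```

Requesting 2 nodes for job2 keeps it pending long enough for job1 to requeue, which is the ordering the test then inspects.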
Description of changes
Extend test_fast_capacity_failover to validate --requeue=expedite behavior
when an ICE (Insufficient Capacity Error) occurs.
Test strategy:
This validates that expedited requeue jobs are treated as highest priority in the system, even over jobs submitted earlier in the queue.
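One way such an ordering check could look, assuming each job writes `START <hostname> <epoch>` lines to its output file, is sketched below. The line format and the function names `start_times` and `requeued_job_ran_first` are hypothetical. The actual assertions in the PR are not shown on this page.

```python
import re

# Assumed line format written by each job: "START <hostname> <epoch-seconds>"
START_RE = re.compile(r"^START (\S+) (\d+)$", re.MULTILINE)


def start_times(output):
    """Return every START timestamp found in a job's output, in file order."""
    return [int(ts) for _host, ts in START_RE.findall(output)]


def requeued_job_ran_first(job1_output, job2_output):
    """True if job1's final start is no later than job2's first start."""
    job1_starts = start_times(job1_output)
    job2_starts = start_times(job2_output)
    if not job1_starts or not job2_starts:
        # A job that never started cannot confirm the ordering.
        return False
    return job1_starts[-1] <= job2_starts[0]


# Illustrative outputs only; hostnames and timestamps are made up.
sample_job1 = "START queue1-dy-cr1-1 100\nEND queue1-dy-cr1-1 130\n"
sample_job2 = "START queue1-dy-cr1-1 140\nEND queue1-dy-cr1-1 170\n"
print(requeued_job_ran_first(sample_job1, sample_job2))
```

Comparing timestamps from the output files avoids depending on scheduler log wording, at the cost of relying on node clocks being reasonably in sync.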
Tests
Checklist
develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
Please review the guidelines for contributing and Pull Request Instructions.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.