Update Slurm workflows to use GitHub-hosted runners with SSH #953
Merged
Address review feedback for NeMo 2 Performance instructions

- Add Table of Contents and improve document structure
- Add Prerequisites section with NeMo version compatibility matrix
- Update EFA installation instructions with links to AWS docs
- Add Environment Variables configuration section
- Add Expected Outputs section explaining performance metrics
- Add Multi-Node Distributed Training section with examples
- Fix 'error' placeholder with explanatory note about B200 configuration
- Add section headers for Mixtral, Nemotron, and DeepSeek models
- Improve Fine-Tuning section with better formatting and HF_TOKEN note
- Add comprehensive Troubleshooting section
- Update EFA installer version from 1.43.1 to 1.47.0 in Dockerfile

Fixes review comments from nghtm and KeitaW
Update Slurm workflows to use GitHub-hosted runners with SSH

- Migrate fsdp-regression-test-container.yml from self-hosted to ubuntu-latest + SSH
- Migrate fsdp-regression-test-venv.yml from self-hosted to ubuntu-latest + SSH
- Migrate megatron-ci-slurm.yaml from self-hosted to ubuntu-latest + SSH
- Add AWS OIDC authentication for all workflows
- Add real-time log streaming from p5en.smml.aiml.aws.dev cluster
- Add SSH retry logic and job cancellation on workflow abort
- Implement enroot image cleanup after test completion
- Add new pr-review-and-slurm-test.yml for comprehensive PR testing
- Use /fsx/agents/pr-reviews/ for code/checkpoints and /home/ghactions for logs
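The SSH retry logic mentioned above is not shown on this page. As a rough illustration only, a retry wrapper might look like the sketch below. The function name `run_with_retry`, the attempt count, and the backoff delays are all assumptions, not taken from the workflows. The sketch relies on the documented `ssh` behavior of exiting with status 255 when the connection itself fails, so remote command failures are returned immediately rather than retried.

```python
import subprocess
import time
from typing import Callable, Sequence

SSH_CONNECTION_ERROR = 255  # ssh exits with 255 when the connection itself fails


def run_with_retry(
    cmd: Sequence[str],
    attempts: int = 3,          # hypothetical default, not from the PR
    base_delay: float = 5.0,    # hypothetical default, not from the PR
    runner: Callable[..., subprocess.CompletedProcess] = subprocess.run,
    sleep: Callable[[float], None] = time.sleep,
) -> subprocess.CompletedProcess:
    """Run cmd, retrying with exponential backoff on ssh connection errors.

    A non-255 exit status is treated as the remote command's own result
    and is returned without retrying.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        result = runner(list(cmd), capture_output=True, text=True)
        if result.returncode != SSH_CONNECTION_ERROR:
            return result
        if attempt < attempts:
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    return result
```

A call such as `run_with_retry(["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=10", "user@login.example", "squeue"])` would then tolerate brief connection drops; the host name here is a placeholder.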
KeitaW added a commit that referenced this pull request on Feb 17, 2026
* NeMo 2 Performance instructions

* Update PERFORMANCE.md

* Update PERFORMANCE.md

* Address review feedback for NeMo 2 Performance instructions

- Add Table of Contents and improve document structure
- Add Prerequisites section with NeMo version compatibility matrix
- Update EFA installation instructions with links to AWS docs
- Add Environment Variables configuration section
- Add Expected Outputs section explaining performance metrics
- Add Multi-Node Distributed Training section with examples
- Fix 'error' placeholder with explanatory note about B200 configuration
- Add section headers for Mixtral, Nemotron, and DeepSeek models
- Improve Fine-Tuning section with better formatting and HF_TOKEN note
- Add comprehensive Troubleshooting section
- Update EFA installer version from 1.43.1 to 1.47.0 in Dockerfile

Fixes review comments from nghtm and KeitaW

* Update Slurm workflows to use GitHub-hosted runners with SSH

- Migrate fsdp-regression-test-container.yml from self-hosted to ubuntu-latest + SSH
- Migrate fsdp-regression-test-venv.yml from self-hosted to ubuntu-latest + SSH
- Migrate megatron-ci-slurm.yaml from self-hosted to ubuntu-latest + SSH
- Add AWS OIDC authentication for all workflows
- Add real-time log streaming from p5en.smml.aiml.aws.dev cluster
- Add SSH retry logic and job cancellation on workflow abort
- Implement enroot image cleanup after test completion
- Add new pr-review-and-slurm-test.yml for comprehensive PR testing
- Use /fsx/agents/pr-reviews/ for code/checkpoints and /home/ghactions for logs

---------

Co-authored-by: Pavel Belevich <belevichp@gmail.com>
Co-authored-by: Pavel Belevich <belevich@amazon.com>
Co-authored-by: Keita Watanabe <mlkeita@amazon.com>
Summary
This PR updates the Slurm-based workflows to use GitHub-hosted runners with SSH connections to the p5en cluster instead of self-hosted runners.
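To make the runner-to-cluster pattern concrete, here is a minimal sketch of how a GitHub-hosted runner could build the SSH commands that submit a Slurm job and cancel it if the workflow is aborted. It is not the code from the workflows. The helper names, the host string, and the script path in the usage note are hypothetical. It assumes `sbatch --parsable`, which prints the job ID (optionally followed by `;` and the cluster name), and `scancel <jobid>`.

```python
import shlex
from typing import List


def ssh_command(host: str, remote_cmd: List[str]) -> List[str]:
    """Build a non-interactive ssh invocation that runs remote_cmd on host."""
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    return ["ssh", "-o", "BatchMode=yes", host, shlex.join(remote_cmd)]


def submit_command(host: str, script: str) -> List[str]:
    """Command that submits a batch script and prints only the job ID."""
    return ssh_command(host, ["sbatch", "--parsable", script])


def cancel_command(host: str, job_id: str) -> List[str]:
    """Command that cancels a previously submitted job."""
    return ssh_command(host, ["scancel", job_id])


def parse_job_id(sbatch_output: str) -> str:
    """Extract the job ID from `sbatch --parsable` output.

    The output is "<jobid>" or "<jobid>;<cluster>".
    """
    job_id = sbatch_output.strip().split(";", 1)[0]
    if not job_id.isdigit():
        raise ValueError(f"unexpected sbatch output: {sbatch_output!r}")
    return job_id
```

In a workflow step, the runner would execute `submit_command("user@login.example", "job.sbatch")`, pass the captured stdout to `parse_job_id`, and keep the ID so that a cleanup step can run `cancel_command` when the workflow is cancelled.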
Changes
Updated Workflows
- fsdp-regression-test-container.yml: ubuntu-latest
- fsdp-regression-test-venv.yml: ubuntu-latest
- megatron-ci-slurm.yaml: ubuntu-latest
New Workflow
- <PR#>-<date>-results.json
Infrastructure Details
- /fsx/agents/pr-reviews/
- /home/ghactions/
- /fsx/agents/enroot-images/
Required Secrets
- SLURM_SSH_KEY: SSH private key for cluster access
Testing
Notes