
Update Slurm workflows to use GitHub-hosted runners with SSH #953

Merged
KeitaW merged 6 commits into main from update-slurm-workflows-ssh
Feb 16, 2026

@paragao paragao commented Feb 13, 2026

Summary

This PR updates the Slurm-based workflows to use GitHub-hosted runners with SSH connections to the p5en cluster instead of self-hosted runners.

Changes

Updated Workflows

  1. fsdp-regression-test-container.yml

    • Changed from self-hosted runners to ubuntu-latest
    • Added AWS OIDC authentication
    • Added SSH setup with retry logic
    • Container builds now happen on the cluster via SSH
    • Real-time log streaming during job execution
    • Automatic job cancellation on workflow abort
    • Enroot images are cleaned up after the last test
  2. fsdp-regression-test-venv.yml

    • Changed from self-hosted runners to ubuntu-latest
    • Added AWS OIDC authentication
    • Virtual environment creation happens on the cluster via SSH
    • Real-time log streaming during job execution
    • Automatic cleanup of all resources
  3. megatron-ci-slurm.yaml

    • Changed from self-hosted runners to ubuntu-latest
    • Added AWS OIDC authentication
    • Container builds happen on the cluster via SSH
    • Added dedicated cleanup job for enroot images
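
For reviewers who want a picture of the log streaming and cancel-on-abort behaviour listed above, here is a minimal Python sketch. It is not the code in the workflow files. `monitor_job`, `run`, `ACTIVE_STATES` and the log path are hypothetical names; `run` is assumed to execute one command on the cluster over SSH and return its stdout without raising. Only `squeue`, `scancel` and `tail` are real commands.

```python
import shlex
import time
from typing import Callable

# Assumption: these Slurm states mean the job is still in flight.
ACTIVE_STATES = {"PENDING", "CONFIGURING", "RUNNING", "COMPLETING"}


def monitor_job(
    job_id: str,
    log_path: str,
    run: Callable[[str], str],
    poll_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Print new log output each poll; cancel the Slurm job if interrupted."""
    offset = 0  # bytes of the remote log already printed

    def stream_new_output() -> None:
        nonlocal offset
        # `tail -c +N` prints from byte N (1-based), so only new bytes come back.
        chunk = run(f"tail -c +{offset + 1} {shlex.quote(log_path)}")
        if chunk:
            print(chunk, end="", flush=True)
            offset += len(chunk.encode("utf-8"))

    try:
        while True:
            # -h drops the header, %T prints the job state (for example RUNNING).
            state = run(f"squeue -h -j {shlex.quote(job_id)} -o %T").strip()
            # Read the log after the state check so the final lines are not lost.
            stream_new_output()
            if state not in ACTIVE_STATES:
                return state or "LEFT_QUEUE"
            sleep(poll_seconds)
    except KeyboardInterrupt:
        # An aborted run interrupts this process; do not leave the job running.
        run(f"scancel {shlex.quote(job_id)}")
        raise
```

The sketch relies on the interrupt reaching the Python process as `KeyboardInterrupt`; a shell step would typically use a `trap` for the same purpose.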

New Workflow

  1. pr-review-and-slurm-test.yml
    • Comprehensive PR review workflow
    • Code analysis and security scanning
    • Version validation (EFA ≥1.47.0, NCCL ≥2.28, CUDA ≥13.0)
    • Slurm testing on 8 p5en.48xlarge nodes
    • Results saved as <PR#>-<date>-results.json
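
The version validation step could look roughly like the sketch below. It illustrates the check only and is not the code in pr-review-and-slurm-test.yml. The minimums come from the list above; the function names and the way detected versions are passed in (a plain dict) are assumptions.

```python
import re

# Minimum versions as stated in this PR description.
MINIMUMS = {"EFA": "1.47.0", "NCCL": "2.28", "CUDA": "13.0"}


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '2.28.3-1' into (2, 28, 3); anything after the dotted digits is ignored."""
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError(f"not a version string: {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(found: str, minimum: str) -> bool:
    """Compare numerically, padding with zeros so '13' equals '13.0'."""
    a, b = parse_version(found), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def validate(found: dict[str, str]) -> list[str]:
    """Return one message per component that is missing or too old."""
    problems = []
    for name, minimum in MINIMUMS.items():
        version = found.get(name)
        if version is None:
            problems.append(f"{name}: version not found")
        elif not meets_minimum(version, minimum):
            problems.append(f"{name}: {version} < required {minimum}")
    return problems
```

Comparing integer tuples rather than strings matters here: as strings, "2.9" sorts after "2.28", which would let an older NCCL pass.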

Infrastructure Details

  • Cluster: p5en.smml.aiml.aws.dev
  • User: ghactions
  • AWS Role: arn:aws:iam::159553542841:role/awslabs-AOSH-GitHubActionsRole
  • Paths:
    • Code/Checkpoints: /fsx/agents/pr-reviews/
    • Logs: /home/ghactions/
    • Enroot images: /fsx/agents/enroot-images/

Required Secrets

  • SLURM_SSH_KEY: SSH private key for cluster access
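
A minimal sketch of how a job might install this secret before opening the SSH connection. It assumes the workflow exposes the secret to the step as an environment variable also named SLURM_SSH_KEY; `install_ssh_key` and the `id_slurm` file name are hypothetical.

```python
import os
from pathlib import Path


def install_ssh_key(key_text: str, directory: Path, name: str = "id_slurm") -> Path:
    """Write the private key to directory/name, readable by the owner only."""
    directory.mkdir(mode=0o700, parents=True, exist_ok=True)
    key_path = directory / name
    # Create the file as 0600 from the start so it is never group/world readable.
    fd = os.open(key_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        # Secrets pasted into a settings page often lose their final newline.
        handle.write(key_text.rstrip("\n") + "\n")
    # Also covers the case where the file already existed with looser permissions.
    os.chmod(key_path, 0o600)
    return key_path
```

Usage would be along the lines of `install_ssh_key(os.environ["SLURM_SSH_KEY"], Path.home() / ".ssh")`, with the returned path passed to `ssh -i`.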

Testing

  • FSDP container workflow
  • FSDP venv workflow
  • Megatron CI workflow
  • PR review workflow

Notes

  • SSH connections include retry logic (5 attempts for keyscan, 3 for transfers)
  • Job monitoring streams new log output every 30 seconds
  • All workflows properly cancel remote Slurm jobs on abort
  • Timeouts extended by 15 minutes to cover SSH overhead
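
The retry behaviour in the first note can be sketched as follows. This is an illustration rather than the workflow code: `retry` and `keyscan` are hypothetical helpers, and the delay between attempts is an assumption because the description does not state one. `ssh-keyscan -H` is the real OpenSSH command.

```python
import subprocess
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(
    action: Callable[[], T],
    attempts: int,
    delay_seconds: float = 5.0,  # assumed; not specified in this PR
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call action until it succeeds or the attempt budget is used up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as error:
            last_error = error
            if attempt < attempts:
                sleep(delay_seconds)
    raise RuntimeError(f"failed after {attempts} attempts") from last_error


def keyscan(host: str) -> str:
    """Fetch the host's public keys in known_hosts format."""
    result = subprocess.run(
        ["ssh-keyscan", "-H", host], check=True, capture_output=True, text=True
    )
    if not result.stdout.strip():
        # Treat an empty scan as a failure as well, so that it gets retried.
        raise RuntimeError(f"no host keys returned for {host}")
    return result.stdout
```

With these helpers, the keyscan budget from the note would read `retry(lambda: keyscan("p5en.smml.aiml.aws.dev"), attempts=5)`, and file transfers would use `attempts=3`.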

pbelevich and others added 5 commits August 5, 2025 16:49
- Add Table of Contents and improve document structure
- Add Prerequisites section with NeMo version compatibility matrix
- Update EFA installation instructions with links to AWS docs
- Add Environment Variables configuration section
- Add Expected Outputs section explaining performance metrics
- Add Multi-Node Distributed Training section with examples
- Fix 'error' placeholder with explanatory note about B200 configuration
- Add section headers for Mixtral, Nemotron, and DeepSeek models
- Improve Fine-Tuning section with better formatting and HF_TOKEN note
- Add comprehensive Troubleshooting section
- Update EFA installer version from 1.43.1 to 1.47.0 in Dockerfile

Fixes review comments from nghtm and KeitaW

- Migrate fsdp-regression-test-container.yml from self-hosted to ubuntu-latest + SSH
- Migrate fsdp-regression-test-venv.yml from self-hosted to ubuntu-latest + SSH
- Migrate megatron-ci-slurm.yaml from self-hosted to ubuntu-latest + SSH
- Add AWS OIDC authentication for all workflows
- Add real-time log streaming from p5en.smml.aiml.aws.dev cluster
- Add SSH retry logic and job cancellation on workflow abort
- Implement enroot image cleanup after test completion
- Add new pr-review-and-slurm-test.yml for comprehensive PR testing
- Use /fsx/agents/pr-reviews/ for code/checkpoints and /home/ghactions for logs

@KeitaW KeitaW left a comment

LGTM

@KeitaW KeitaW merged commit 7d81153 into main Feb 16, 2026
3 checks passed
@KeitaW KeitaW deleted the update-slurm-workflows-ssh branch February 16, 2026 02:03
KeitaW added a commit that referenced this pull request Feb 17, 2026
* NeMo 2 Performance instructions

* Update PERFORMANCE.md

* Update PERFORMANCE.md

* Address review feedback for NeMo 2 Performance instructions

- Add Table of Contents and improve document structure
- Add Prerequisites section with NeMo version compatibility matrix
- Update EFA installation instructions with links to AWS docs
- Add Environment Variables configuration section
- Add Expected Outputs section explaining performance metrics
- Add Multi-Node Distributed Training section with examples
- Fix 'error' placeholder with explanatory note about B200 configuration
- Add section headers for Mixtral, Nemotron, and DeepSeek models
- Improve Fine-Tuning section with better formatting and HF_TOKEN note
- Add comprehensive Troubleshooting section
- Update EFA installer version from 1.43.1 to 1.47.0 in Dockerfile

Fixes review comments from nghtm and KeitaW

* Update Slurm workflows to use GitHub-hosted runners with SSH

- Migrate fsdp-regression-test-container.yml from self-hosted to ubuntu-latest + SSH
- Migrate fsdp-regression-test-venv.yml from self-hosted to ubuntu-latest + SSH
- Migrate megatron-ci-slurm.yaml from self-hosted to ubuntu-latest + SSH
- Add AWS OIDC authentication for all workflows
- Add real-time log streaming from p5en.smml.aiml.aws.dev cluster
- Add SSH retry logic and job cancellation on workflow abort
- Implement enroot image cleanup after test completion
- Add new pr-review-and-slurm-test.yml for comprehensive PR testing
- Use /fsx/agents/pr-reviews/ for code/checkpoints and /home/ghactions for logs

---------

Co-authored-by: Pavel Belevich <belevichp@gmail.com>
Co-authored-by: Pavel Belevich <belevich@amazon.com>
Co-authored-by: Keita Watanabe <mlkeita@amazon.com>