[skyrl-train] Add multi-node slurm training script #766
base: main
Conversation
Code Review
This pull request introduces a comprehensive Slurm script for multi-node training, which is a great addition for HPC users. The script is well-structured and covers the key steps from environment setup to launching Ray and the training job. My review focuses on improving the script's robustness and maintainability. I've identified a critical bug in the GPU availability check for worker nodes and suggest replacing brittle sleep-based synchronization with more reliable polling mechanisms. Additionally, I've provided suggestions to make IP address detection more robust and to improve the script's portability and configuration management.
until nvidia-smi > /dev/null 2>&1; do
    echo "Waiting for GPU visibility..."
    sleep 2
done
srun --nodes=1 --ntasks=1 -w "$node_i" \
    uv run --python "$PYTHON" --active -- ray start --address "$ip_head" \
    --num-cpus "$SLURM_CPUS_PER_TASK" --num-gpus "$SLURM_GPUS_PER_NODE" --block &
This until nvidia-smi loop is intended to wait for GPUs to become visible on worker nodes, but it's executed on the launching node (the head node), not on the target worker node ($node_i). This is a critical bug, as it doesn't actually check the GPU status on the workers. To fix this, the check must be executed on the worker node itself, for example by wrapping both the check and the ray start command within the srun command's payload.
Suggested change:

srun --nodes=1 --ntasks=1 -w "$node_i" bash -c "
    until nvidia-smi > /dev/null 2>&1; do
        echo \"Waiting for GPU visibility on worker \$HOSTNAME...\"
        sleep 2
    done
    uv run --python \"$PYTHON\" --active -- ray start --address \"$ip_head\" \
        --num-cpus \"$SLURM_CPUS_PER_TASK\" --num-gpus \"$SLURM_GPUS_PER_NODE\" --block
" &
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

# Convert to IPv4 if needed
if [[ "$head_node_ip" == *" "* ]]; then
    IFS=' ' read -ra ADDR <<<"$head_node_ip"
    if [[ ${#ADDR[0]} -gt 16 ]]; then
        head_node_ip=${ADDR[1]}
    else
        head_node_ip=${ADDR[0]}
    fi
    echo "IPV6 address detected. Using IPV4: $head_node_ip"
fi
The current logic for extracting an IPv4 address from hostname --ip-address is complex and potentially brittle, as it relies on a string-length heuristic to distinguish IPv4 from IPv6. A simpler and more robust approach is to use hostname -I, which prints the host's network addresses, and select the first one.
Suggested change:

head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname -I | awk '{print $1}')
uv run --python "$PYTHON" --active -- ray start --head --node-ip-address="$head_node_ip" \
    --port=$port --num-cpus "$SLURM_CPUS_PER_TASK" --num-gpus "$SLURM_GPUS_PER_NODE" --block &

sleep 10
Using a fixed sleep to wait for the Ray head node to start is unreliable, as the startup time can vary depending on system load. This can lead to race conditions where subsequent steps fail because the head node is not yet ready. A more robust approach is to actively poll for the head node's availability, for example by using ray health-check in a loop. The same principle applies to the sleeps on lines 220 and 224 while waiting for workers.
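A rough sketch of that polling approach follows (the 120-second timeout is an arbitrary choice for illustration, and it assumes the ray CLI's health-check subcommand is reachable through the same uv environment the script already uses):

# Sketch: poll until the Ray head responds instead of sleeping a fixed time.
# Timeout value (120 s) and polling interval (5 s) are illustrative choices.
elapsed=0
until uv run --python "$PYTHON" --active -- ray health-check --address "$ip_head" > /dev/null 2>&1; do
    if (( elapsed >= 120 )); then
        echo "Timed out waiting for Ray head at $ip_head" >&2
        exit 1
    fi
    echo "Waiting for Ray head at $ip_head..."
    sleep 5
    elapsed=$((elapsed + 5))
done
echo "Ray head is up at $ip_head"

The same pattern can replace the later sleeps for worker nodes, for example by polling ray status until the expected number of nodes has joined.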
export PYTHONPATH="$PROJECT_DIR:$PYTHONPATH"

PYTHON="$REPO_ROOT/skyrl/bin/python"
The path to the Python executable is hardcoded, which ties the script to a specific virtual environment layout (skyrl/bin/python). This reduces portability. To make the script more flexible, you can allow this path to be overridden via an environment variable, while keeping the current path as a default.
Suggested change:

PYTHON=${PYTHON:-"$REPO_ROOT/skyrl/bin/python"}
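With that default in place, the interpreter can be swapped at submission time without editing the script, for example (the script filename here is only illustrative, and this relies on sbatch's default behavior of exporting the submission environment):

PYTHON=/path/to/other/venv/bin/python sbatch multinode_dapo.slurm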
srun --overlap --nodes=$NNODES --ntasks=1 -w "$head_node" \
    uv run --python "$PYTHON" --active --extra vllm \
    -m examples.algorithms.dapo.main_dapo \
    data.train_data="['$TRAIN_FILE']" \
    data.val_data="['$TEST_FILE']" \
    trainer.algorithm.advantage_estimator="$ADVANTAGE_ESTIMATOR" \
    trainer.algorithm.policy_loss_type="$POLICY_LOSS_TYPE" \
    +trainer.algorithm.overlong_buffer.len=$OVERLONG_BUFFER_LEN \
    +trainer.algorithm.overlong_buffer.penalty_factor=$OVERLONG_BUFFER_PENALTY_FACTOR \
    trainer.algorithm.loss_reduction=$LOSS_REDUCTION \
    generator.enforce_eager=$ENFORCE_EAGER \
    generator.apply_overlong_filtering=$APPLY_OVERLONG_FILTERING \
    generator.sampling_params.temperature=$TEMPERATURE \
    generator.sampling_params.top_p=$TOP_P \
    generator.eval_sampling_params.top_p=$EVAL_TOP_P \
    generator.eval_sampling_params.temperature=$TEMPERATURE \
    trainer.algorithm.use_kl_loss=$USE_KL_LOSS \
    trainer.algorithm.clip_ratio_c=$CLIP_RATIO_C \
    trainer.policy.model.path="$MODEL_NAME" \
    trainer.placement.colocate_all=$COLOCATE_ALL \
    trainer.strategy="$STRATEGY" \
    trainer.placement.policy_num_nodes=$NUM_NODES \
    trainer.placement.policy_num_gpus_per_node=$NUM_GPUS_PER_NODE \
    trainer.policy.fsdp_config.fsdp_size=$NUM_GPUS_PER_NODE \
    generator.num_inference_engines=$NUM_INFERENCE_ENGINES \
    generator.inference_engine_tensor_parallel_size=$INFERENCE_ENGINE_TENSOR_PARALLEL_SIZE \
    trainer.epochs=$EPOCHS \
    trainer.algorithm.tau_pos=$TAU_POS \
    trainer.algorithm.tau_neg=$TAU_NEG \
    trainer.eval_batch_size=$EVAL_BATCH_SIZE \
    trainer.eval_before_train=$EVAL_BEFORE_TRAIN \
    trainer.eval_interval=$EVAL_INTERVAL \
    trainer.update_epochs_per_batch=$UPDATE_EPOCHS_PER_BATCH \
    trainer.train_batch_size=$TRAIN_BATCH_SIZE \
    trainer.policy_mini_batch_size=$MINI_BATCH_SIZE \
    trainer.micro_forward_batch_size_per_gpu=$MICRO_FORWARD_BATCH_SIZE \
    trainer.micro_train_batch_size_per_gpu=$MICRO_TRAIN_BATCH_SIZE \
    trainer.ckpt_interval=$CKPT_INTERVAL \
    trainer.max_prompt_length=$MAX_PROMPT_LENGTH \
    generator.sampling_params.max_generate_length=$MAX_RESPONSE_LENGTH \
    trainer.policy.optimizer_config.lr=$LR \
    trainer.policy.optimizer_config.num_warmup_steps=$NUM_WARMUP_STEPS \
    trainer.policy.optimizer_config.weight_decay=$WEIGHT_DECAY \
    trainer.policy.optimizer_config.max_grad_norm=$MAX_GRAD_NORM \
    generator.backend="$GENERATOR_BACKEND" \
    generator.run_engines_locally=$RUN_ENGINES_LOCALLY \
    generator.weight_sync_backend="$WEIGHT_SYNC_BACKEND" \
    generator.async_engine=$ASYNC_ENGINE \
    generator.batched=$BATCHED \
    environment.env_class="$ENV_CLASS" \
    generator.n_samples_per_prompt=$N_SAMPLES_PER_PROMPT \
    generator.eval_n_samples_per_prompt=$EVAL_N_SAMPLES_PER_PROMPT \
    generator.gpu_memory_utilization=$GPU_MEMORY_UTILIZATION \
    trainer.logger="$LOGGER" \
    trainer.project_name="$PROJECT_NAME" \
    trainer.run_name="$RUN_NAME" \
    trainer.export_path="$EXPORT_PATH" \
    trainer.hf_save_interval=$HF_SAVE_INTERVAL \
    trainer.resume_mode="$RESUME_MODE" \
    trainer.max_ckpts_to_keep=$MAX_CKPTS_TO_KEEP \
    trainer.ckpt_path="$CKPT_PATH" \
    $@
This srun command has a very long list of command-line arguments, which makes the script difficult to read and maintain. Since the training script uses Hydra, a better practice would be to define these parameters in a dedicated Hydra configuration file (e.g., conf/my_experiment.yaml). The shell script would then only need to override a few dynamic values (like file paths or resource counts), making it much cleaner and the experiment configuration more reusable and version-controlled.
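As a sketch only, assuming a hypothetical conf/dapo_h200.yaml that collects the static hyperparameters and is selected via Hydra's standard --config-name flag, the launch command could shrink to just the run-specific values:

# Sketch assuming a hypothetical conf/dapo_h200.yaml holding the static
# hyperparameters; only run-specific values are overridden on the CLI.
srun --overlap --nodes=$NNODES --ntasks=1 -w "$head_node" \
    uv run --python "$PYTHON" --active --extra vllm \
    -m examples.algorithms.dapo.main_dapo \
    --config-name dapo_h200 \
    data.train_data="['$TRAIN_FILE']" \
    data.val_data="['$TEST_FILE']" \
    trainer.placement.policy_num_nodes=$NUM_NODES \
    trainer.placement.policy_num_gpus_per_node=$NUM_GPUS_PER_NODE \
    trainer.run_name="$RUN_NAME" \
    trainer.ckpt_path="$CKPT_PATH" \
    "$@"

The YAML file would then live in version control next to the example, so an experiment is reproducible from the config name plus these few overrides.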
This PR adds a full multi-node Slurm training script for running SkyRL on HPC clusters. It addresses issue #709. The script is optimized for NVIDIA H200 GPUs.
The script: