Conversation

@BounharAbdelaziz (Contributor)

This PR adds a full multi-node Slurm training script for running SkyRL on HPC clusters, addressing issue #709. The script is optimized for NVIDIA H200 GPUs.

The script:

  • Automatically sets up the environment and repository paths.
  • Launches a Ray head node and Ray worker nodes across all allocated Slurm nodes.
  • Detects GPUs and configures NCCL/UCX for fast multi-node communication (an illustrative sketch of these settings follows after this list).
  • Verifies the Ray cluster is correctly formed before training.
  • Starts SkyRL SAPO training through srun, with support for checkpointing and automatic resume.
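
For reference, the NCCL/UCX configuration mentioned above typically amounts to exporting a handful of environment variables before Ray is started. The snippet below is an illustrative sketch only, not the exact settings from this script; the interface and HCA names are cluster-specific assumptions.

    # Illustrative NCCL/UCX settings; actual values depend on the cluster fabric.
    export NCCL_DEBUG=WARN                                 # raise to INFO when debugging comms
    export NCCL_SOCKET_IFNAME=${NCCL_SOCKET_IFNAME:-ib0}   # assumed InfiniBand interface name
    export NCCL_IB_HCA=${NCCL_IB_HCA:-mlx5}                # assumed Mellanox HCA prefix
    export UCX_TLS=${UCX_TLS:-rc,sm,self}                  # UCX transports: RDMA, shared memory, loopback
    export UCX_NET_DEVICES=${UCX_NET_DEVICES:-mlx5_0:1}    # assumed IB device:port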

@gemini-code-assist (bot) left a comment:

Code Review

This pull request introduces a comprehensive Slurm script for multi-node training, which is a great addition for HPC users. The script is well-structured and covers the key steps from environment setup to launching Ray and the training job. My review focuses on improving the script's robustness and maintainability. I've identified a critical bug in the GPU availability check for worker nodes and suggest replacing brittle sleep-based synchronization with more reliable polling mechanisms. Additionally, I've provided suggestions to make IP address detection more robust and to improve the script's portability and configuration management.

Comment on lines +213 to +219
until nvidia-smi > /dev/null 2>&1; do
    echo "Waiting for GPU visibility..."
    sleep 2
done
srun --nodes=1 --ntasks=1 -w "$node_i" \
    uv run --python "$PYTHON" --active -- ray start --address "$ip_head" \
    --num-cpus "$SLURM_CPUS_PER_TASK" --num-gpus "$SLURM_GPUS_PER_NODE" --block &

critical

This until nvidia-smi loop is intended to wait for GPUs to become visible on worker nodes, but it's executed on the launching node (the head node), not on the target worker node ($node_i). This is a critical bug, as it doesn't actually check the GPU status on the workers. To fix this, the check must be executed on the worker node itself, for example by wrapping both the check and the ray start command within the srun command's payload.

Suggested change
-until nvidia-smi > /dev/null 2>&1; do
-    echo "Waiting for GPU visibility..."
-    sleep 2
-done
-srun --nodes=1 --ntasks=1 -w "$node_i" \
-    uv run --python "$PYTHON" --active -- ray start --address "$ip_head" \
-    --num-cpus "$SLURM_CPUS_PER_TASK" --num-gpus "$SLURM_GPUS_PER_NODE" --block &
+srun --nodes=1 --ntasks=1 -w "$node_i" bash -c "
+    until nvidia-smi > /dev/null 2>&1; do
+        echo \"Waiting for GPU visibility on worker \$HOSTNAME...\"
+        sleep 2
+    done
+    uv run --python \"$PYTHON\" --active -- ray start --address \"$ip_head\" \
+        --num-cpus \"$SLURM_CPUS_PER_TASK\" --num-gpus \"$SLURM_GPUS_PER_NODE\" --block
+" &

Comment on lines +175 to +186
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

# Convert to IPv4 if needed
if [[ "$head_node_ip" == *" "* ]]; then
    IFS=' ' read -ra ADDR <<<"$head_node_ip"
    if [[ ${#ADDR[0]} -gt 16 ]]; then
        head_node_ip=${ADDR[1]}
    else
        head_node_ip=${ADDR[0]}
    fi
    echo "IPV6 address detected. Using IPV4: $head_node_ip"
fi

high

The current logic for extracting an IPv4 address from hostname --ip-address is complex and potentially brittle, as it relies on string length heuristics to differentiate between IPv4 and IPv6. A more robust and simpler approach is to use hostname -I, which is designed to print the host's network addresses, and then select the first one.

Suggested change
-head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
-# Convert to IPv4 if needed
-if [[ "$head_node_ip" == *" "* ]]; then
-    IFS=' ' read -ra ADDR <<<"$head_node_ip"
-    if [[ ${#ADDR[0]} -gt 16 ]]; then
-        head_node_ip=${ADDR[1]}
-    else
-        head_node_ip=${ADDR[0]}
-    fi
-    echo "IPV6 address detected. Using IPV4: $head_node_ip"
-fi
+head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname -I | awk '{print $1}')

uv run --python "$PYTHON" --active -- ray start --head --node-ip-address="$head_node_ip" \
    --port=$port --num-cpus "$SLURM_CPUS_PER_TASK" --num-gpus "$SLURM_GPUS_PER_NODE" --block &

sleep 10

high

Using a fixed sleep to wait for the Ray head node to start is unreliable, as the startup time can vary depending on system load. This can lead to race conditions where subsequent steps fail because the head node is not yet ready. A more robust approach is to actively poll for the head node's availability, for example by using ray health-check in a loop. The same principle applies to the sleeps on lines 220 and 224 while waiting for workers.
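
A minimal polling sketch along those lines, using the ray health-check command mentioned above; the retry count and sleep interval are arbitrary choices, the exact flags may vary across Ray versions, and the uv invocation mirrors the one used elsewhere in the script:

    # Poll the head node's GCS instead of sleeping a fixed amount of time.
    for attempt in $(seq 1 30); do
        if uv run --python "$PYTHON" --active -- ray health-check --address "$ip_head" > /dev/null 2>&1; then
            echo "Ray head node is reachable at $ip_head"
            break
        fi
        echo "Waiting for Ray head node ($attempt/30)..."
        sleep 2
    done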


export PYTHONPATH="$PROJECT_DIR:$PYTHONPATH"

PYTHON="$REPO_ROOT/skyrl/bin/python"

medium

The path to the Python executable is hardcoded, which ties the script to a specific virtual environment layout (skyrl/bin/python). This reduces portability. To make the script more flexible, you can allow this path to be overridden via an environment variable, while keeping the current path as a default.

Suggested change
-PYTHON="$REPO_ROOT/skyrl/bin/python"
+PYTHON=${PYTHON:-"$REPO_ROOT/skyrl/bin/python"}

Comment on lines +242 to +303
srun --overlap --nodes=$NNODES --ntasks=1 -w "$head_node" \
uv run --python "$PYTHON" --active --extra vllm \
-m examples.algorithms.dapo.main_dapo \
data.train_data="['$TRAIN_FILE']" \
data.val_data="['$TEST_FILE']" \
trainer.algorithm.advantage_estimator="$ADVANTAGE_ESTIMATOR" \
trainer.algorithm.policy_loss_type="$POLICY_LOSS_TYPE" \
+trainer.algorithm.overlong_buffer.len=$OVERLONG_BUFFER_LEN \
+trainer.algorithm.overlong_buffer.penalty_factor=$OVERLONG_BUFFER_PENALTY_FACTOR \
trainer.algorithm.loss_reduction=$LOSS_REDUCTION \
generator.enforce_eager=$ENFORCE_EAGER \
generator.apply_overlong_filtering=$APPLY_OVERLONG_FILTERING \
generator.sampling_params.temperature=$TEMPERATURE \
generator.sampling_params.top_p=$TOP_P \
generator.eval_sampling_params.top_p=$EVAL_TOP_P \
generator.eval_sampling_params.temperature=$TEMPERATURE \
trainer.algorithm.use_kl_loss=$USE_KL_LOSS \
trainer.algorithm.clip_ratio_c=$CLIP_RATIO_C \
trainer.policy.model.path="$MODEL_NAME" \
trainer.placement.colocate_all=$COLOCATE_ALL \
trainer.strategy="$STRATEGY" \
trainer.placement.policy_num_nodes=$NUM_NODES \
trainer.placement.policy_num_gpus_per_node=$NUM_GPUS_PER_NODE \
trainer.policy.fsdp_config.fsdp_size=$NUM_GPUS_PER_NODE \
generator.num_inference_engines=$NUM_INFERENCE_ENGINES \
generator.inference_engine_tensor_parallel_size=$INFERENCE_ENGINE_TENSOR_PARALLEL_SIZE \
trainer.epochs=$EPOCHS \
trainer.algorithm.tau_pos=$TAU_POS \
trainer.algorithm.tau_neg=$TAU_NEG \
trainer.eval_batch_size=$EVAL_BATCH_SIZE \
trainer.eval_before_train=$EVAL_BEFORE_TRAIN \
trainer.eval_interval=$EVAL_INTERVAL \
trainer.update_epochs_per_batch=$UPDATE_EPOCHS_PER_BATCH \
trainer.train_batch_size=$TRAIN_BATCH_SIZE \
trainer.policy_mini_batch_size=$MINI_BATCH_SIZE \
trainer.micro_forward_batch_size_per_gpu=$MICRO_FORWARD_BATCH_SIZE \
trainer.micro_train_batch_size_per_gpu=$MICRO_TRAIN_BATCH_SIZE \
trainer.ckpt_interval=$CKPT_INTERVAL \
trainer.max_prompt_length=$MAX_PROMPT_LENGTH \
generator.sampling_params.max_generate_length=$MAX_RESPONSE_LENGTH \
trainer.policy.optimizer_config.lr=$LR \
trainer.policy.optimizer_config.num_warmup_steps=$NUM_WARMUP_STEPS \
trainer.policy.optimizer_config.weight_decay=$WEIGHT_DECAY \
trainer.policy.optimizer_config.max_grad_norm=$MAX_GRAD_NORM \
generator.backend="$GENERATOR_BACKEND" \
generator.run_engines_locally=$RUN_ENGINES_LOCALLY \
generator.weight_sync_backend="$WEIGHT_SYNC_BACKEND" \
generator.async_engine=$ASYNC_ENGINE \
generator.batched=$BATCHED \
environment.env_class="$ENV_CLASS" \
generator.n_samples_per_prompt=$N_SAMPLES_PER_PROMPT \
generator.eval_n_samples_per_prompt=$EVAL_N_SAMPLES_PER_PROMPT \
generator.gpu_memory_utilization=$GPU_MEMORY_UTILIZATION \
trainer.logger="$LOGGER" \
trainer.project_name="$PROJECT_NAME" \
trainer.run_name="$RUN_NAME" \
trainer.export_path="$EXPORT_PATH" \
trainer.hf_save_interval=$HF_SAVE_INTERVAL \
trainer.resume_mode="$RESUME_MODE" \
trainer.max_ckpts_to_keep=$MAX_CKPTS_TO_KEEP \
trainer.ckpt_path="$CKPT_PATH" \
$@
\ No newline at end of file

medium

This srun command has a very long list of command-line arguments, which makes the script difficult to read and maintain. Since the training script uses Hydra, a better practice would be to define these parameters in a dedicated Hydra configuration file (e.g., conf/my_experiment.yaml). The shell script would then only need to override a few dynamic values (like file paths or resource counts), making it much cleaner and the experiment configuration more reusable and version-controlled.
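
As a hedged illustration of that suggestion (the file name, its contents, and the use of Hydra's --config-name flag are assumptions, not part of this PR), the launch could shrink to something like the following once the static hyperparameters live in a config such as conf/h200_multinode.yaml on Hydra's config search path (or one added via --config-dir):

    # Sketch only: assumes a hypothetical conf/h200_multinode.yaml holding the static hyperparameters.
    srun --overlap --nodes=$NNODES --ntasks=1 -w "$head_node" \
        uv run --python "$PYTHON" --active --extra vllm \
        -m examples.algorithms.dapo.main_dapo \
        --config-name h200_multinode \
        data.train_data="['$TRAIN_FILE']" \
        data.val_data="['$TEST_FILE']" \
        trainer.placement.policy_num_nodes=$NUM_NODES \
        trainer.placement.policy_num_gpus_per_node=$NUM_GPUS_PER_NODE \
        trainer.ckpt_path="$CKPT_PATH" \
        trainer.run_name="$RUN_NAME" \
        "$@"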

trainer.resume_mode="$RESUME_MODE" \
trainer.max_ckpts_to_keep=$MAX_CKPTS_TO_KEEP \
trainer.ckpt_path="$CKPT_PATH" \
$@
\ No newline at end of file

medium

The file is missing a newline character at the end. It's a POSIX standard and a common convention to end text files with a newline. This prevents issues with certain tools like cat and can avoid unexpected behavior in shell scripts.

@erictang000 self-assigned this Dec 10, 2025