[skyrl-train] Add multi-node slurm training script #766
base: main
Conversation
Code Review
This pull request introduces a comprehensive Slurm script for multi-node training, which is a great addition for HPC users. The script is well-structured and covers the key steps from environment setup to launching Ray and the training job. My review focuses on improving the script's robustness and maintainability. I've identified a critical bug in the GPU availability check for worker nodes and suggest replacing brittle sleep-based synchronization with more reliable polling mechanisms. Additionally, I've provided suggestions to make IP address detection more robust and to improve the script's portability and configuration management.
until nvidia-smi > /dev/null 2>&1; do
    echo "Waiting for GPU visibility..."
    sleep 2
done
srun --nodes=1 --ntasks=1 -w "$node_i" \
    uv run --python "$PYTHON" --active -- ray start --address "$ip_head" \
    --num-cpus "$SLURM_CPUS_PER_TASK" --num-gpus "$SLURM_GPUS_PER_NODE" --block &
This until nvidia-smi loop is intended to wait for GPUs to become visible on worker nodes, but it's executed on the launching node (the head node), not on the target worker node ($node_i). This is a critical bug, as it doesn't actually check the GPU status on the workers. To fix this, the check must be executed on the worker node itself, for example by wrapping both the check and the ray start command within the srun command's payload.
Suggested change:

srun --nodes=1 --ntasks=1 -w "$node_i" bash -c "
    until nvidia-smi > /dev/null 2>&1; do
        echo \"Waiting for GPU visibility on worker \$HOSTNAME...\"
        sleep 2
    done
    uv run --python \"$PYTHON\" --active -- ray start --address \"$ip_head\" \
        --num-cpus \"$SLURM_CPUS_PER_TASK\" --num-gpus \"$SLURM_GPUS_PER_NODE\" --block
" &
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

# Convert to IPv4 if needed
if [[ "$head_node_ip" == *" "* ]]; then
    IFS=' ' read -ra ADDR <<<"$head_node_ip"
    if [[ ${#ADDR[0]} -gt 16 ]]; then
        head_node_ip=${ADDR[1]}
    else
        head_node_ip=${ADDR[0]}
    fi
    echo "IPV6 address detected. Using IPV4: $head_node_ip"
fi
The current logic for extracting an IPv4 address from hostname --ip-address is complex and potentially brittle, as it relies on a string-length heuristic to distinguish IPv4 from IPv6. A simpler and more robust approach is to use hostname -I, which prints the host's network addresses, and select the first one.
Suggested change:

head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname -I | awk '{print $1}')
uv run --python "$PYTHON" --active -- ray start --head --node-ip-address="$head_node_ip" \
    --port=$port --num-cpus "$SLURM_CPUS_PER_TASK" --num-gpus "$SLURM_GPUS_PER_NODE" --block &

sleep 10
Using a fixed sleep to wait for the Ray head node to start is unreliable, as the startup time can vary depending on system load. This can lead to race conditions where subsequent steps fail because the head node is not yet ready. A more robust approach is to actively poll for the head node's availability, for example by using ray health-check in a loop. The same principle applies to the sleeps on lines 220 and 224 while waiting for workers.
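A rough sketch of that polling approach follows (the 120-second timeout is an arbitrary choice for illustration, and it assumes the ray CLI's health-check subcommand is reachable through the same uv environment the script already uses):

# Sketch: poll until the Ray head responds instead of sleeping a fixed time.
# Timeout value (120 s) and polling interval (5 s) are illustrative choices.
elapsed=0
until uv run --python "$PYTHON" --active -- ray health-check --address "$ip_head" > /dev/null 2>&1; do
    if (( elapsed >= 120 )); then
        echo "Timed out waiting for Ray head at $ip_head" >&2
        exit 1
    fi
    echo "Waiting for Ray head at $ip_head..."
    sleep 5
    elapsed=$((elapsed + 5))
done
echo "Ray head is up at $ip_head"

The same pattern can replace the later sleeps for worker nodes, for example by polling ray status until the expected number of nodes has joined.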
export PYTHONPATH="$PROJECT_DIR:$PYTHONPATH"

PYTHON="$REPO_ROOT/skyrl/bin/python"
The path to the Python executable is hardcoded, which ties the script to a specific virtual environment layout (skyrl/bin/python). This reduces portability. To make the script more flexible, you can allow this path to be overridden via an environment variable, while keeping the current path as a default.
Suggested change:

PYTHON=${PYTHON:-"$REPO_ROOT/skyrl/bin/python"}
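With that default in place, the interpreter can be swapped at submission time without editing the script, for example (the script filename here is only illustrative, and this relies on sbatch's default behavior of exporting the submission environment):

PYTHON=/path/to/other/venv/bin/python sbatch multinode_dapo.slurm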
srun --overlap --nodes=$NNODES --ntasks=1 -w "$head_node" \
    uv run --python "$PYTHON" --active --extra vllm \
    -m examples.algorithms.dapo.main_dapo \
    data.train_data="['$TRAIN_FILE']" \
    data.val_data="['$TEST_FILE']" \
    trainer.algorithm.advantage_estimator="$ADVANTAGE_ESTIMATOR" \
    trainer.algorithm.policy_loss_type="$POLICY_LOSS_TYPE" \
    +trainer.algorithm.overlong_buffer.len=$OVERLONG_BUFFER_LEN \
    +trainer.algorithm.overlong_buffer.penalty_factor=$OVERLONG_BUFFER_PENALTY_FACTOR \
    trainer.algorithm.loss_reduction=$LOSS_REDUCTION \
    generator.enforce_eager=$ENFORCE_EAGER \
    generator.apply_overlong_filtering=$APPLY_OVERLONG_FILTERING \
    generator.sampling_params.temperature=$TEMPERATURE \
    generator.sampling_params.top_p=$TOP_P \
    generator.eval_sampling_params.top_p=$EVAL_TOP_P \
    generator.eval_sampling_params.temperature=$TEMPERATURE \
    trainer.algorithm.use_kl_loss=$USE_KL_LOSS \
    trainer.algorithm.clip_ratio_c=$CLIP_RATIO_C \
    trainer.policy.model.path="$MODEL_NAME" \
    trainer.placement.colocate_all=$COLOCATE_ALL \
    trainer.strategy="$STRATEGY" \
    trainer.placement.policy_num_nodes=$NUM_NODES \
    trainer.placement.policy_num_gpus_per_node=$NUM_GPUS_PER_NODE \
    trainer.policy.fsdp_config.fsdp_size=$NUM_GPUS_PER_NODE \
    generator.num_inference_engines=$NUM_INFERENCE_ENGINES \
    generator.inference_engine_tensor_parallel_size=$INFERENCE_ENGINE_TENSOR_PARALLEL_SIZE \
    trainer.epochs=$EPOCHS \
    trainer.algorithm.tau_pos=$TAU_POS \
    trainer.algorithm.tau_neg=$TAU_NEG \
    trainer.eval_batch_size=$EVAL_BATCH_SIZE \
    trainer.eval_before_train=$EVAL_BEFORE_TRAIN \
    trainer.eval_interval=$EVAL_INTERVAL \
    trainer.update_epochs_per_batch=$UPDATE_EPOCHS_PER_BATCH \
    trainer.train_batch_size=$TRAIN_BATCH_SIZE \
    trainer.policy_mini_batch_size=$MINI_BATCH_SIZE \
    trainer.micro_forward_batch_size_per_gpu=$MICRO_FORWARD_BATCH_SIZE \
    trainer.micro_train_batch_size_per_gpu=$MICRO_TRAIN_BATCH_SIZE \
    trainer.ckpt_interval=$CKPT_INTERVAL \
    trainer.max_prompt_length=$MAX_PROMPT_LENGTH \
    generator.sampling_params.max_generate_length=$MAX_RESPONSE_LENGTH \
    trainer.policy.optimizer_config.lr=$LR \
    trainer.policy.optimizer_config.num_warmup_steps=$NUM_WARMUP_STEPS \
    trainer.policy.optimizer_config.weight_decay=$WEIGHT_DECAY \
    trainer.policy.optimizer_config.max_grad_norm=$MAX_GRAD_NORM \
    generator.backend="$GENERATOR_BACKEND" \
    generator.run_engines_locally=$RUN_ENGINES_LOCALLY \
    generator.weight_sync_backend="$WEIGHT_SYNC_BACKEND" \
    generator.async_engine=$ASYNC_ENGINE \
    generator.batched=$BATCHED \
    environment.env_class="$ENV_CLASS" \
    generator.n_samples_per_prompt=$N_SAMPLES_PER_PROMPT \
    generator.eval_n_samples_per_prompt=$EVAL_N_SAMPLES_PER_PROMPT \
    generator.gpu_memory_utilization=$GPU_MEMORY_UTILIZATION \
    trainer.logger="$LOGGER" \
    trainer.project_name="$PROJECT_NAME" \
    trainer.run_name="$RUN_NAME" \
    trainer.export_path="$EXPORT_PATH" \
    trainer.hf_save_interval=$HF_SAVE_INTERVAL \
    trainer.resume_mode="$RESUME_MODE" \
    trainer.max_ckpts_to_keep=$MAX_CKPTS_TO_KEEP \
    trainer.ckpt_path="$CKPT_PATH" \
    $@
This srun command has a very long list of command-line arguments, which makes the script difficult to read and maintain. Since the training script uses Hydra, a better practice would be to define these parameters in a dedicated Hydra configuration file (e.g., conf/my_experiment.yaml). The shell script would then only need to override a few dynamic values (like file paths or resource counts), making it much cleaner and the experiment configuration more reusable and version-controlled.
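As a sketch only, assuming a hypothetical conf/dapo_h200.yaml that collects the static hyperparameters and is selected via Hydra's standard --config-name flag, the launch command could shrink to just the run-specific values:

# Sketch assuming a hypothetical conf/dapo_h200.yaml holding the static
# hyperparameters; only run-specific values are overridden on the CLI.
srun --overlap --nodes=$NNODES --ntasks=1 -w "$head_node" \
    uv run --python "$PYTHON" --active --extra vllm \
    -m examples.algorithms.dapo.main_dapo \
    --config-name dapo_h200 \
    data.train_data="['$TRAIN_FILE']" \
    data.val_data="['$TEST_FILE']" \
    trainer.placement.policy_num_nodes=$NUM_NODES \
    trainer.placement.policy_num_gpus_per_node=$NUM_GPUS_PER_NODE \
    trainer.run_name="$RUN_NAME" \
    trainer.ckpt_path="$CKPT_PATH" \
    "$@"

The YAML file would then live in version control next to the example, so an experiment is reproducible from the config name plus these few overrides.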
This PR adds a full multi-node Slurm training script for running SkyRL on HPC clusters. It addresses issue #709. The script is optimized for NVIDIA H200 GPUs.
The script: