
Qwen2.5-32B RL with GRPO fails on 16 B200 GPUs #1974

@pmotgi

Description

When I run an RL use case with the Qwen2.5-32B model on a Ray cluster with 16 B200 GPUs, step 1 of the RL cycle starts, but the job then fails during policy training without any specific error.

Steps/Code to reproduce bug

Run the training job with the Qwen2.5-32B model:

```shell
uv run python examples/run_grpo_math.py \
    --config examples/configs/grpo_math_8B.yaml \
    cluster.num_nodes=2 \
    cluster.gpus_per_node=8 \
    policy.model_name="Qwen/Qwen2.5-32B-Instruct" \
    policy.tokenizer.name="Qwen/Qwen2.5-32B-Instruct" \
    policy.generation.vllm_cfg.tensor_parallel_size=1 \
    policy.dtensor_cfg.tensor_parallel_size=1 \
    policy.generation.vllm_cfg.gpu_memory_utilization=0.5 \
    policy.generation.vllm_cfg.enforce_eager=True \
    policy.train_micro_batch_size=1 \
    grpo.num_prompts_per_step=16 \
    grpo.num_generations_per_prompt=32 \
    grpo.max_num_steps=10 \
    checkpointing.checkpoint_dir=/data/nemo_rl_qwen32b_gk_cp_2-13v2 \
    data.dataset_name=ResponseDataset \
    +data.train_data_path=openai/gsm8k \
    +data.val_data_path=openai/gsm8k \
    +data.val_split=test \
    +data.train_split=train \
    +data.subset="main" \
    +data.input_key="question" \
    +data.output_key="answer" \
    logger.wandb_enabled=True \
    logger.wandb.name='qwen2-32b-gsm8k-grpo-2nodes-2-13-v2'
```


Error:

```
(DTensorPolicyWorkerV2 pid=1908, ip=10.4.29.7) ) [repeated 78x across cluster]
+----------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                    Policy worker mapping to Nodes and GPUs                                                         |
+------------+----------------+----------------+----------------+----------------+----------------+----------------+----------------+----------------+
| Node_IP    | GPU_ID=0       | GPU_ID=1       | GPU_ID=2       | GPU_ID=3       | GPU_ID=4       | GPU_ID=5       | GPU_ID=6       | GPU_ID=7       |
+------------+----------------+----------------+----------------+----------------+----------------+----------------+----------------+----------------+
| 10.4.1.20  | ('worker-8',)  | ('worker-9',)  | ('worker-10',) | ('worker-11',) | ('worker-12',) | ('worker-13',) | ('worker-14',) | ('worker-15',) |
| 10.4.28.20 | ('worker-0',)  | ('worker-1',)  | ('worker-2',)  | ('worker-3',)  | ('worker-4',)  | ('worker-5',)  | ('worker-6',)  | ('worker-7',)  |
| 10.4.29.7  | ('worker-16',) | ('worker-17',) | ('worker-18',) | ('worker-19',) | ('worker-20',) | ('worker-21',) | ('worker-22',) | ('worker-23',) |
+------------+----------------+----------------+----------------+----------------+----------------+----------------+----------------+----------------+

Traceback (most recent call last):
  File "/opt/nemo-rl/examples/run_grpo_math.py", line 260, in <module>
    main()
  File "/opt/nemo-rl/examples/run_grpo_math.py", line 192, in main
    ) = setup(config, tokenizer, dataset, val_dataset)
  File "/opt/nemo-rl/nemo_rl/algorithms/grpo.py", line 600, in setup
    policy_generation.prepare_refit_info(state_dict_info)
  File "/opt/nemo-rl/nemo_rl/models/generation/vllm/vllm_generation.py", line 768, in prepare_refit_info
    ray.get(futures)
  File "/opt/nemo_rl_venv/lib/python3.12/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/opt/nemo_rl_venv/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return func(*args, **kwargs)
  File "/opt/nemo_rl_venv/lib/python3.12/site-packages/ray/_private/worker.py", line 2882, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/opt/nemo_rl_venv/lib/python3.12/site-packages/ray/_private/worker.py", line 970, in get_objects
    raise value
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
    class_name: VllmGenerationWorker
    actor_id: 19a81898b69b2993864d4e7702000000
    pid: 76181
    name: vllm_policy-0-0
    namespace: 4b619eeb-e644-4317-b775-c5c8267a6937
    ip: 10.4.1.20
The actor died because its node has died. Node Id: 8a3b4ee957e3c9fa7021a6618105fc976695fc9e1add1be6369889c6
    the actor's node was terminated unexpectedly: health check failed due to missing too many heartbeats
```
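The actor itself did not crash; Ray killed it because its whole node stopped answering GCS health checks. One way to distinguish transient heartbeat loss (e.g. a saturated network during refit) from the raylet actually dying (e.g. killed by the host OOM killer) is to raise Ray's health-check tolerances before starting the cluster. A hedged sketch — these are Ray's health-check environment variables, but their defaults and exact behavior can vary across Ray versions, so treat the values below as illustrative:

```shell
# Set on EVERY node before `ray start`. If the job then survives longer,
# heartbeats were being dropped transiently; if the node still dies at the
# same point, check `dmesg -T | grep -i oom` and the raylet logs on the
# dead node (10.4.1.20 here) for an out-of-memory kill instead.
export RAY_health_check_period_ms=30000        # probe nodes less often
export RAY_health_check_timeout_ms=120000      # allow slower responses
export RAY_health_check_failure_threshold=10   # tolerate more missed checks
```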

Expected behavior

All RL cycle training steps should complete, with a wandb report generated at the end.

Additional context

The llama3.1-8b model use case works, but any use case with qwen2.5-32b fails.
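One candidate explanation for the 8B-vs-32B difference is sheer training-state size: with `policy.dtensor_cfg.tensor_parallel_size=1`, a single unsharded replica of the 32B policy is very large, and host memory pressure during setup/refit can get the raylet OOM-killed, which surfaces exactly as the missed-heartbeat node death above. A back-of-the-envelope sketch, assuming bf16 weights/gradients and fp32 Adam state (whether and how the DTensor policy further shards this state across data-parallel ranks is not confirmed here):

```python
# Rough, unsharded training-memory estimate for a 32B-parameter policy.
# Assumes bf16 weights and gradients (2 bytes each) and fp32 Adam state
# (4-byte master weights + 4-byte momentum + 4-byte variance = 12 bytes/param).
params = 32e9

weights_gb = params * 2 / 1e9    # bf16 weights:    64 GB
grads_gb = params * 2 / 1e9      # bf16 gradients:  64 GB
optim_gb = params * 12 / 1e9     # fp32 Adam state: 384 GB

total_gb = weights_gb + grads_gb + optim_gb
print(f"{total_gb:.0f} GB before activations or the vLLM KV cache")  # 512 GB
```

The same arithmetic for an 8B model gives 128 GB, which sharding and offload can absorb far more easily, so it would be consistent with the 8B run succeeding while the 32B run kills a node.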
