Describe the bug
The DPO example fails during policy worker initialization with:
Cuda failure 217 'peer access is not supported between these two devices'
As a sanity check, I created a toy PyTorch script that broadcasts tensors across the GPUs, and it runs without any issue. I also tried setting the NCCL_P2P_DISABLE environment variable to 1, but I still see the same failure.
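In case it is useful, here is a rough sketch of the standalone broadcast test (the exact script differed slightly, but it was along these lines, launched with torchrun --nproc_per_node=8 toy_broadcast.py):

# toy_broadcast.py -- minimal sketch of the standalone broadcast test
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK/WORLD_SIZE/LOCAL_RANK/MASTER_ADDR/MASTER_PORT for us
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    rank = dist.get_rank()

    # rank 0 fills the tensor; every other rank receives it over NCCL
    t = torch.full((1024,), float(rank), device="cuda")
    dist.broadcast(t, src=0)
    print(f"rank {rank}: first values {t[:4].tolist()}")  # all ranks should print 0.0

    dist.destroy_process_group()

if __name__ == "__main__":
    main()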
Steps/Code to reproduce bug
- Pull image: nvcr.io/nvidia/nemo-rl:v0.4.0
- Run docker:
docker run --rm -it --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/nemo-rl:v0.4.0 /bin/bash
- Execute the following command inside the container:
uv run python examples/run_dpo.py \
policy.model_name="meta-llama/Llama-3.1-8B-Instruct" \
policy.train_global_batch_size=256 \
cluster.gpus_per_node=8
Expected behavior
I expected training to proceed and training-related metrics to be displayed.
Additional context
Machine type: AWS EC2 g6.48xlarge instance (8× NVIDIA L4 GPUs).
Topology:
(base) [ec2-user@ip-172-31-32-14 ~]$ nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NODE NODE NODE SYS SYS SYS SYS 0-47,96-143 0 N/A
GPU1 NODE X NODE NODE SYS SYS SYS SYS 0-47,96-143 0 N/A
GPU2 NODE NODE X NODE SYS SYS SYS SYS 0-47,96-143 0 N/A
GPU3 NODE NODE NODE X SYS SYS SYS SYS 0-47,96-143 0 N/A
GPU4 SYS SYS SYS SYS X NODE NODE NODE 48-95,144-191 1 N/A
GPU5 SYS SYS SYS SYS NODE X NODE NODE 48-95,144-191 1 N/A
GPU6 SYS SYS SYS SYS NODE NODE X NODE 48-95,144-191 1 N/A
GPU7 SYS SYS SYS SYS NODE NODE NODE X 48-95,144-191 1 N/A
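If it helps narrow this down, here is a small check I can run inside the container (my own sketch, using torch.cuda.can_device_access_peer) to list which GPU pairs do not report CUDA peer access; given the SYS links above, I would expect the cross-NUMA pairs to show up here:

# peer_check.py -- list GPU pairs without CUDA peer access (illustrative sketch)
import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i == j:
            continue
        if not torch.cuda.can_device_access_peer(i, j):
            print(f"GPU{i} -> GPU{j}: CUDA peer access NOT supported")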
Stack Trace:
(DTensorPolicyWorker pid=123067) Initializing DTensorPolicyWorker with is_vlm=False [repeated 7x across cluster]
(DTensorPolicyWorker pid=122863) [Rank 0] Loading model meta-llama/Llama-3.1-8B-Instruct on CPU... [repeated 8x across cluster]
(DTensorPolicyWorker pid=122863) [Rank 0] Initializing empty model for FSDP... [repeated 7x across cluster]
(DTensorPolicyWorker pid=123067) [Rank 7] Loading state dict from rank 0... [repeated 6x across cluster]
(DTensorPolicyWorker pid=122863) /opt/ray_venvs/nemo_rl.models.policy.dtensor_policy_worker.DTensorPolicyWorker/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py:859: UserWarning: `_get_pg_default_device` will be deprecated, it only stays for backward-compatiblity reason. If you need to find a device for object collectives, please use `_get_object_coll_device`. If you need to query the device types supported by group, please use `_device_capability(group)`.
(DTensorPolicyWorker pid=122863) warnings.warn(
Traceback (most recent call last):
File "/opt/nemo-rl/examples/run_dpo.py", line 294, in <module>
main()
File "/opt/nemo-rl/examples/run_dpo.py", line 278, in main
) = setup(config, tokenizer, train_dataset, val_dataset)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/nemo-rl/nemo_rl/algorithms/dpo.py", line 252, in setup
policy.print_node_ip_and_gpu_id()
File "/opt/nemo-rl/nemo_rl/models/policy/lm_policy.py", line 784, in print_node_ip_and_gpu_id
results = ray.get(
^^^^^^^^
File "/opt/nemo_rl_venv/lib/python3.12/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/opt/nemo_rl_venv/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/nemo_rl_venv/lib/python3.12/site-packages/ray/_private/worker.py", line 2822, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/nemo_rl_venv/lib/python3.12/site-packages/ray/_private/worker.py", line 932, in get_objects
raise value
ray.exceptions.ActorDiedError: The actor died because of an error raised in its creation task, ray::lm_policy-0-7:DTensorPolicyWorker.__init__() (pid=123067, ip=172.17.0.2, actor_id=3cd94e421f161668c11abd2901000000, repr=DTensorPolicyWorker[rank=7])
File "/opt/nemo-rl/nemo_rl/models/policy/dtensor_policy_worker.py", line 356, in __init__
set_model_state_dict(
File "/opt/ray_venvs/nemo_rl.models.policy.dtensor_policy_worker.DTensorPolicyWorker/lib/python3.12/site-packages/torch/distributed/checkpoint/state_dict.py", line 1277, in set_model_state_dict
return _load_model_state_dict(model, model_state_dict, info)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/ray_venvs/nemo_rl.models.policy.dtensor_policy_worker.DTensorPolicyWorker/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/ray_venvs/nemo_rl.models.policy.dtensor_policy_worker.DTensorPolicyWorker/lib/python3.12/site-packages/torch/distributed/checkpoint/state_dict.py", line 590, in _load_model_state_dict
_broadcast_state_dict(
File "/opt/ray_venvs/nemo_rl.models.policy.dtensor_policy_worker.DTensorPolicyWorker/lib/python3.12/site-packages/torch/distributed/_state_dict_utils.py", line 614, in _broadcast_state_dict
dist.broadcast_object_list(broadcast_list, src=0, group=pg)
File "/opt/ray_venvs/nemo_rl.models.policy.dtensor_policy_worker.DTensorPolicyWorker/lib/python3.12/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/ray_venvs/nemo_rl.models.policy.dtensor_policy_worker.DTensorPolicyWorker/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 3483, in broadcast_object_list
broadcast(object_sizes_tensor, src=global_src, group=group)
File "/opt/ray_venvs/nemo_rl.models.policy.dtensor_policy_worker.DTensorPolicyWorker/lib/python3.12/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/ray_venvs/nemo_rl.models.policy.dtensor_policy_worker.DTensorPolicyWorker/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 2714, in broadcast
work = group.broadcast([tensor], opts)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3356, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.26.2
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 217 'peer access is not supported between these two devices'
Please let me know if you need more information from my side.