
P2P issue when running DPO #1683

@dhineshkumar-r

Description

Describe the bug

Running the DPO example fails while broadcasting the model state dict, with:

Cuda failure 217 'peer access is not supported between these two devices'

I created a toy PyTorch script to broadcast tensors, and it works without any issue.

I also tried setting the NCCL_P2P_DISABLE environment variable to 1, but I still see the same issue.
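
For reference, the toy test looked roughly like this (a minimal sketch, not the exact script; it assumes a single node and is launched with torchrun --nproc_per_node=8 toy_broadcast.py):

# toy_broadcast.py -- minimal sketch of the standalone broadcast test
import os

import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK/WORLD_SIZE/LOCAL_RANK; NCCL picks them up here.
    # NCCL_P2P_DISABLE=1 was also set in the environment before launching,
    # with no change in the NeMo-RL run.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Broadcast a tensor from rank 0 to all other ranks over NCCL.
    t = torch.full((1024,), float(rank), device="cuda")
    dist.broadcast(t, src=0)
    torch.cuda.synchronize()
    print(f"rank {rank}: broadcast ok, t[0]={t[0].item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()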

Steps/Code to reproduce bug

  1. Pull image: nvcr.io/nvidia/nemo-rl:v0.4.0
  2. Run docker: docker run --rm -it --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/nemo-rl:v0.4.0 /bin/bash
  3. Execute the following command:

     uv run python examples/run_dpo.py \
       policy.model_name="meta-llama/Llama-3.1-8B-Instruct" \
       policy.train_global_batch_size=256 \
       cluster.gpus_per_node=8

Expected behavior

I expected to see training-related metrics displayed.

Additional context

Machine type: AWS EC2 g6.48xlarge instance.
Topology:

(base) [ec2-user@ip-172-31-32-14 ~]$ nvidia-smi topo -m
	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NODE	NODE	NODE	SYS	SYS	SYS	SYS	0-47,96-143	0		N/A
GPU1	NODE	 X 	NODE	NODE	SYS	SYS	SYS	SYS	0-47,96-143	0		N/A
GPU2	NODE	NODE	 X 	NODE	SYS	SYS	SYS	SYS	0-47,96-143	0		N/A
GPU3	NODE	NODE	NODE	 X 	SYS	SYS	SYS	SYS	0-47,96-143	0		N/A
GPU4	SYS	SYS	SYS	SYS	 X 	NODE	NODE	NODE	48-95,144-191	1		N/A
GPU5	SYS	SYS	SYS	SYS	NODE	 X 	NODE	NODE	48-95,144-191	1		N/A
GPU6	SYS	SYS	SYS	SYS	NODE	NODE	 X 	NODE	48-95,144-191	1		N/A
GPU7	SYS	SYS	SYS	SYS	NODE	NODE	NODE	 X 	48-95,144-191	1		N/A
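
The topology shows only NODE/SYS links between the GPUs (no NVLink). A quick sanity check of which pairs actually report CUDA peer-access support (a small hypothetical helper, run on the same node) could be:

# check_p2p.py -- print GPU pairs that lack CUDA peer-access support
import torch

def main():
    n = torch.cuda.device_count()
    for i in range(n):
        for j in range(n):
            if i != j and not torch.cuda.can_device_access_peer(i, j):
                print(f"GPU{i} -> GPU{j}: peer access not supported")

if __name__ == "__main__":
    main()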

Stack Trace:

(DTensorPolicyWorker pid=123067) Initializing DTensorPolicyWorker with is_vlm=False [repeated 7x across cluster]
(DTensorPolicyWorker pid=122863) [Rank 0] Loading model meta-llama/Llama-3.1-8B-Instruct on CPU... [repeated 8x across cluster]
(DTensorPolicyWorker pid=122863) [Rank 0] Initializing empty model for FSDP... [repeated 7x across cluster]
(DTensorPolicyWorker pid=123067) [Rank 7] Loading state dict from rank 0... [repeated 6x across cluster]
(DTensorPolicyWorker pid=122863) /opt/ray_venvs/nemo_rl.models.policy.dtensor_policy_worker.DTensorPolicyWorker/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py:859: UserWarning: `_get_pg_default_device` will be deprecated, it only stays for backward-compatiblity reason. If you need to find a device for object collectives, please use `_get_object_coll_device`. If you need to query the device types supported by group, please use `_device_capability(group)`.
(DTensorPolicyWorker pid=122863)   warnings.warn(
Traceback (most recent call last):
  File "/opt/nemo-rl/examples/run_dpo.py", line 294, in <module>
    main()
  File "/opt/nemo-rl/examples/run_dpo.py", line 278, in main
    ) = setup(config, tokenizer, train_dataset, val_dataset)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/nemo-rl/nemo_rl/algorithms/dpo.py", line 252, in setup
    policy.print_node_ip_and_gpu_id()
  File "/opt/nemo-rl/nemo_rl/models/policy/lm_policy.py", line 784, in print_node_ip_and_gpu_id
    results = ray.get(
              ^^^^^^^^
  File "/opt/nemo_rl_venv/lib/python3.12/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/nemo_rl_venv/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/nemo_rl_venv/lib/python3.12/site-packages/ray/_private/worker.py", line 2822, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/nemo_rl_venv/lib/python3.12/site-packages/ray/_private/worker.py", line 932, in get_objects
    raise value
ray.exceptions.ActorDiedError: The actor died because of an error raised in its creation task, ray::lm_policy-0-7:DTensorPolicyWorker.__init__() (pid=123067, ip=172.17.0.2, actor_id=3cd94e421f161668c11abd2901000000, repr=DTensorPolicyWorker[rank=7])
  File "/opt/nemo-rl/nemo_rl/models/policy/dtensor_policy_worker.py", line 356, in __init__
    set_model_state_dict(
  File "/opt/ray_venvs/nemo_rl.models.policy.dtensor_policy_worker.DTensorPolicyWorker/lib/python3.12/site-packages/torch/distributed/checkpoint/state_dict.py", line 1277, in set_model_state_dict
    return _load_model_state_dict(model, model_state_dict, info)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/ray_venvs/nemo_rl.models.policy.dtensor_policy_worker.DTensorPolicyWorker/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/ray_venvs/nemo_rl.models.policy.dtensor_policy_worker.DTensorPolicyWorker/lib/python3.12/site-packages/torch/distributed/checkpoint/state_dict.py", line 590, in _load_model_state_dict
    _broadcast_state_dict(
  File "/opt/ray_venvs/nemo_rl.models.policy.dtensor_policy_worker.DTensorPolicyWorker/lib/python3.12/site-packages/torch/distributed/_state_dict_utils.py", line 614, in _broadcast_state_dict
    dist.broadcast_object_list(broadcast_list, src=0, group=pg)
  File "/opt/ray_venvs/nemo_rl.models.policy.dtensor_policy_worker.DTensorPolicyWorker/lib/python3.12/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/ray_venvs/nemo_rl.models.policy.dtensor_policy_worker.DTensorPolicyWorker/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 3483, in broadcast_object_list
    broadcast(object_sizes_tensor, src=global_src, group=group)
  File "/opt/ray_venvs/nemo_rl.models.policy.dtensor_policy_worker.DTensorPolicyWorker/lib/python3.12/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/ray_venvs/nemo_rl.models.policy.dtensor_policy_worker.DTensorPolicyWorker/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 2714, in broadcast
    work = group.broadcast([tensor], opts)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3356, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.26.2
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 217 'peer access is not supported between these two devices'
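
The trace fails inside dist.broadcast_object_list while loading the model state dict, so a minimal repro of just that call path (a sketch, assuming the same 8 GPUs, launched as NCCL_DEBUG=INFO torchrun --nproc_per_node=8 repro_broadcast.py per the error message's suggestion) would be:

# repro_broadcast.py -- mirrors the collective that fails in _broadcast_state_dict
import os

import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Rank 0 broadcasts a list of Python objects; their sizes go over NCCL
    # as a CUDA tensor first, which is the broadcast shown in the trace.
    objs = [{"hello": "world"}] if rank == 0 else [None]
    dist.broadcast_object_list(objs, src=0)
    print(f"rank {rank}: received {objs[0]}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()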

Please let me know if you need more information from my side.
