
[QUESTION] torch broadcast error #62

@sallyjunjun

Description

When I run run_open_llama_w_vescale.py with torch 2.5.1+cu124, I hit the following error:

[rank4]: Traceback (most recent call last):
[rank4]: File "/code/veScale/examples/open_llama_4D_benchmark/run_open_llama_w_vescale-ljx.py", line 104, in
[rank4]: vescale_model = parallelize_module(model, device_mesh["TP"], sharding_plan)
[rank4]: File "/code/veScale/vescale/dmodule/api.py", line 276, in parallelize_module
[rank4]: DModule.init_parameters(module, is_model_sharded)
[rank4]: File "/code/veScale/vescale/dmodule/_dmodule.py", line 302, in init_parameters
[rank4]: buffer = DModule._distribute_parameter(buffer, module._device_mesh, buffer_pi, is_sharded)
[rank4]: File "/miniconda3-new/envs/llm-cuda12.4-vescale/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank4]: return func(*args, **kwargs)
[rank4]: File "/code/veScale/vescale/dmodule/_dmodule.py", line 266, in _distribute_parameter
[rank4]: dt = distribute_tensor(t, device_mesh, pi.placements)
[rank4]: File "/code/veScale/vescale/dtensor/api.py", line 252, in distribute_tensor
[rank4]: local_tensor = _replicate_tensor(local_tensor, device_mesh, idx)
[rank4]: File "/code/veScale/vescale/dtensor/redistribute.py", line 191, in _replicate_tensor
[rank4]: tensor = mesh_broadcast(tensor, mesh, mesh_dim=mesh_dim)
[rank4]: File "/code/veScale/vescale/dtensor/_collective_utils.py", line 273, in mesh_broadcast
[rank4]: aysnc_tensor = funcol.broadcast(tensor, src=src_for_dim, group=dim_group)
[rank4]: File "/miniconda3-new/envs/llm-cuda12.4-vescale/lib/python3.10/site-packages/torch/distributed/_functional_collectives.py", line 153, in broadcast
[rank4]: tensor = torch.ops._c10d_functional.broadcast(self, src, group_name)
[rank4]: File "/miniconda3-new/envs/llm-cuda12.4-vescale/lib/python3.10/site-packages/torch/_ops.py", line 1116, in call
[rank4]: return self._op(*args, **(kwargs or {}))
[rank4]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2777, invalid argument (run with NCCL_DEBUG=WARN for details), NCCL version 2.21.5
[rank4]: ncclInvalidArgument: Invalid value for an argument.
[rank4]: Last error:
[rank4]: Broadcast : invalid root 4 (root should be in the 0..4 range)

Is this because this torch version is not compatible with veScale?
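
The NCCL message says the broadcast root is out of range for the process group: mesh_broadcast passes src_for_dim = 4 (which looks like a global rank) to funcol.broadcast, while NCCL indexes the root within the group (0..group_size-1). A minimal sketch of that hypothesis follows, assuming torch 2.5's functional broadcast expects a group-local root; broadcast_from_global_rank is an illustrative name, not a veScale API:

```python
import torch
import torch.distributed as dist
import torch.distributed._functional_collectives as funcol

def broadcast_from_global_rank(
    tensor: torch.Tensor, global_src: int, group: dist.ProcessGroup
) -> torch.Tensor:
    # Hypothesis: NCCL indexes the root *within* the group, so a global
    # rank such as 4 in the subgroup [4, 5, 6, 7] must be translated to
    # its group-local rank (0 here) before the broadcast is issued.
    group_src = dist.get_group_rank(group, global_src)
    return funcol.broadcast(tensor, src=group_src, group=group)
```

If converting src this way in vescale/dtensor/_collective_utils.py makes the error disappear, that would point to a src-rank semantics difference between the torch version veScale targets and 2.5.1. Rerunning with NCCL_DEBUG=WARN, as the error message suggests, should show more detail either way.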
