Description
When I run run_open_llama_w_vescale.py with torch 2.5.1+cu124, I get the following error:
```
[rank4]: Traceback (most recent call last):
[rank4]:   File "/code/veScale/examples/open_llama_4D_benchmark/run_open_llama_w_vescale-ljx.py", line 104, in <module>
[rank4]:     vescale_model = parallelize_module(model, device_mesh["TP"], sharding_plan)
[rank4]:   File "/code/veScale/vescale/dmodule/api.py", line 276, in parallelize_module
[rank4]:     DModule.init_parameters(module, is_model_sharded)
[rank4]:   File "/code/veScale/vescale/dmodule/_dmodule.py", line 302, in init_parameters
[rank4]:     buffer = DModule._distribute_parameter(buffer, module._device_mesh, buffer_pi, is_sharded)
[rank4]:   File "/miniconda3-new/envs/llm-cuda12.4-vescale/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank4]:     return func(*args, **kwargs)
[rank4]:   File "/code/veScale/vescale/dmodule/_dmodule.py", line 266, in _distribute_parameter
[rank4]:     dt = distribute_tensor(t, device_mesh, pi.placements)
[rank4]:   File "/code/veScale/vescale/dtensor/api.py", line 252, in distribute_tensor
[rank4]:     local_tensor = _replicate_tensor(local_tensor, device_mesh, idx)
[rank4]:   File "/code/veScale/vescale/dtensor/redistribute.py", line 191, in _replicate_tensor
[rank4]:     tensor = mesh_broadcast(tensor, mesh, mesh_dim=mesh_dim)
[rank4]:   File "/code/veScale/vescale/dtensor/_collective_utils.py", line 273, in mesh_broadcast
[rank4]:     aysnc_tensor = funcol.broadcast(tensor, src=src_for_dim, group=dim_group)
[rank4]:   File "/miniconda3-new/envs/llm-cuda12.4-vescale/lib/python3.10/site-packages/torch/distributed/_functional_collectives.py", line 153, in broadcast
[rank4]:     tensor = torch.ops._c10d_functional.broadcast(self, src, group_name)
[rank4]:   File "/miniconda3-new/envs/llm-cuda12.4-vescale/lib/python3.10/site-packages/torch/_ops.py", line 1116, in __call__
[rank4]:     return self._op(*args, **(kwargs or {}))
[rank4]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2777, invalid argument (run with NCCL_DEBUG=WARN for details), NCCL version 2.21.5
[rank4]: ncclInvalidArgument: Invalid value for an argument.
[rank4]: Last error:
[rank4]: Broadcast : invalid root 4 (root should be in the 0..4 range)
```
Is this because this torch version is not compatible with veScale?
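For reference, if I read the NCCL message right, the failing process group has fewer ranks than the world, and `src=4` looks like my global rank leaking in where NCCL expects a group-local root. Below is my own minimal sketch, not veScale code (the script name, 2-GPU layout, and subgroup choice are my assumptions), showing how a global rank can be translated to a group-local one with `torch.distributed.get_group_rank` before calling `funcol.broadcast`:

```python
# repro_broadcast_root.py -- my own minimal sketch, not veScale code.
# Launch with: torchrun --nproc_per_node=2 repro_broadcast_root.py
import os

import torch
import torch.distributed as dist
import torch.distributed._functional_collectives as funcol


def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # A subgroup containing only global rank 1. Inside this group the only
    # valid NCCL root is 0, so passing the global rank (1) straight through
    # as `src` would be out of range -- the same shape of failure as above.
    group = dist.new_group(ranks=[1])  # new_group is collective: all ranks call it

    if rank == 1:
        t = torch.ones(4, device="cuda")
        # Translate the global rank into its rank within the subgroup first.
        src_local = dist.get_group_rank(group, 1)  # -> 0
        out = funcol.broadcast(t, src=src_local, group=group)
        out = funcol.wait_tensor(out)  # funcol ops are async; wait explicitly
        print(f"[rank {rank}] broadcast ok: {out.tolist()}")

    dist.barrier()
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

If veScale's `mesh_broadcast` computes `src_for_dim` in one rank numbering while torch 2.5's native `_c10d_functional.broadcast` forwards `src` directly as the NCCL root, that would explain why the same script works on older torch, but I'm not certain that's the actual change.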