Description
Hello,
If I run prime on different numbers of nodes, e.g. 3 nodes with 2 ranks each, nnodes is always 1. It doesn't matter how many nodes I use; the value never changes.
I execute the code with:
GLOO_SOCKET_IFNAME=xy GLOBAL_ADDR=xy GLOBAL_RANK=0 GLOBAL_UNIQUE_ID=0 GLOBAL_WORLD_SIZE=3 GLOBAL_PORT=8095 uv run torchrun --nproc_per_node=2 src/zeroband/train.py @configs/debug/diloco.toml
On the other nodes I change GLOBAL_RANK and GLOBAL_UNIQUE_ID accordingly (see PrimeIntellect-ai/prime#173 (comment)).
The value is used here:
https://github.com/PrimeIntellect-ai/prime/blob/d57965b04574262815a174246707afd9615eed0c/src/zeroband/comms.py#L66-L70
nnodes is created here,
and if I print self.world_size, it always equals local_world_size, so nnodes is 1.
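To illustrate what I am seeing, here is a minimal sketch. It assumes nnodes is derived as world_size // local_world_size from the WORLD_SIZE and LOCAL_WORLD_SIZE environment variables that torchrun sets; the helper name compute_nnodes and the example dictionaries are hypothetical and not taken from the prime code.

```python
def compute_nnodes(env):
    """Derive the node count the way I assume it is done (hypothetical helper)."""
    world_size = int(env["WORLD_SIZE"])              # total ranks in this torchrun job
    local_world_size = int(env["LOCAL_WORLD_SIZE"])  # ranks on this node
    return world_size // local_world_size


# Each node started on its own with --nproc_per_node=2 and no --nnodes:
# torchrun treats it as a one-node job, so both values are 2.
per_node_env = {"WORLD_SIZE": "2", "LOCAL_WORLD_SIZE": "2"}

# For comparison: one torchrun job spanning 3 nodes with 2 ranks each.
joint_env = {"WORLD_SIZE": "6", "LOCAL_WORLD_SIZE": "2"}

print(compute_nnodes(per_node_env))  # 1
print(compute_nnodes(joint_env))     # 3
```

If that assumption holds, the value 1 would follow directly from starting a separate torchrun job on every node.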
Is this correct, or am I missing something?
Edit:
I guess it is correct, because each torchrun job currently runs on only one node as a peer, so the --rdzv_endpoint is always on the same node?