ISSUE: The following stack dump was found in NeMo 2.0’s nemo-api-gke-torchrun.
/opt/pytorch/pytorch/third_party/gloo/gloo/transport/tcp/socket.cc:142] rv != -1. -1 vs -1. connect: Network is unreachable.
Details:
- The error suggests that the Gloo backend was unable to establish connections due to network unreachability.
- This typically occurs when Gloo cannot resolve the hostname or establish communication between nodes due to improper network interface configurations or restricted network access.
Missing Configuration: GLOO_SOCKET_IFNAME is not set explicitly, which further limits Gloo's ability to determine a valid interface.
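A minimal sketch of setting GLOO_SOCKET_IFNAME from Python before the process group is initialized. The helper names pick_interface and configure_gloo_interface and the preference for eth0 are assumptions for illustration, not part of NeMo or PyTorch; the correct interface name depends on the node image.

```python
import os
import socket


def pick_interface(names, preferred=("eth0",)):
    """Return the first preferred name found in `names`,
    else the first non-loopback name, else None."""
    for name in preferred:
        if name in names:
            return name
    for name in names:
        if name != "lo":
            return name
    return None


def configure_gloo_interface(names=None, env=None):
    """Set GLOO_SOCKET_IFNAME unless it is already set; return its value."""
    env = os.environ if env is None else env
    if "GLOO_SOCKET_IFNAME" in env:
        # Respect an explicit setting from the launch script or pod spec.
        return env["GLOO_SOCKET_IFNAME"]
    if names is None:
        # Enumerate the interfaces visible to this process.
        names = [name for _, name in socket.if_nameindex()]
    iface = pick_interface(names)
    if iface is not None:
        env["GLOO_SOCKET_IFNAME"] = iface
    return iface
```

This should run before torch.distributed.init_process_group is called. Exporting the variable in the launch script or pod spec is an equivalent alternative.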
In addition, we found that the GKE network created by cluster-toolkit has an incorrect routing table.
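To check for a routing problem of this kind independently of Gloo, a plain TCP connect from one node to another can be tried. This is only a sketch: probe_tcp is a hypothetical helper, and the host and port to test are whatever the job uses for rendezvous.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Try a TCP connect to (host, port).

    Returns None on success, otherwise the OSError that was raised.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return None
    except OSError as exc:
        return exc
```

A returned error whose errno equals errno.ENETUNREACH corresponds to the "Network is unreachable" message in the log above.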