Gloo Backend Initialization Failure in Kubernetes Clusters Created with Gcluster toolkit #118

@Balaji-Natesan

Description

ISSUE: The following stack dump was found in NeMo 2.0's nemo-api-gke-torchrun:
/opt/pytorch/pytorch/third_party/gloo/gloo/transport/tcp/socket.cc:142] rv != -1. -1 vs -1. connect: Network is unreachable.

Details:

  • The error suggests that the Gloo backend was unable to establish connections due to network unreachability.

  • This typically occurs when Gloo cannot resolve the hostname or establish communication between nodes due to improper network interface configurations or restricted network access.
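To narrow down which of these causes applies, a connection probe run from a worker pod can help. The following is a minimal sketch, not part of NeMo or Gloo: the helper names `classify_connect_error` and `probe` are hypothetical, and the cause labels are only a coarse guide. It separates a name-resolution failure from a missing route (`ENETUNREACH`, the errno behind "Network is unreachable").

```python
import errno
import socket


def classify_connect_error(exc: OSError) -> str:
    """Map a connection failure to a coarse cause label (hypothetical helper)."""
    # socket.gaierror is a subclass of OSError, so it must be checked first.
    if isinstance(exc, socket.gaierror):
        return "name-resolution"
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "timeout"
    if exc.errno == errno.ENETUNREACH:
        return "no-route"  # the kernel has no route to the destination network
    if exc.errno == errno.EHOSTUNREACH:
        return "host-unreachable"
    if exc.errno == errno.ECONNREFUSED:
        return "refused"  # the host was reached, but nothing listens on the port
    return "other"


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt a TCP connection and return "ok" or a cause label."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except OSError as exc:
        return classify_connect_error(exc)
```

For example, calling `probe(host, port)` with the rendezvous address of the rank 0 pod (assumed here to be whatever `MASTER_ADDR` and `MASTER_PORT` are set to in the job) and getting `"no-route"` would point to routing rather than DNS.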

Missing Configuration: GLOO_SOCKET_IFNAME is not set explicitly, which further limits Gloo's ability to determine a valid network interface.
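One possible workaround is to set GLOO_SOCKET_IFNAME in the training entrypoint before the process group is created. The sketch below is an assumption about how this could be done, not the fix adopted for this issue: `pick_interface` and `configure_gloo_interface` are hypothetical helpers, and the skipped prefixes are a guess at typical loopback and virtual interface names that should be adjusted per cluster.

```python
import os
import socket


def pick_interface(names, skip_prefixes=("lo", "docker", "veth")):
    """Return the first name that does not start with a skipped prefix, else None."""
    for name in names:
        if not name.startswith(skip_prefixes):
            return name
    return None


def configure_gloo_interface():
    """Set GLOO_SOCKET_IFNAME unless the caller has already set it."""
    if "GLOO_SOCKET_IFNAME" in os.environ:
        return os.environ["GLOO_SOCKET_IFNAME"]
    # socket.if_nameindex() yields (index, name) pairs for local interfaces.
    names = [name for _, name in socket.if_nameindex()]
    chosen = pick_interface(names)
    if chosen is not None:
        os.environ["GLOO_SOCKET_IFNAME"] = chosen
    return chosen


# Call configure_gloo_interface() before
# torch.distributed.init_process_group(backend="gloo") runs,
# since the variable is read when the Gloo backend is initialized.
```

Setting the variable in the pod spec or launcher environment achieves the same thing. An explicit value is preferable when the correct interface name is known, because the first non-skipped interface is not guaranteed to be the one that routes to the other nodes.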

In addition, we found that the GKE network created by cluster-toolkit has an incorrect routing table.
