[Documentation] nixlbench ETCD barrier synchronization timing guidance needed #1022

@dmvevents

Description

Problem

nixlbench fails with ETCD barrier synchronization errors when the nodes are not started within a few seconds of each other:

Error in barrier wait... rank 0 completed 1/2 ranks)
Failed to synchronize at start barrier

Environment

  • Platform: AWS SageMaker HyperPod EKS
  • Setup: 2-node H100 configuration with ETCD coordination
  • Network: EFA-enabled

Details

When nixlbench is started on multiple nodes with kubectl exec, ETCD barrier synchronization fails if there is a delay of more than 3 seconds between node startups (or if AWS credentials expire between calls).

Suggestion

Add documentation guidance:

  1. Nodes should start within 3 seconds of each other
  2. Recommend using Kubernetes Jobs for simultaneous startup
  3. Document ETCD barrier timeout settings

Workaround

Start both nodes within 3 seconds:

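kubectl exec node1 -- nixlbench ... &   # run in the background so node2 can start right away
sleep 1
kubectl exec node2 -- nixlbench ...
wait   # wait for the backgrounded node1 command to finish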

Alternatively, use Kubernetes Jobs so that both nodes start at the same time.

This was implemented successfully with a Kubernetes StatefulSet plus a coordination script.
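
As a rough illustration of what such a coordination step might look like (this is not the script from the reference below, and it is not part of nixlbench), each pod could sleep until a shared wall-clock start time before launching. The names sleep_until and START_AT are hypothetical, and the sketch assumes the node clocks are synchronized:

```shell
#!/bin/sh
# Sketch only: sleep_until and START_AT are hypothetical names, not part of nixlbench.
# Assumes all nodes have synchronized clocks (for example via NTP or chrony).

# Block until the wall clock reaches the given Unix epoch second.
# Returns immediately if that time has already passed.
sleep_until() {
  su_target="$1"
  su_now="$(date +%s)"
  if [ "$su_target" -gt "$su_now" ]; then
    sleep "$(( su_target - su_now ))"
  fi
}

# Intended use inside each pod, with START_AT set to the same value on every node:
#   sleep_until "$START_AT"
#   exec nixlbench ...
```

Because date +%s has one-second resolution, each node wakes up less than one second after the target, so the startup skew should stay well under 3 seconds as long as the clocks agree and every pod reaches sleep_until before START_AT. If a pod arrives late, the helper returns immediately and the skew is not corrected.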

Reference: https://github.com/dmvevents/dynamo-workshop/blob/main/NIXLBENCH_TESTING.md
