-
Notifications
You must be signed in to change notification settings - Fork 183
Open
Description
Problem
nixlbench fails with ETCD barrier synchronization errors when nodes don't start within close proximity:
Error in barrier wait... rank 0 completed 1/2 ranks)
Failed to synchronize at start barrier
Environment
- Platform: AWS SageMaker HyperPod EKS
- Setup: 2-node H100 configuration with ETCD coordination
- Network: EFA-enabled
Details
When using kubectl exec to start nixlbench on multiple nodes, if there's >3 seconds delay between node startups (or AWS credentials expire between calls), ETCD barrier synchronization fails.
Suggestion
Add documentation guidance:
- Nodes should start within 3 seconds of each other
- Recommend using Kubernetes Jobs for simultaneous startup
- Document ETCD barrier timeout settings
Workaround
Start both nodes within 3 seconds:
kubectl exec node1 -- nixlbench ... &
sleep 1
kubectl exec node2 -- nixlbench ...Or use K8s Jobs for proper simultaneous startup.
Successfully implemented with K8s StatefulSet + coordination script.
Reference: https://github.com/dmvevents/dynamo-workshop/blob/main/NIXLBENCH_TESTING.md
Metadata
Metadata
Assignees
Labels
No labels